An End-to-End Pipeline for Characterization and Annotation of Traceable Bacterial Material
ASM Conference on Rapid Applied Microbial Next-Generation Sequencing and Bioinformatic Pipelines (ASM NGS 2022)
Baltimore, Maryland, United States
October 18, 2022
Well-characterized, high-quality genomics data are crucial for life science research. Laboratories often rely on publicly available data as a cornerstone of their experimental design. While such databases have grown exponentially through contributions from the scientific community, data provenance is often lacking, and the authenticity of the underlying materials is not assured.
American Type Culture Collection (ATCC) has discussed this issue before (Yarmosh, D. A. et al. Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies. mSphere 7, e00077-22 (2022)) and has developed the ATCC Genome Portal and an ongoing whole-genome sequencing (WGS) initiative to address this problem by producing genomics data that can be traced back to the source material. Because the source material is taken directly from ATCC’s repository, it has been authenticated and has undergone minimal passaging, which limits the opportunity for lab-induced mutations.
Several bioinformatics approaches exist for assembling and annotating WGS data; however, processing dozens of such assemblies in an automated, end-to-end fashion is rarely discussed in detail. Here, we present our own methodology, which uses a hybrid of Illumina and Oxford Nanopore (ONT) sequencing technologies, and we compare the results to earlier internal assemblies as well as to publicly available versions labelled as ATCC genomes.
In brief, materials in the repository are sequenced on Illumina MiSeq or NextSeq instruments along with ONT GridION instruments. The FASTQ files are checked for quality and filtered: low-quality reads and adapters are removed from the Illumina data, and a minimum read length is enforced for the ONT data. Kraken2 is used to classify each read, both to check for contamination and to bin the reads into the appropriate taxonomic group. Reads are downsampled and assembled with Unicycler. The resulting contigs are further checked for quality, and coverage statistics are generated. The NCBI Prokaryotic Genome Annotation Pipeline (PGAP, installed locally) is used to annotate the genomes. Post-processing QC involves evaluating assembly statistics, completeness, and assembly artifacts (such as inversions and unexpected repeats), followed by a BLAST search to verify genome identification (as some bacterial species can be very closely related).
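Two of the steps described above, the minimum read length filter for ONT reads and the assembly statistics used in post-processing QC, can be illustrated with a minimal Python sketch. This is not the ATCC pipeline's code: the function names (parse_fastq, filter_min_length, n50) are hypothetical, the parser assumes simple four-line FASTQ records with non-empty sequences, and any length threshold passed in is an arbitrary illustration rather than a value reported in the abstract.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# (header, sequence, quality) for one read; a hypothetical helper type.
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    Assumes every record is exactly four lines (header, sequence, '+',
    quality) and that no sequence is empty; blank lines are skipped.
    """
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, _plus, qual = stripped[i:i + 4]
        yield header, seq, qual


def filter_min_length(records: Iterable[FastqRecord],
                      min_len: int) -> List[FastqRecord]:
    """Keep only reads whose sequence is at least min_len bases long."""
    return [rec for rec in records if len(rec[1]) >= min_len]


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths.

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly.
    Returns 0 for an empty input.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

As a usage example, filter_min_length(parse_fastq(open_file), min_len) would be applied to the ONT reads before assembly, and n50 would be computed over the contig lengths of the finished assembly as one of several assembly statistics. In practice, dedicated tools typically handle both steps; the sketch only shows the underlying logic.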
This pipeline ensures that the genomic data accurately represents the material sent by ATCC with traceability all the way to the original deposited source.
Download the poster to learn about our bioinformatics pipeline for the characterization and annotation of microbial strains.
John Bagnoli, BS
Senior Manager, Bioinformatics, ATCC
John Bagnoli is a Senior Manager for Bioinformatics in the Sequencing and Bioinformatics Center (SBC) at ATCC. Prior to joining ATCC, he held positions at QIAGEN and MRIGlobal where he gained extensive experience in robotics, laboratory automation, oligonucleotide manufacturing, and bioinformatics. Mr. Bagnoli has a Bachelor of Science in Biochemical Pharmacology from the University at Buffalo.
Through the ATCC Genome Portal, you can easily search, access, and analyze thousands of reference-quality genome sequences. Our optimized methodology is designed to achieve complete, circularized (when biologically appropriate), and contiguous genomic elements by using short-read (viruses) and hybrid (bacteria and fungi) assembly techniques. We take our workflow one step further by accompanying each stage of the process with rigorous quality control analyses that ensure our data are of the highest quality possible. Only data that pass all quality control criteria are published to the ATCC Genome Portal. Visit the portal today to find the high-quality data you need for your research.