The Elephant in the Room: Microbial Reference Genome Authentication and Traceability


World Microbe Forum

Virtual Event

June 21, 2021


Publicly available microbial genomics databases do not provide end-to-end traceability that connects reference genomes to physical materials held in a culture collection. This gap is often overlooked, but it becomes apparent when researchers discover critical incongruencies in metadata or inconsistencies in the reference genomes themselves. Despite this gap, the broader microbial genomics community continues to place a tremendous amount of trust in these databases. To begin to address it, the American Type Culture Collection (ATCC) has recently begun systematically sequencing its entire microbial collection, applying strict quality assurance to both sequencing methodologies and bioinformatics pipelines. To date, we have released over 1,200 genomes that are directly traceable to physical materials within ATCC’s holdings. The data are available to the research community via the ATCC Genome Portal, a F.A.I.R. database of reference genomes that includes closed or high-quality genome assemblies.

During the development of the ATCC Genome Portal, we surveyed the status and quality of existing reference genomes for ATCC strains that were already available in public databases. We identified errors in both sample metadata and genome assemblies for many genomes that are recognized as references. In this study, we selected 100 bacterial strains with complete assemblies in NCBI’s RefSeq database for analysis, based on strain name and publication impact. We compared these RefSeq assemblies to our own assemblies, which were generated from the physical materials in the ATCC collection. All samples were sequenced using both Illumina and Oxford Nanopore technologies and processed through ATCC’s in-house hybrid assembly pipelines. Of the 100 strains evaluated, 35 had more than 50 variants, 8 had more than 100 variants, and 2 had more than 50 kb of differences relative to the corresponding RefSeq assemblies. In addition, 20 strains had more than one genome assembly available in NCBI’s Assembly database, three of which showed differences in the plasmids reported for identically named ATCC strains.

Here, we present an initial investigation into the likely sources of these variations. We also provide recommendations on how public data repositories could improve sequence verification methods so that these errors can be reduced or potentially eliminated. Lastly, we present our roadmap to deliver high-quality, curated reference data via the ATCC Genome Portal.
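The tallies reported in the abstract (strains with more than 50 variants, more than 100 variants, or more than 50 kb of differences) can be illustrated with a minimal Python sketch. It is not ATCC's pipeline: the `StrainComparison` record, the `tally_discordance` function, and the example values are all hypothetical, and the sketch assumes each threshold is counted independently, so a strain may fall into more than one category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrainComparison:
    """Hypothetical summary of one RefSeq-vs-new-assembly comparison."""
    strain: str
    variant_count: int    # variants called between the two assemblies
    differing_bases: int  # total size of the differences, in bp


def tally_discordance(comparisons, variant_thresholds=(50, 100),
                      size_threshold_bp=50_000):
    """Count strains exceeding each threshold; categories may overlap."""
    tally = {
        f"more_than_{t}_variants": sum(c.variant_count > t for c in comparisons)
        for t in variant_thresholds
    }
    tally[f"more_than_{size_threshold_bp}_bp_different"] = sum(
        c.differing_bases > size_threshold_bp for c in comparisons
    )
    return tally


# Made-up values for illustration only; these are not data from the study.
example = [
    StrainComparison("strain_A", variant_count=3, differing_bases=120),
    StrainComparison("strain_B", variant_count=75, differing_bases=2_400),
    StrainComparison("strain_C", variant_count=140, differing_bases=61_000),
]

print(tally_discordance(example))
```

The thresholds are passed as parameters so the same function can be reused with different cutoffs; the defaults simply mirror the 50-variant, 100-variant, and 50 kb figures quoted above.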

Download the poster to explore the generation of high-quality, curated reference data




David Yarmosh, MS

Senior Bioinformatician, ATCC

David Yarmosh is a senior bioinformatician in ATCC’s Sequencing and Bioinformatics Center. He is a graduate of New York University’s Tandon School of Engineering. He has eight years of experience in large-scale data aggregation and analysis, including five years in microbial genomics with a focus on biosurveillance R&D efforts. David has led international training exercises in Peru and Senegal, sharing metagenomic analytical capabilities. His interests include genomics database construction, metadata collection, drug resistance mechanisms, bioinformatics standards, and machine learning.


Reference-quality sequences

Through the ATCC Genome Portal, you can easily search, access, and analyze hundreds of reference-quality genome sequences. Our optimized methodology is designed to achieve complete, contiguous, and (when biologically appropriate) circularized genomic elements by using short-read assembly for viruses and hybrid assembly for bacteria and fungi. We take the workflow one step further by accompanying each stage of the process with rigorous quality control analyses. Only data that pass all quality control criteria are published to the ATCC Genome Portal. Visit the portal today to find the high-quality data you need for your research.
