
Translational Ramifications of Crowd-sourced Genomics Data


Lake Arrowhead Microbial Genomics (LAMG) 2022

Lake Arrowhead, California, United States

September 12, 2022


Background: Public genomics databases serve a critical role in the life science research community. Although existing guidelines require metadata to accompany a given genome assembly, other relevant data (eg, sequencing platform, assembly method) are often incomplete or missing. Ultimately, this gap calls the reliability, traceability, and accuracy of the assembly data itself into question. Previously, Yarmosh et al. illustrated the impact of poor data provenance by comparing several publicly available assemblies to assemblies that had complete traceability. Public assemblies labeled as derivatives of ATCC source material tended to have fewer variants relative to their ATCC Standard Reference Genome (ASRG) counterparts. However, several of these assemblies still contained a large number of variants, including those inducing translational changes.

Methods: To better understand the consequences outlined in the previous study, the Prokaryotic Genome Annotation Pipeline was run on 190 public assemblies labeled as ATCC type material and their 127 corresponding ASRGs. Annotations from both sets were compared in terms of amino acid identity, gene count, gene identity, and the gain/loss of stop codons.
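As an illustration only, two of the comparisons named above (amino acid identity and the gain of stop codons) can be sketched in a few lines of Python. This is not the workflow used for the poster: the function names `percent_identity` and `first_internal_stop` are hypothetical, the identity calculation assumes the two protein sequences have already been aligned and ignores gapped columns, and the stop-codon check assumes an in-frame coding sequence with the usual TAA, TAG, and TGA stops.

```python
# Minimal sketch under stated assumptions; not the pipeline used in the study.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed stop codons for bacterial coding sequences


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over aligned columns where neither sequence has a gap.

    Assumes both inputs come from the same pairwise alignment ('-' marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def first_internal_stop(cds: str):
    """Return the 0-based codon index of the first in-frame stop before the final codon.

    Returns None when no internal stop is found. Assumes the sequence starts in frame.
    """
    seq = cds.upper()
    n_codons = len(seq) // 3
    for i in range(n_codons - 1):  # skip the final codon, the expected terminator
        if seq[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None


# Hypothetical example inputs, not data from the study.
identity = percent_identity("MKT-LV", "MKSALV")
truncated_at = first_internal_stop("ATGAAATAAGGGTGA")
intact = first_internal_stop("ATGAAAGGGTGA")
```

In practice, a gene in a public assembly would be paired with its ASRG counterpart before either check is applied, and the choice of how to treat gaps changes the identity value, so any threshold should be read against the definition in use.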

Results: Despite the claim of being assembled from type material, 25 of 190 public genomes contain premature stop codons, and over 35,000 of the annotations have less than 50% reciprocal identity relative to their ASRGs.

Discussion: Public genomics databases are unable to curate their immense library of submitted assembly data to ensure the utmost quality. The de facto usage of these databases in modern research coupled with the repercussions of incomplete metadata underscores the urgent need for a more stringent curation process, such that future research and public health are not burdened by unreliable data. Toward this goal, ATCC’s initiative to enhance the authentication of our products aims to provide high-quality reference genomic data directly from ATCC source material.

Download the poster to learn about the importance of data provenance and how ATCC is addressing this issue




David Yarmosh, MS

Lead Bioinformatician, ATCC

David Yarmosh is a senior bioinformatician in ATCC’s Sequencing and Bioinformatics Center. He’s a graduate of New York University’s Tandon School of Engineering. He has been working in large data aggregation and analysis since 2013 and microbial genomics with a focus on biosurveillance R&D efforts since 2016. David has led international training exercises in Peru and Senegal, sharing metagenomic analytical capabilities. His interests include genomics database construction, metadata collection, drug resistance mechanisms, bioinformatics standards, and machine learning. Since joining ATCC in 2020, he has helped develop the podcast Behind the Biology, which he now hosts.


Reference-quality sequences

Through the ATCC Genome Portal, you can easily search, access, and analyze thousands of reference-quality genome sequences. Our optimized methodology is designed to achieve complete, circularized (when biologically appropriate), and contiguous genomic elements by using short-read (virology collection) and hybrid (bacteriology, mycology, and protistology collections) assembly techniques. We then take our workflow one step further by accompanying each stage of the process with rigorous quality control analyses that ensure the highest quality data. Only data that pass all quality control criteria are published to the ATCC Genome Portal. Visit the portal today to find the high-quality data you need for your research.
