Translational Ramifications of Crowd-sourced Genomics Data
Lake Arrowhead Microbial Genomics (LAMG) 2022
Lake Arrowhead, California, United States
September 12, 2022
Background: Public genomics databases serve a critical role in the life science research community. Although existing guidelines require that metadata accompany a given genome assembly, other relevant data (e.g., sequencing platform, assembly method) are often incomplete or missing. Ultimately, this gap renders the assembly data itself questionable in terms of reliability, traceability, and accuracy. Previously, Yarmosh et al. illustrated the impact of poor data provenance by comparing several publicly available assemblies to assemblies that had complete traceability. Public assemblies labeled as derivatives of ATCC source material tended to have fewer variants relative to their ATCC Standard Reference Genome (ASRG) counterparts. However, several of these assemblies still contained a large number of variants, including those inducing translational changes.
Methods: To better understand the consequences outlined in a previous study, the Prokaryotic Genome Annotation Pipeline was run on 190 public assemblies labeled as ATCC type material and their 127 corresponding ASRGs. Annotations from both sets were compared in terms of amino acid identity, gene count, gene identity, and the gain/loss of stop codons.
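The kind of per-gene comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it does not run the Prokaryotic Genome Annotation Pipeline, it assumes the protein pairs have already been aligned (gaps written as "-", stops as "*"), and it uses a simple column-wise identity as a stand-in for the reciprocal identity reported in the abstract, whose exact definition is not given here. The function names and the 50% default threshold are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns with identical residues.

    Columns where both sequences have a gap are ignored; a gap
    against a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def has_premature_stop(protein: str) -> bool:
    """True if a stop symbol '*' occurs before the final position."""
    ungapped = protein.replace("-", "")
    return "*" in ungapped[:-1]


def flag_genes(
    pairs: Iterable[Tuple[str, str, str]],
    min_identity: float = 50.0,  # assumed threshold, mirroring the 50% figure
) -> Dict[str, List[str]]:
    """Flag genes by identity and by stop codons gained in the public copy.

    Each item in `pairs` is (gene_id, public_aligned, reference_aligned).
    """
    flagged: Dict[str, List[str]] = {"low_identity": [], "gained_stop": []}
    for gene_id, public_seq, reference_seq in pairs:
        if percent_identity(public_seq, reference_seq) < min_identity:
            flagged["low_identity"].append(gene_id)
        if has_premature_stop(public_seq) and not has_premature_stop(reference_seq):
            flagged["gained_stop"].append(gene_id)
    return flagged


if __name__ == "__main__":
    # Toy, made-up sequences purely for illustration.
    example_pairs = [
        ("geneA", "MKTLA*", "MKTLA*"),
        ("geneB", "M*TLA*", "MKTLA*"),
        ("geneC", "MAAAG*", "MKTLA*"),
    ]
    print(flag_genes(example_pairs))
```

In practice the aligned pairs would come from an ortholog-matching and alignment step that this sketch leaves out, and gene count and gene identity comparisons would be handled separately.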
Results: Despite being labeled as assembled from type material, 25 of the 190 public genomes contain premature stop codons, and over 35,000 of the public annotations have less than 50% reciprocal identity relative to their ASRG counterparts.
Discussion: Public genomics databases are unable to curate their immense library of submitted assembly data to ensure the utmost quality. The de facto usage of these databases in modern research coupled with the repercussions of incomplete metadata underscores the urgent need for a more stringent curation process, such that future research and public health are not burdened by unreliable data. Toward this goal, ATCC’s Enhanced Authentication Initiative aims to provide high-quality reference genomic data directly from ATCC source material.
Download the poster to learn about the importance of data provenance and how ATCC is addressing this issue.
David Yarmosh, MS
Senior Bioinformatician, ATCC
David Yarmosh is a senior bioinformatician in ATCC’s Sequencing and Bioinformatics Center. He’s a graduate of New York University’s Tandon School of Engineering. He has been working in large data aggregation and analysis since 2013 and microbial genomics with a focus on biosurveillance R&D efforts since 2016. David has led international training exercises in Peru and Senegal, sharing metagenomic analytical capabilities. His interests include genomics database construction, metadata collection, drug resistance mechanisms, bioinformatics standards, and machine learning. Since joining ATCC in 2020, he has helped develop the podcast Behind the Biology, which he now hosts.
Through the ATCC Genome Portal, you can easily search, access, and analyze hundreds of reference-quality genome sequences. Our optimized methodology is designed to achieve complete, circularized (when biologically appropriate), and contiguous genomic elements by using short-read (viruses) and hybrid (bacteria and fungi) assembly techniques. We take our workflow one step further by accompanying each stage of the process with rigorous quality control analyses that ensure our data are of the highest possible quality. Only data that pass all quality control criteria are published to the ATCC Genome Portal. Visit the portal today to find the high-quality data you need for your research.