What we found
Our analysis showed that although differences in sequencing and assembly methods affect results to some degree, the number and types of variations we found suggest that methodology is not the only factor. We believe many of these differences arise because laboratories and institutions often sequence material years after purchasing it from ATCC, which allows time for adaptation, contamination, or other issues to arise. Together, these differences in technology, methodology, and time likely explain why multiple GenBank assemblies of the same material have different genomic sequences.
Comparative Metrics for 1,113 ATCC Standard Reference Genomes (ASRGs) vs. RefSeq Assemblies. (A) Intersection of ASRGs vs. RefSeq for strains labeled as being from ATCC. The total number of RefSeq assemblies, allowing for strain redundancy, is shown in parentheses. (B) N50 variability of RefSeq vs. ASRGs by sequencing technology. Note that the scale is 1 × 10^6. (C) Differences in contig counts for ASRG vs. RefSeq assemblies. Positive values indicate that the RefSeq assembly had more contigs. (D) Ratios of ASRG N50 values (y-axis) to RefSeq N50 values (“public,” x-axis). Density along the diagonal indicates that many assemblies are similar, while density along the y-axis indicates ASRGs with higher N50 values. (E) GC% for ASRGs (y-axis) vs. RefSeq (x-axis). Nearly all assemblies have less than 0.1% difference in GC content. (F) Pairwise GC% differences between ASRGs and comparable RefSeq assemblies for the same strain.
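To make the metrics in the figure more concrete, here is a minimal Python sketch of how contig count, N50, and GC% might be computed for two assemblies of the same strain and then compared. It is an illustration only, not the pipeline used in our study: the function names (n50, gc_percent, summarize, compare) and the toy sequences are hypothetical, and contigs are assumed to be plain strings already loaded from FASTA files.

```python
from typing import Dict, List


def n50(contig_lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def gc_percent(contigs: List[str]) -> float:
    """GC content as a percentage of unambiguous (A/C/G/T) bases."""
    gc = at = 0
    for seq in contigs:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    return 100.0 * gc / (gc + at) if (gc + at) else 0.0


def summarize(contigs: List[str]) -> Dict[str, float]:
    """Collect the three per-assembly metrics named in the figure."""
    return {
        "contigs": len(contigs),
        "n50": n50([len(c) for c in contigs]),
        "gc_percent": gc_percent(contigs),
    }


def compare(asrg: List[str], public: List[str]) -> Dict[str, float]:
    """Compare an ASRG-style assembly with a public assembly of the same strain."""
    a, p = summarize(asrg), summarize(public)
    return {
        # Positive: the public assembly has more contigs (as in panel C).
        "contig_diff": p["contigs"] - a["contigs"],
        # Above 1: the ASRG-style assembly has the higher N50 (as in panel D).
        "n50_ratio": a["n50"] / p["n50"] if p["n50"] else float("inf"),
        # Absolute GC% difference in percentage points (as in panel F).
        "gc_diff": abs(a["gc_percent"] - p["gc_percent"]),
    }


if __name__ == "__main__":
    # Toy data (hypothetical): one closed contig vs. a fragmented assembly.
    asrg_contigs = ["ACGT" * 25]
    public_contigs = ["ACGT" * 10, "AATT" * 10, "GGCC" * 5]
    print(compare(asrg_contigs, public_contigs))
```

In practice, a fragmented assembly has more contigs and a lower N50 than a closed one. A GC% difference much larger than the sub-0.1% range seen for most pairs would be a reason to check for contamination or for gained or lost sequence.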
Why it’s important
When bacteria are serially passaged, strains can acquire adaptive mutations over time that result in phenotypic variability and genome-level differences as compared to the original material.
A recent study by Artuso et al that illustrates this point caught our attention. In this study, titled “Genome diversity of domesticated Acinetobacter baumannii ATCC 19606T strains,” the team raised significant concerns about current laboratory practices involving strains that are labeled as ATCC 19606 but whose genomic data differ dramatically from the data on the ATCC Genome Portal.
Whereas our manuscript takes a broad view across many genomes, Artuso et al inspected specific variations and suggested that most of them are directly tied to multiple culture passages prior to sequencing. In a dramatic example, a 52 kb prophage was lost after serial culturing in some labs, although it is present in material originating directly from ATCC. Consequently, these labs are using reference sequence information that does not reflect their existing stocks.
What’s the solution?
Artuso et al conclude that A. baumannii strains held in different labs likely differ significantly from the reference sequences they are compared against and need to be treated as different strains. There is another argument to be made from this research: serial culturing comes with distinct risk. A reference genomic sequence can only reflect the precise strain it was drawn from, which in turn reflects that strain’s source and culture conditions. Changes to these conditions or repeated passages can introduce variations, expected or unexpected, that alter the strain’s identity. These variations can reverberate throughout any academic research conducted with a serially cultured strain. Indeed, one could interpret the team’s conclusion regarding the evolutionary history of these stocks as calling into question the validity of any research using them.
Both the study by Artuso et al and our study are microcosms of a larger emerging problem: the chain of custody from biological material to omics data is often incomplete or missing. It must be maintained and reported so that ensuing research is well founded, authenticated, and valid. The solution is straightforward: start your research using authenticated material from ATCC to make sure your research results are sound and reproducible.
Read the research:
Comparative Metrics for 1,113 ATCC Standard Reference Genomes (ASRGs) vs. RefSeq Assemblies
Genome diversity of domesticated Acinetobacter baumannii ATCC 19606T strains
David Yarmosh, MS
Lead Bioinformatician, ATCC
David Yarmosh is a lead bioinformatician in ATCC’s Sequencing and Bioinformatics Center. He is a graduate of New York University’s Tandon School of Engineering. He has worked in large data aggregation and analysis since 2013 and in microbial genomics, with a focus on biosurveillance R&D efforts, since 2016. David has led international training exercises in Peru and Senegal, sharing metagenomic analytical capabilities. His interests include genomics database construction, metadata collection, drug resistance mechanisms, bioinformatics standards, and machine learning. Since joining ATCC in 2020, David has worked extensively in SARS-CoV-2 classification, epidemiology, and genomics evaluation, including enhanced and uniform variant reporting. He has contributed more broadly to genomics reporting and analytical standardization, and he has helped develop the podcast Behind the Biology, which he now hosts.
Ford Combs, PhD
Bioinformatician, Sequencing and Bioinformatics Center, ATCC
Ford Combs joined ATCC's Sequencing and Bioinformatics Center in January 2021. As a bioinformatician, he primarily works on ATCC's internal sequencing projects, either assembling and analyzing data or testing and improving bioinformatics pipelines. As the audio engineer for ATCC's podcast, Behind the Biology, Ford performs sound design and audio editing. He holds an MS and PhD in bioinformatics and computational biology from George Mason University. His dissertation focused on topological and machine learning-based approaches to protein secondary structure assignment.
Dig deeper into authentication and genomic sequencing
Discover the ATCC Genome Portal
The ATCC Genome Portal is a rapidly growing ISO 9001–compliant database of high-quality reference genomes from authenticated microbial strains in the ATCC collection. Through this cloud-based platform, you can easily access and download meticulously curated whole-genome sequences from your browser or our secure API. With high-quality, annotated data at your fingertips, you can confidently perform bioinformatics analyses and make insightful correlations.
Mycoplasma Contamination
Mycoplasmas frequently contaminate cell cultures. Discover how rapid and sensitive detection can prevent the costly effects of mycoplasma on your research projects.
Repairing Reproducibility
Reproducibility is an urgent problem. Explore what your colleagues believe is the source of the issue and what scientists can do to solve it.
References
1. Yarmosh, David A., Juan G. Lopera, Nikhita P. Puthuveetil, Patrick Ford Combs, Amy L. Reese, Corina Tabron, Amanda E. Pierola, James Duncan, Samuel R. Greenfield, Robert Marlow, Stephen King, Marco A. Riojas, John Bagnoli, Briana Benton, and Jonathan L. Jacobs. “Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies.” bioRxiv 2021.12.14.472616. https://doi.org/10.1101/2021.12.14.472616.
2. Artuso, Irene, Massimiliano Lucidi, Daniela Visaggio, Giulia Capecchi, Gabriele Andrea Lugli, Marco Ventura, and Paolo Visca. “Genome Diversity of Domesticated Acinetobacter baumannii ATCC 19606T Strains.” Microbial Genomics, 000749. Microbiology Society, 2022. https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000749.