From DNA discovery to data-driven biology
Over the past two decades, advances in sequencing technologies, computational biology, and artificial intelligence (AI) have fundamentally reshaped life sciences. Modeling, simulations, and insights that previously required years of effort can now be achieved in a fraction of the time.
Researchers are moving from isolated experiments to integrated, data-driven systems that combine genomics, transcriptomics, and other biological data layers to better understand disease and design new therapies.
This shift is enabling breakthroughs across many fields, including next-generation precision medicine, drug discovery, infectious disease research, and biosecurity.
This new era is not without risk, however, as it introduces a new dependency: the quality of the underlying data.
The growing risk of unvalidated genomic data
As the volume of genomic data continues to grow, so too does variability in data quality. Today, public repositories hold data from thousands of labs. Curation is mostly automated, and reference genomes are rarely updated retrospectively, even when methods and standards change.
As a result, public genomic datasets are often inconsistent in:
- How they are generated and validated
- Metadata completeness and consistency
- The use of standardized terminology
- Traceability to original biological materials
Over a quarter of foodborne microbiological samples in the public sequence database are missing key metadata attributes.1
Training computational models on low-quality or poorly characterized data can lead to flawed predictions, irreproducible results, and wasted time and resources. In a research environment increasingly driven by AI and large-scale data analysis, these risks are amplified.
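One way to make the metadata problem concrete is to audit records before they reach a training set. The sketch below is illustrative only: the field names in `REQUIRED_FIELDS` and the placeholder strings are assumptions chosen for the example, not an official metadata standard or any repository's schema.

```python
from typing import Iterable, List, Mapping, Optional

# Hypothetical required attributes, chosen for illustration only.
REQUIRED_FIELDS = (
    "collection_date",
    "geo_location",
    "isolation_source",
    "source_material_id",
)

# Placeholder strings that are often entered in place of a real value.
PLACEHOLDERS = {"", "missing", "unknown", "not collected", "not provided", "n/a", "na"}


def is_missing(value: Optional[object]) -> bool:
    """Treat None and common placeholder strings as missing."""
    return value is None or str(value).strip().lower() in PLACEHOLDERS


def missing_fields(record: Mapping[str, object]) -> List[str]:
    """Return the required fields that are absent or unusable in one record."""
    return [field for field in REQUIRED_FIELDS if is_missing(record.get(field))]


def incomplete_fraction(records: Iterable[Mapping[str, object]]) -> float:
    """Return the share of records missing at least one required field."""
    records = list(records)
    if not records:
        return 0.0
    incomplete = sum(1 for record in records if missing_fields(record))
    return incomplete / len(records)
```

Running `incomplete_fraction` over a candidate dataset gives a quick signal of how much of it would need curation, and `missing_fields` shows which attributes are the problem for a given record.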
Why reproducibility depends on trusted genomic data
Reproducibility has long been a cornerstone of scientific progress. For genomic data to be truly reliable, researchers must be able to answer three fundamental questions:
- Where did these data come from?
- How were they generated and validated?
- Can they be traced back to a known, authenticated biological source?
Without clear answers, even the most sophisticated analyses can rest on uncertain foundations.
This is why traceability, the ability to link digital data back to physical biological materials, is so important today. It provides a path to validate findings, replicate experiments, and ensure that insights derived from data reflect real-world biology.
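The three questions above can be expressed as a simple check against a catalog of source materials. This is a minimal sketch under stated assumptions: the `SourceMaterial` and `GenomeDataset` classes, their fields, and the identifiers used with them are hypothetical and do not represent any real database schema or API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SourceMaterial:
    """A physical biological material (hypothetical catalog entry)."""
    material_id: str
    organism: str
    authenticated: bool


@dataclass(frozen=True)
class GenomeDataset:
    """A digital dataset and the provenance recorded for it (hypothetical)."""
    dataset_id: str
    origin_lab: Optional[str]   # Where did these data come from?
    method: Optional[str]       # How were they generated and validated?
    material_id: Optional[str]  # Link back to the physical source


def trace(dataset: GenomeDataset, catalog: Dict[str, SourceMaterial]) -> Dict[str, bool]:
    """Answer the three provenance questions for one dataset."""
    material = catalog.get(dataset.material_id) if dataset.material_id else None
    return {
        "origin_known": bool(dataset.origin_lab),
        "method_documented": bool(dataset.method),
        "traceable_to_authenticated_source": (
            material is not None and material.authenticated
        ),
    }
```

A dataset whose `material_id` does not resolve to an authenticated catalog entry fails the third check even if its origin and method are documented, which is the gap traceability is meant to close.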
ATCC’s role in setting and advancing the standard for quality
For more than a century, ATCC has supported the global scientific community by providing authenticated biological materials and establishing standards for quality, consistency, and traceability.
That role is more important than ever.
As biology becomes increasingly digital, the need for trusted inputs has expanded beyond physical materials to include the data derived from them. ATCC’s approach is grounded in the same principles that have guided its work for decades: authentication, standardization, and rigorous quality control.
We provide authenticated physical material coupled with reference-quality genome sequences. Data are fully traceable and authenticated to ATCC materials, and all genome assemblies are produced in-house at ATCC in an ISO-certified laboratory. More than 98% of our assemblies were shown to be more complete and of higher quality than NCBI RefSeq bacterial assemblies.2
This comprehensive approach creates a stable foundation for downstream research, whether conducted with physical materials or digital counterparts.
The ATCC Genome Portal: A key to trusted digital biology
Extending this commitment, the ATCC Genome Portal provides researchers with access to one of the world’s largest sets of curated, reference-quality genomic data (currently including 6,750 genomes, 950 exomes, and 3,000 transcriptomes) derived from authenticated microbes and cell lines. The collection is expanding rapidly, with new datasets added on an ongoing basis.
Unlike many publicly available –omics datasets, those in the ATCC Genome Portal are directly linked to physical source materials, enabling a clear line of traceability from digital data back to the original biological sample. Each dataset is supported by standardized metadata and quality metrics, giving researchers greater confidence in how the data were generated and how they can be used.
The result is a resource designed not just for access but for reliability: it can support more consistent analyses, stronger models, and more reproducible outcomes.
Looking ahead
DNA Day is an opportunity to reflect on how far science has come. It is also a moment to look forward.
As biology continues to evolve into a data-driven discipline, the next era of discovery will be shaped not just by how much data we generate, but by how much we can trust it. Ensuring the quality, integrity, and traceability of genomic data will be critical to unlocking the full potential of digital biology and to advancing science in a way that is both innovative and responsible.
Learn more about how the ATCC Genome Portal is supporting high-quality, reproducible genomic research.
Did you know?
The ATCC Genome Portal can support comparative genomics studies, biomarker or genetic variant discovery, artificial intelligence model training, and much more.
Meet the author
Jonathan Jacobs, PhD
Senior Director of Bioinformatics, ATCC
Dr. Jonathan Jacobs leads ATCC’s Sequencing & Bioinformatics Center and the development of the ATCC Genome Portal. He has over 20 years of experience in molecular genetics, bioinformatics, and microbial genomics, and he has worked throughout his career at the interface of academia, government, and industry. He holds a joint Research Professor appointment at Syracuse University’s Forensic & National Security Sciences Institute in support of microbial forensics graduate student training and research, and he actively collaborates with several US public health laboratories involved in pathogen genomics research and surveillance. Dr. Jacobs is also certified in Product Management from Pragmatic Institute, and he has led successful commercial launches of several bioinformatics products into the market.
Explore our featured resources
Discover the ATCC Genome Portal
The ATCC Genome Portal is a rapidly growing ISO 9001–compliant database of high-quality reference genomes from authenticated microbial strains in the ATCC collection. Through this cloud-based platform, you can easily access and download meticulously curated whole-genome sequences from your browser or our secure API. With high-quality, annotated data at your fingertips, you can confidently perform bioinformatics analyses and make insightful correlations.
The ATCC Genome Portal: Our Approach to Cell Line Whole-Exome and RNA Sequencing
Explore ATCC’s process for generating high-quality whole-exome and RNA sequencing data with details on extraction, sequencing, bioinformatics methods, and rigorous quality control standards.
Technical document
The ATCC Genome Portal: Our Approach to Microbial Whole-Genome Sequencing
Discover the features of the ATCC Genome Portal and understand the DNA extraction, sequencing, and bioinformatic methods we use to produce high-quality, reference-grade genomes.
References
1. Pettengill JB, et al. Interpretative labor and the bane of non-standardized metadata in public health surveillance and food safety. Clin Infect Dis 73(8): 1537-1539, 2021. PubMed: 34240118
2. Yarmosh DA, Lopera JG, Puthuveetil NP, et al. Comparative Analysis and Data Provenance for 1,113 Bacterial Genome Assemblies. mSphere 7(3): e0007722, 2022. PubMed: 35491842