From Double Helix to Digital Biology: The Data Quality Imperative

On April 25, we celebrate DNA Day, which marks two defining moments in modern science: the discovery of the DNA double helix in 1953 and the completion of the Human Genome Project in 2003. Together, these milestones transformed our understanding of life and set the stage for a new era of digital biology. The story of DNA has started a new chapter. Today, it’s not just about decoding DNA, but about how the resulting data are captured, curated, and applied to accelerate discovery.

From DNA discovery to data-driven biology

Over the past two decades, advances in sequencing technologies, computational biology, and artificial intelligence (AI) have fundamentally reshaped life sciences. Modeling, simulations, and insights that previously required years of effort can now be achieved in a fraction of the time.

Researchers are moving from isolated experiments to integrated, data-driven systems that combine genomics, transcriptomics, and other biological data layers to better understand disease and design new therapies.

This shift is enabling breakthroughs across many fields, including next-generation precision medicine, drug discovery, infectious disease research, and biosecurity.

This new era is not without risk, however, as it introduces a new dependency: the quality of the underlying data.

The growing risk of unvalidated genomic data

As the volume of genomic data continues to grow, so too does variability in data quality. Public datasets now draw on data from thousands of labs, curation is mostly automated, and reference genomes are rarely updated retrospectively, even when methods and standards change.

As a result, public genomic datasets are often inconsistent in:

How they are generated and validated
Metadata completeness and consistency
The use of standardized terminology
Traceability to original biological materials

Over a quarter of foodborne microbiological samples in the public sequence database are missing key metadata attributes.¹
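To make "metadata completeness" concrete, here is a minimal Python sketch of how a lab might measure it on its own records. It is an illustration only, not the method behind the figure above: the required field names, the placeholder values treated as missing, and the sample records are all hypothetical.

```python
# Hypothetical required attributes; real repositories define their own lists.
REQUIRED_FIELDS = ("collection_date", "geographic_location", "isolation_source")

# Placeholder strings that should count as missing (an assumption).
MISSING_VALUES = {"", "missing", "not collected", "not applicable", "n/a"}


def missing_fields(record):
    """Return the required fields that are absent or hold a placeholder value."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip().lower() in MISSING_VALUES:
            missing.append(field)
    return missing


def incomplete_fraction(records):
    """Return the fraction of records with at least one missing required field."""
    if not records:
        return 0.0
    incomplete = sum(1 for record in records if missing_fields(record))
    return incomplete / len(records)


# Made-up example records, for illustration only.
samples = [
    {"collection_date": "2021-03-02", "geographic_location": "USA", "isolation_source": "lettuce"},
    {"collection_date": "missing", "geographic_location": "USA", "isolation_source": "poultry"},
    {"collection_date": "2020-11-15", "geographic_location": "Canada"},
    {"collection_date": "2019-07-30", "geographic_location": "Mexico", "isolation_source": "cheese"},
]

print(f"Incomplete records: {incomplete_fraction(samples):.0%}")
```

Running a check like this before analysis or model training shows which records need follow-up or exclusion.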

When low-quality or poorly characterized data are used to train computational models, the result can be flawed predictions, irreproducible results, and wasted time and resources. In a research environment increasingly driven by AI and large-scale data analysis, these risks are amplified.

Why reproducibility depends on trusted genomic data

Reproducibility has long been a cornerstone of scientific progress. For genomic data to be truly reliable, researchers must be able to answer three fundamental questions:

Where did these data come from?
How were they generated and validated?
Can they be traced back to a known, authenticated biological source?

Without clear answers, even the most sophisticated analyses can rest on uncertain foundations.

This is why traceability, the ability to link digital data back to physical biological materials, is so important today. It provides a path to validate findings, replicate experiments, and ensure that insights derived from data reflect real-world biology.
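The three provenance questions can also be expressed as a simple check on a dataset record. The Python sketch below is one hypothetical way to do that. The record fields, the identifiers, and the gap labels are assumptions made for illustration and do not describe any particular database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenomeRecord:
    """Hypothetical provenance fields for one genome assembly."""
    assembly_id: str
    source_lab: Optional[str] = None          # Where did the data come from?
    sequencing_method: Optional[str] = None   # How were they generated?
    qc_passed: Optional[bool] = None          # Were they validated?
    source_material_id: Optional[str] = None  # Link to the physical material


def traceability_gaps(record):
    """List which of the three provenance questions the record cannot answer."""
    gaps = []
    if not record.source_lab:
        gaps.append("origin")
    if not record.sequencing_method or record.qc_passed is None:
        gaps.append("generation and validation")
    if not record.source_material_id:
        gaps.append("link to authenticated source material")
    return gaps


# Made-up records, for illustration only.
complete = GenomeRecord(
    assembly_id="EXAMPLE-001",
    source_lab="Example Lab",
    sequencing_method="short-read plus long-read hybrid",
    qc_passed=True,
    source_material_id="EXAMPLE-STRAIN-42",
)
bare = GenomeRecord(assembly_id="EXAMPLE-002")

print(traceability_gaps(complete))
print(traceability_gaps(bare))
```

A record that returns an empty list can answer all three questions. Any other result names the gap to resolve before the data are relied upon.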

ATCC’s role in setting and advancing the standard for quality

For more than a century, ATCC has supported the global scientific community by providing authenticated biological materials and establishing standards for quality, consistency, and traceability.

That role is more important than ever.

As biology becomes increasingly digital, the need for trusted inputs has expanded beyond physical materials to include the data derived from them. ATCC’s approach is grounded in the same principles that have guided its work for decades: authentication, standardization, and rigorous quality control.

We provide authenticated physical material coupled with reference-quality genome sequences. Data are fully traceable and authenticated to ATCC materials, and all genome assemblies are produced in-house at ATCC in an ISO-certified laboratory. More than 98% of our assemblies were proven more complete and of higher quality than NCBI RefSeq bacterial assemblies.²

This comprehensive approach creates a stable foundation for downstream research, whether conducted with physical materials or digital counterparts.

The ATCC Genome Portal: A key to trusted digital biology

Extending this commitment, the ATCC Genome Portal provides researchers with access to one of the world’s largest sets of curated, reference-quality genomic data (currently including 6,750 genomes, 950 exomes, and 3,000 transcriptomes) derived from authenticated microbes and cell lines. The collection is expanding rapidly, with new datasets added on an ongoing basis.

Unlike many publicly available –omics datasets, those in the ATCC Genome Portal are directly linked to physical source materials, enabling a clear line of traceability from digital data back to the original biological sample. Each dataset is supported by standardized metadata and quality metrics, giving researchers greater confidence in how the data were generated and how they can be used.

The result is a resource designed not just for access but for reliability: it can support more consistent analyses, stronger models, and more reproducible outcomes.

Looking ahead

DNA Day is an opportunity to reflect on how far science has come. It is also a moment to look forward.

As biology continues to evolve into a data-driven discipline, the next era of discovery will be shaped not just by how much data we generate, but by how much we can trust it. Ensuring the quality, integrity, and traceability of genomic data will be critical to unlocking the full potential of digital biology and to advancing science in a way that is both innovative and responsible.

Learn more about how the ATCC Genome Portal is supporting high-quality, reproducible genomic research.

Did you know?

The ATCC Genome Portal can support comparative genomics studies, biomarker or genetic variant discovery, artificial intelligence model training, and much more.

Meet the author

Jonathan Jacobs, PhD

Senior Director of Bioinformatics, ATCC

Dr. Jonathan Jacobs leads ATCC’s Sequencing & Bioinformatics Center and the development of the ATCC Genome Portal. He has over 20 years of experience in molecular genetics, bioinformatics, and microbial genomics, and he has worked throughout his career at the interface of academia, government, and industry. He holds a joint Research Professor appointment at Syracuse University’s Forensic & National Security Sciences Institute in support of microbial forensics graduate student training and research, and he actively collaborates with several US public health laboratories involved in pathogen genomics research and surveillance. Dr. Jacobs is also certified in Product Management from Pragmatic Institute, and he has led successful commercial launches of several bioinformatics products into the market.

Explore our featured resources

Discover the ATCC Genome Portal

The ATCC Genome Portal is a rapidly growing ISO 9001–compliant database of high-quality reference genomes from authenticated microbial strains in the ATCC collection. Through this cloud-based platform, you can easily access and download meticulously curated whole-genome sequences from your browser or our secure API. With high-quality, annotated data at your fingertips, you can confidently perform bioinformatics analyses and make insightful correlations.

Technical document

The ATCC Genome Portal: Our Approach to Cell Line Whole-Exome and RNA Sequencing

Explore ATCC’s process for generating high-quality whole-exome and RNA sequencing data with details on extraction, sequencing, bioinformatics methods, and rigorous quality control standards.