The ATCC Genome Portal: Our Approach to Whole-genome Sequencing

Technical Document

As life science research progresses, the quality of data becomes increasingly important. As part of our initiative to enhance the authentication of our products, we aim to enrich the characterization of our biological collections by providing the whole-genome sequences of the specific, authenticated materials you need to generate credible data.

The purpose of this technical documentation is to outline the features of the ATCC Genome Portal and provide comprehensive descriptions of the DNA or RNA extraction, sequencing, and bioinformatic methods we use to produce high-quality, reference-grade genomes.

ATCC Genome Portal

The ATCC Genome Portal offers more than just a collection of reference-grade bacterial, viral, fungal, and protist genomes originating from authenticated ATCC materials. It is a platform where users can interactively browse genomic data and metadata.

Portal Features

Browse and download whole-genome sequences and annotations for a variety of ATCC products. The collection currently includes thousands of bacterial, fungal, viral, and protist genomes, and new assemblies are released every quarter.
Search for nucleotide sequences or genes within published genomes.
Search for genomes by taxonomic name, taxonomic level, isolation source, ATCC catalog number, type strain status, biosafety level, or certain tags. Current tag options include Type Strain and MSA Component.

Type Strain: Strain with official standing in prokaryotic nomenclature
MSA Component: Component of an ATCC NGS Standard

View genome assembly statistics and quality metrics.
Identify the relatedness of published genomes by total genome alignment.
Purchase the corresponding authenticated ATCC source material.
Access the portal from the command line through an API (https://github.com/ATCC-Bioinformatics/genome_portal_api).

Our Approach to Genome Sequencing

After multiple decades of development in nucleic acid (DNA and RNA) sequencing, a plethora of techniques exist to sequence and assemble microbial genomes.¹⁻⁴ At ATCC, we are setting the scientific standard in best practices for whole-genome sequencing.

Recent innovations in second- and third-generation sequencing⁵⁻⁷ have made it possible to produce complete reference-grade microbial genomes and to improve the assembly contiguity of large and highly heterozygous fungal genomes. So-called hybrid assembly techniques⁸,⁹ achieve this by combining highly accurate Illumina short reads with the scaffolding ability of Oxford Nanopore Technologies (ONT) ultra-long reads (for additional details, see the Whole-Genome Sequencing and Genome Assembly sections).

The ATCC microbial whole-genome sequencing workflow is an optimized methodology designed to achieve complete, circularized (when biologically appropriate), and contiguous genomic elements by using short-read (RNA virology collection) and hybrid (bacteriology, mycology, and DNA virology collections) assembly techniques. This methodology comprises five primary steps:

Extraction of nucleic acids from authenticated ATCC strains
Sequencing of the nucleic acids
Assembly of sequencing data into a genome
Annotation of the resultant genome
Estimation of relatedness between a genome and all other genomes in our collection

Each step is accompanied by rigorous quality control methods and criteria to ensure that the data proceeding to the next step are the highest quality possible. Only the data that pass all quality control criteria are published to the ATCC Genome Portal. Although ATCC materials also undergo extensive quality control during growth, a description of those processes is outside the scope of this document. For more information, see the product sheet for each product.

In the sections below, the methods and/or bioinformatic tools used to accomplish each step are described alongside relevant scientific citations supporting that approach. In addition, methods and/or bioinformatic tools used to measure quality control criteria are described alongside relevant scientific citations supporting the use of that measurement.

Nucleic Acid Extraction

High-quality DNA or RNA extraction is the critical starting point to creating a complete reference-grade genome. ATCC uses several proprietary protocols to obtain high-molecular-weight extractions from our microbial portfolio; the method chosen is dependent on the organism undergoing extraction.

Whole-Genome Sequencing

To generate the best quality sequencing data for our genome assemblies, we perform a single DNA or RNA extraction. ATCC uses both Illumina and Oxford Nanopore Technologies (ONT) platforms for sequencing. See the collection sections below for greater detail.

Illumina Sequencing

Illumina (DNA and RNA) libraries are prepared using the latest and most reliable library preparation kits available. Libraries are subsequently sequenced on an Illumina instrument (MiSeq® or NextSeq 2000®), producing a paired-end read set per sample. The degree of sample multiplexing is based on the estimated genome size of a given organism and the amount of data necessary to generate at least 100X depth of the genome with the Illumina read set. Resultant reads are adapter trimmed using the adapter trimming option on the Illumina instrument. Periodic updates to the instruments’ software are performed when they are made available by the manufacturer to ensure that the latest version of instrument software is used for base-calling and adapter trimming for a given sequencing date.
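The relationship between genome size, target depth, and multiplexing can be sketched with simple arithmetic. The example below is illustrative only: the read length (2 x 150 bp), the run output, and the function names are assumptions, not values or tools stated in this document.

```python
import math

def read_pairs_needed(genome_size_bp, target_depth=100.0, read_length_bp=150):
    """Read pairs required so that total bases / genome size >= target depth.

    read_length_bp is an assumed per-read length; each pair contributes two reads.
    """
    bases_per_pair = 2 * read_length_bp
    return math.ceil(genome_size_bp * target_depth / bases_per_pair)

def max_samples_per_run(run_output_bp, genome_size_bp, target_depth=100.0):
    """Largest number of equally sized samples that still reach the target depth.

    run_output_bp is a hypothetical usable yield for one sequencing run.
    """
    return math.floor(run_output_bp / (genome_size_bp * target_depth))

# Example: a 5 Mbp bacterial genome on a run assumed to yield 15 Gbp.
print(read_pairs_needed(5_000_000))
print(max_samples_per_run(15_000_000_000, 5_000_000))
```

In practice, uneven pooling and reads lost to trimming mean that a safety margin above the minimum depth is usually planned for.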

Oxford Nanopore Technologies Sequencing

ONT libraries are prepared using the latest and most reliable DNA sequencing kits available, then sequenced on an ONT instrument (GridION) with the latest and most reliable flow cell version available. The degree of sample multiplexing is based on the estimated genome size of a given organism. Flow cells are run on the instrument for at least 48 hours. Periodic updates to the instruments’ software are performed when they are made available by the manufacturer to ensure that the latest version of ONT software is used for sequencing and base-calling for a given sequencing date.

After base-calling, all resultant FASTQs are combined and then demultiplexed using MinKNOW with the barcode removal settings turned on.

Illumina Data Quality Control

Illumina read sets commonly contain flanking low-quality regions and portions of Illumina adapter sequence; removing these regions can substantially improve genome assemblies.¹⁰ To accomplish this, we perform a second round of adapter and quality filtering using fastp. This also ensures the removal of adapter sequences otherwise missed by Illumina software. After Illumina read sets undergo quality and adapter trimming, we assess the quality of the read set by using FastQC. Illumina reads must pass the following quality control:

Median Q score, all bases > 30
Median Q score, per base > 25
Ambiguous content (% N bases) < 5%
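The three thresholds above can be expressed as a small gate. This is a minimal sketch rather than the FastQC implementation: it assumes Phred+33 encoded quality strings and interprets "per base" as "per read position"; the function names are hypothetical.

```python
from statistics import median

def illumina_qc_metrics(records, phred_offset=33):
    """Compute QC metrics from (sequence, quality_string) tuples."""
    all_scores = []
    by_position = {}          # read position -> list of Q scores at that position
    n_bases = 0
    total_bases = 0
    for seq, qual in records:
        for pos, (base, char) in enumerate(zip(seq, qual)):
            q = ord(char) - phred_offset
            all_scores.append(q)
            by_position.setdefault(pos, []).append(q)
            total_bases += 1
            if base in "Nn":
                n_bases += 1
    return {
        "median_q_all_bases": median(all_scores),
        "lowest_median_q_per_position": min(median(v) for v in by_position.values()),
        "percent_n": 100.0 * n_bases / total_bases,
    }

def passes_illumina_qc(metrics):
    """Apply the thresholds listed above: > 30, > 25, and < 5% N."""
    return (metrics["median_q_all_bases"] > 30
            and metrics["lowest_median_q_per_position"] > 25
            and metrics["percent_n"] < 5.0)
```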

Oxford Nanopore Technologies Data Quality Control

ONT ultra-long reads are critical for scaffolding over the low-complexity regions of bacterial and fungal genomes during hybrid assembly, but they have limited influence in determining base identity given enough Illumina depth.⁷⁻⁹ Given the lower quality of ONT sequencing data, all reads are trimmed and filtered to remove low-quality regions. The quality control metrics applied to all ONT read sets are:

Minimum mean Q score, per read > 10
Minimum read length > 1000 bp

To perform this quality control step, we employ Filtlong on demultiplexed ONT read sets in addition to barcode sequence removal during demultiplexing.
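The two ONT thresholds can be illustrated with a per-read filter. This sketch is not Filtlong and does not reproduce its scoring; it assumes Phred+33 quality strings and uses the arithmetic mean of Phred scores, whereas some tools average error probabilities instead. The function names are hypothetical.

```python
def mean_phred(quality_string, phred_offset=33):
    """Arithmetic mean of the Phred scores in one read (an assumed definition)."""
    return sum(ord(c) - phred_offset for c in quality_string) / len(quality_string)

def keep_ont_read(seq, quality_string, min_mean_q=10.0, min_length=1000):
    """Keep a read only if it is longer than min_length and its mean Q exceeds min_mean_q."""
    # The length check runs first, so empty reads never reach the division above.
    return len(seq) > min_length and mean_phred(quality_string) > min_mean_q
```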

Read-Based Contamination Quality Control with One Codex

ATCC employs state-of-the-art methods to detect contamination during the growth phase of our product production. To complement this approach, we use the One Codex microbial genomics platform¹¹ to perform read-level k-mer–based¹² taxonomic classification and estimation of strain abundances on our processed Illumina read sets, which represent a highly accurate snapshot of a given DNA extraction. A minimum of 1,000,000 Illumina reads per sequenced sample is required to undergo such analysis; Illumina read sets otherwise passing quality control criteria but possessing fewer than 1,000,000 reads are sent for re-sequencing. When an Illumina read set is confirmed as an isolate by the One Codex platform, all read sets from that extraction continue to genome assembly. Please note that the results of this read-based analysis are not currently presented on the portal but that all published genomes have passed our stringent thresholds for purity.
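The read-count gate described above can be sketched as follows. The isolate call itself comes from the One Codex platform and is represented here only as a boolean input; the routing labels, including "hold" for read sets not confirmed as isolates, are hypothetical names chosen for illustration.

```python
MIN_READS = 1_000_000  # minimum Illumina reads required for contamination analysis

def count_fastq_reads(handle):
    """Count records in an uncompressed FASTQ stream, assuming four lines per record."""
    return sum(1 for _ in handle) // 4

def route_read_set(n_reads, confirmed_isolate, min_reads=MIN_READS):
    """Decide the next step for a read set that has already passed read-level QC."""
    if n_reads < min_reads:
        return "resequence"
    return "assemble" if confirmed_isolate else "hold"
```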


Genome Assembly

Hybrid Assembly

Hybrid assembly is a state-of-the-art technique that uses both highly accurate Illumina short reads and ultra-long scaffolding ONT reads.¹³ For bacterial assemblies, this technique uses Unicycler, which begins with an optimized Illumina assembly. The longest of these Illumina-based contigs are then assembled alongside the ONT reads; this combined assembly then undergoes multiple rounds of both long-read and short-read polishing.⁸ Because occasional sequencing and assembly artifacts appear as small contigs in the final assembly (so-called “chaff” contigs¹⁴), non-contiguous contigs less than 1000 bp with low relative depth are then removed to produce the final assembly. Please note that prior to bacterial genome assembly, the genome size is estimated using Mash, and the high-quality Illumina and ONT reads are down-sampled to 150X and 30X depth, respectively, based on the estimated genome size.
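Two steps in this paragraph reduce to simple calculations: the fraction of reads to retain when down-sampling, and the removal of short, low-depth contigs. The sketch below is an illustration under stated assumptions. The relative-depth cutoff (0.1) and the choice of the longest contig as the depth reference are hypothetical, since the document does not give these values, and the contig dictionaries are an assumed input format.

```python
def subsample_fraction(total_bases, genome_size_bp, target_depth):
    """Fraction of reads to keep so that average depth is about target_depth."""
    current_depth = total_bases / genome_size_bp
    return min(1.0, target_depth / current_depth)

def remove_chaff(contigs, min_length=1000, min_relative_depth=0.1):
    """Drop contigs that are both short and shallow relative to the longest contig.

    contigs: list of dicts with 'name', 'length', and 'depth' keys (assumed format).
    """
    reference_depth = max(contigs, key=lambda c: c["length"])["depth"]
    kept = []
    for contig in contigs:
        short = contig["length"] < min_length
        shallow = contig["depth"] / reference_depth < min_relative_depth
        if not (short and shallow):
            kept.append(contig)
    return kept
```

For example, 2 Gbp of Illumina data for a genome estimated at 5 Mbp is 400X, so 37.5% of the reads would be kept to reach 150X.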

For fungal assemblies, we down-sample the reads as for the bacterial assemblies and then use the MaSuRCA pipeline, a hybrid assembly algorithm that combines Illumina and ONT reads to construct long and accurate mega-reads, with the Flye assembler.¹⁵ MaSuRCA was chosen for its strengths with large genomes.⁹

Genome Assembly Quality Control

Illumina Read Set Coverage

Although the depth of Illumina reads required is influenced by numerous factors (including, but not limited to, microbial strains),¹⁶,¹⁷ Illumina read sets should be sufficient to cover the entire genome to obtain the most accurate base determination.¹⁸ To account for variance in distribution of coverage per base, we require a minimum of 100X average depth for Illumina reads.

Bacterial Completeness and Contamination

To ensure our bacterial assembly process has correctly captured the entirety of a given strain’s genome, and to confirm the absence of contamination in the assembly, we pass finalized assemblies through CheckM.¹⁹ Briefly, CheckM uses a set of Hidden Markov Models (HMMs) from phylogenetically close bacterial and archaeal reference genomes to determine if the query assembly contains all expected HMMs as predicted by the reference genomes (a percentage called “CheckM completeness”), and it evaluates what percent of the query’s HMMs differ in copy number or come from reference genomes that are phylogenetically distant (called “CheckM contamination”). We require final assemblies to have completeness values ≥ 95% and contamination values ≤ 5% (i.e., within the margin of error for 100% completeness and 0% contamination, which marks them as excellent reference sequences according to the authors of CheckM).

Mycology Completeness and Contamination

For mycology genomes, we estimate completeness using BUSCO.²⁰ BUSCO is a tool and database combination widely used in the mycology field that examines the presence of a selection of universal single-copy orthologs for quantitative completeness calculations. We use fungi-specific databases in which orthologs must be identified in at least 90% of the fungal species, and no single-copy ortholog can be entirely missing from any sub-clade in the databases. Unlike CheckM, BUSCO does not calculate % contamination. We require fungal assemblies to have a completeness value of ≥ 80%.

Viral Genome Assembly and Quality Assessment

As viruses are co-cultured with their host, viral DNA or RNA sequencing data may contain reads from both the host and the virus, and de novo assemblies may contain contigs from both the host and viral genome.²¹ To produce an assembly containing contigs of a single virus, host reads can be removed or contigs can be binned taxonomically.²² Taxonomic binning is performed by aligning reads or contigs to One Codex’s curated NCBI Reference Sequence Database. Reads or contigs that align to “cellular organisms” are binned as nonviral, while those that do not are binned as viral. For our approach, high-quality viral reads are used for de novo assembly using SPAdes.²³ To achieve the goal of obtaining complete assemblies for a single virus, the contig binning approach is used. Contigs that align to the Escherichia coli bacteriophage Phi-X 174 genome are excluded, as this genome is used as a DNA spike-in for Illumina sequencing.
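The binning rule can be sketched as a function over a contig's taxonomic lineage. The alignment itself is performed against the One Codex database and is not shown; the lineage-as-list input format, the bin labels, and the name matching used to spot the Phi-X spike-in are assumptions made for illustration.

```python
def bin_contig(lineage):
    """Assign a contig to a bin from the lineage names of its best alignment.

    lineage: list of taxon names from root to leaf (assumed input format).
    """
    names = [name.lower() for name in lineage]
    # The Illumina spike-in control is excluded regardless of other labels.
    if any("phix" in name or "phi-x" in name for name in names):
        return "spike_in"
    if "cellular organisms" in names:
        return "nonviral"
    # Everything that does not align to cellular organisms is treated as viral.
    return "viral"
```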

In addition to the problem of taxonomic binning, viral genomes are diverse in structure with many viruses having multipartite genome segments; the genome of Influenza A virus, for example, consists of 8 separate strands of RNA.²⁴ To determine whether an assembly contains all the necessary segments, a curated database of complete viral genomes and segment information was constructed. After taxonomic binning, contigs are then aligned to the Viral Genomes-NCBI-NIH database to apply segment labels, segment depth, and percent identity to the closest reference.
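Once contigs carry segment labels, checking whether an assembly contains every expected segment is a set comparison. The sketch below assumes a hypothetical input of (contig, segment label, percent identity) tuples and uses the eight Influenza A segment names as the example; it does not reflect the structure of the curated database.

```python
INFLUENZA_A_SEGMENTS = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]

def segment_report(expected_segments, contig_hits):
    """Summarize which expected segments are missing or split across contigs.

    contig_hits: list of (contig_name, segment_label, percent_identity) tuples.
    """
    contigs_by_segment = {}
    for contig_name, segment, _identity in contig_hits:
        contigs_by_segment.setdefault(segment, []).append(contig_name)
    missing = [s for s in expected_segments if s not in contigs_by_segment]
    fragmented = [s for s in expected_segments
                  if len(contigs_by_segment.get(s, [])) > 1]
    return {
        "missing": missing,
        "fragmented": fragmented,
        "all_segments_single_contig": not missing and not fragmented,
    }
```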

Genome Annotation

Bacteriology Genome Annotation

There are currently several approaches for bacterial genome annotations.²⁵⁻²⁷ As such, we make our finalized genome assembly FASTA files available for download from our genome portal and encourage our customers to conduct their own custom annotations of the ATCC reference-grade genomes if they so choose. However, we also recognize the need for a rapidly accessible annotation in a common format for those looking to perform immediate data analysis at the gene level. To address this need, we provide a default genome annotation for ATCC reference-grade genomes with NCBI’s Prokaryotic Genome Annotation Pipeline (PGAP).²⁶ While bacterial assemblies from 2019 to 2023 were initially annotated using Prokka,²⁵ new and existing assemblies are currently being annotated using NCBI’s PGAP. PGAP combines ab initio gene prediction algorithms with homology-based methods. PGAP leverages the Protein Family Models collection for structural and functional annotation. This collection is composed of Hidden Markov Model (HMM), BLAST (BlastRules), and Conserved Domain Database-based architectures (CDDs) to assign names, gene symbols, publications, and EC numbers to the proteins that meet criteria for protein family inclusion. On the ATCC Genome Portal, all annotated CDSs include their EC number and UniProt ID as reported by PGAP.

Mycology Genome Annotation

During completeness calculations for mycology genomes, BUSCO²⁰ generates annotations of universal single-copy orthologs, which we make available in the genome portal. BUSCO uses Augustus (trained on BUSCO databases), tBLASTn, and HMMER3 to automatically predict and annotate single-copy coding regions of mycological genomes according to their closest relatives on fungi-specific databases.

Viral Genome Annotation and Variant Detection

Viral assemblies draw gene annotations from the closest reference sequence in the Viral Genomes-NCBI-NIH database by using a customized Python script.

To call genomic variants, the depth-masked, SPAdes de novo assembly is aligned to a reference sequence using MAFFT,²⁹ which is a tool for multiple sequence alignments. Briefly, multiple sequence alignments are converted to a table of variants by using custom scripts. The table of variants is then joined to the reference assembly's genome annotation to produce a table of variants and their overlapping annotations. Variants that extend to the end of a segment sequence are excluded as they are likely truncations and not true biological variants. Variants that contain entirely ambiguous nucleotides in the reference or alternate sequence are also excluded from reporting.
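The conversion from an alignment to a variant table, together with the two exclusion rules, can be illustrated on a single pair of aligned sequences. This is a minimal sketch and not the custom scripts referred to above: it assumes the reference and assembly rows have been taken from the MAFFT output as equal-length strings with "-" for gaps, treats only N as ambiguous, and uses a hypothetical function name.

```python
def call_variants(ref_aligned, alt_aligned, ambiguous="Nn"):
    """Return (position, ref_allele, alt_allele) tuples from two aligned rows.

    position is the 1-based reference coordinate of the first reference base at
    or after the start of the variant; '-' marks an empty allele.
    """
    if len(ref_aligned) != len(alt_aligned):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    n = len(ref_aligned)
    ref_pos = 0  # reference bases consumed so far
    i = 0
    while i < n:
        if ref_aligned[i].upper() == alt_aligned[i].upper():
            if ref_aligned[i] != "-":
                ref_pos += 1
            i += 1
            continue
        # Extend over a run of consecutive differing columns.
        start, start_pos = i, ref_pos
        while i < n and ref_aligned[i].upper() != alt_aligned[i].upper():
            if ref_aligned[i] != "-":
                ref_pos += 1
            i += 1
        ref_allele = ref_aligned[start:i].replace("-", "")
        alt_allele = alt_aligned[start:i].replace("-", "")
        # Rule 1: variants reaching either end are likely truncations.
        touches_end = start == 0 or i == n
        # Rule 2: skip variants whose ref or alt allele is entirely ambiguous.
        all_ambiguous = any(a and all(c in ambiguous for c in a)
                            for a in (ref_allele, alt_allele))
        if not touches_end and not all_ambiguous:
            variants.append((start_pos + 1, ref_allele or "-", alt_allele or "-"))
    return variants
```

Joining the resulting positions to the reference annotation is then an interval-overlap lookup against the annotated feature coordinates.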

Calculation of Assembly Level

Assemblies are ranked into different categories based on the NCBI Assembly Level definitions:

Bacteriology

Complete – Hybrid assemblies with ≥ 95% completeness as estimated by CheckM. All contigs are fully circularized.
Scaffold – Hybrid assemblies with ≥ 95% completeness as estimated by CheckM.
Contig – All remaining assemblies. Contig-level assemblies are not published in the genome portal.

Mycology

Complete – Hybrid assemblies with ≥ 80% completeness as estimated by BUSCO, with a single contig per chromosome.
Scaffold – Hybrid assemblies with ≥ 80% completeness as estimated by BUSCO.
Contig – All remaining assemblies. Contig-level assemblies are not published in the genome portal.

Virology

Complete – Assemblies with ≥ 80% completeness with all segments present and each represented by a single contig. Only complete assemblies are published in the genome portal.
Scaffold – Not used.
Contig – All remaining assemblies. Contig-level assemblies are not published in the genome portal.
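The assembly level rules listed above for the three collections can be summarized as a single decision function. This is a sketch of the stated rules only; the function and parameter names are hypothetical, and `fully_resolved` stands for "all contigs circularized" (bacteriology), "a single contig per chromosome" (mycology), or "all segments present, each as a single contig" (virology).

```python
COMPLETENESS_THRESHOLDS = {"bacteriology": 95.0, "mycology": 80.0, "virology": 80.0}

def assembly_level(collection, completeness, fully_resolved, is_hybrid=True):
    """Return 'Complete', 'Scaffold', or 'Contig' for one assembly."""
    meets_threshold = completeness >= COMPLETENESS_THRESHOLDS[collection]
    if collection == "virology":
        # The Scaffold level is not used for viral assemblies.
        return "Complete" if meets_threshold and fully_resolved else "Contig"
    if meets_threshold and is_hybrid:
        return "Complete" if fully_resolved else "Scaffold"
    return "Contig"

def is_published(level):
    """Contig-level assemblies are not published in the genome portal."""
    return level != "Contig"
```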

Estimation of Genome Relatedness

ATCC’s reference-grade microbial genomes have even greater analytical power when considered in context of other closely related genomes in our database. To measure relatedness between our published genomes, we implement the most widely used approach: average nucleotide identity (ANI).³⁰ In this framework, ANI values greater than 95% between two genomes indicate that these genomes are derived from members of the same bacterial or archaeal species.³¹ Additionally, related members of the genus are determined by NCBI taxonomy.
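Given precomputed pairwise ANI values, applying the 95% species threshold is a simple filter and sort. The sketch below does not compute ANI itself; the dictionary of pairwise values and the genome names are hypothetical inputs.

```python
def same_species_genomes(query, pairwise_ani, threshold=95.0):
    """List genomes whose ANI to the query exceeds the species threshold.

    pairwise_ani: dict mapping (genome_a, genome_b) tuples to ANI in percent.
    Returns (genome, ani) tuples sorted from most to least similar.
    """
    hits = []
    for (genome_a, genome_b), ani in pairwise_ani.items():
        if genome_a == genome_b or query not in (genome_a, genome_b):
            continue
        if ani > threshold:
            other = genome_b if genome_a == query else genome_a
            hits.append((other, ani))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```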

Interactive Genome Search

A k-mer–based nucleotide search is used to power the interactive genome search feature on the portal.³² The sequence search matches all k-mers (k=31) in the query against all available ATCC reference genomes and highlights portions of the sequence that match. To call a hit, at least 40 k-mers and 80% of the sequence must match. Search results are listed in descending order by percent of matching k-mers.
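The search logic can be sketched with Python sets. This is a toy illustration, not the portal's implementation: it interprets "80% of the sequence" as 80% of the query's k-mers, ignores reverse complements, and holds the whole index in memory; all function names are hypothetical.

```python
def kmers(sequence, k=31):
    """All overlapping k-mers of a sequence, in order."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_index(genomes, k=31):
    """Map each genome name to the set of k-mers it contains."""
    return {name: set(kmers(seq, k)) for name, seq in genomes.items()}

def search(query, index, k=31, min_kmers=40, min_fraction=0.8):
    """Return (genome, percent_matching_kmers) hits, best match first."""
    query_kmers = kmers(query, k)
    hits = []
    for name, genome_kmers in index.items():
        matched = sum(1 for kmer in query_kmers if kmer in genome_kmers)
        if matched >= min_kmers and matched / len(query_kmers) >= min_fraction:
            hits.append((name, 100.0 * matched / len(query_kmers)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

With k=31, a query must be at least 70 bp long to contain the 40 k-mers needed for a hit.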


Reference-quality sequences

Through the ATCC Genome Portal, you can easily search, access, and analyze hundreds of reference-quality genome sequences. Our optimized methodology is designed to achieve complete, circularized (when biologically appropriate), and contiguous genomic elements by using short-read (viruses) and hybrid (bacteria and fungi) assembly techniques. We take our workflow one step further by accompanying each stage of the process with rigorous quality control analyses that ensure our data are the highest quality possible. Only data that pass all quality control criteria are published to the ATCC Genome Portal. Visit the portal today to find the high-quality data you need for your research.


Explore more resources

Microbial Reference Genome Authentication and Traceability

This is a poster presented at World Microbe Forum 2021 that explores the generation of high-quality, curated reference data.


Authenticated Microbial Reference Genomes for Microbiome Analysis

This poster presented at World Microbe Forum 2021 explores the generation of authenticated reference genomes for use in microbiome research workflows.


High-quality Genome Assemblies and Biosynthetic Gene Clusters Annotation from Laboratory Reference Fungal Strains

This poster presented at World Microbe Forum 2021 explores the generation of high-quality genome assemblies and biosynthetic gene clusters annotation from fungal strains.