To evaluate the quality of published ATCC genomes in public databases and to demonstrate the need for credible reference-quality genomes, we re-sequenced and analyzed a select group of strains. We randomly selected 100 bacterial strains identified in our genome database survey as having complete assemblies. Then, using nucleic acids extracted from low-passage ATCC bacterial cultures, we re-sequenced the selected strains and analyzed each sequence using customized reference-based assembly (short-read alignment/mapping to published genome sequences) and hybrid de novo assembly (short- and long-read analysis) workflows.
Preparation and quality control of DNA templates
Unlike many of the bacterial genome sequences deposited in public databases, our genome sequencing efforts began with ATCC authenticated strains that are fully traceable. This allows us to validate the source of the bacterial culture and genomic DNA and to link them to vital metadata, which supports downstream reference and analysis. Before assessing the quality of ATCC genomes present in public databases, we carefully reviewed the classification of the bacterial cultures and evaluated the quality and purity of the DNA template used for NGS. To support NGS library preparation for multiple sequencing platforms (long- and short-read), we used input DNA from one of two sources: authenticated and fully characterized ATCC nucleic acids obtained directly from our repository, or high-molecular-weight DNA (NGS-ready DNA) with fragment sizes larger than 20 kb extracted directly from our cultures. DNA quality was measured with a DNA fragment analyzer (Agilent®) and DNA quantity with a fluorescent dye-based method (PicoGreen®) (Figure 3, Table 2).
Figure 3. Quality assessment of NGS-ready DNA used in this study. The fragment size graphs obtained from the Agilent Fragment Analyzer platform show the size distribution of total DNA. The graphs depict example DNA quality assessments for (A) a Gram-negative strain, Escherichia coli (ATCC® 8739DX™), and (B) a Gram-positive strain, Staphylococcus aureus (ATCC® 6538DX™).
Table 2. Summary of DNA quality and quantity measurements before NGS
ATCC® no. | Species | PicoGreen® (ng/µL) | A260/A280 | DNA fragment size (range)**
8739DX™* | Escherichia coli | 101.9 | 1.92 | 49.5 kb (1.5 – >60 kb)
13048DX™* | Klebsiella aerogenes | 98.1 | 1.86 | 49.5 kb (1.6 – >60 kb)
11828DX™* | Cutibacterium acnes | 197.7 | 1.84 | 29.8 kb (0.8 – >60 kb)
6538DX™* | Staphylococcus aureus | 97.8 | 1.85 | 32.9 kb (2.7 – >60 kb)
BAA-2797DX™* | Pseudomonas aeruginosa | 153.3 | 1.99 | 44.1 kb (1.1 – >60 kb)
824D-5™ | Clostridium acetobutylicum | 73.8 | 2.05 | 12.5 kb (4.6 – 57.8 kb)
6538D-5™ | Staphylococcus aureus | 37.1 | 2.00 | 26.2 kb (6.9 – >60 kb)
27774D-5™ | Desulfovibrio desulfuricans | 69.2 | 1.99 | 58.5 kb (13.3 – >60 kb)
11842D-5™ | Lactobacillus delbrueckii | 64.8 | 2.02 | 41.9 kb (6.1 – >60 kb)
15697D-5™ | Bifidobacterium longum | 76.2 | 1.95 | 51.3 kb (10.5 – >60 kb)
*NGS-ready DNA
**DNA fragment size represents the main peak reported by the fragment analyzer
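The size criterion described above can be checked against Table 2 with a few lines of Python. This is only an illustrative sketch: it assumes the 20 kb cutoff is applied to the main peak reported by the fragment analyzer, which the text does not state explicitly, and the names `TABLE_2` and `below_size_cutoff` are hypothetical.

```python
# Values transcribed from Table 2: (ATCC no., marked NGS-ready, main peak in kb).
TABLE_2 = [
    ("8739DX", True, 49.5),
    ("13048DX", True, 49.5),
    ("11828DX", True, 29.8),
    ("6538DX", True, 32.9),
    ("BAA-2797DX", True, 44.1),
    ("824D-5", False, 12.5),
    ("6538D-5", False, 26.2),
    ("27774D-5", False, 58.5),
    ("11842D-5", False, 41.9),
    ("15697D-5", False, 51.3),
]

# From the text: NGS-ready DNA has fragment sizes larger than 20 kb.
# Applying the cutoff to the main peak is an assumption of this sketch.
MIN_MAIN_PEAK_KB = 20.0


def below_size_cutoff(rows, cutoff_kb=MIN_MAIN_PEAK_KB):
    """Return the ATCC numbers whose main fragment peak is not above the cutoff."""
    return [atcc for atcc, _ngs_ready, peak_kb in rows if peak_kb <= cutoff_kb]


ngs_ready_rows = [row for row in TABLE_2 if row[1]]
print(below_size_cutoff(ngs_ready_rows))  # NGS-ready preparations only
print(below_size_cutoff(TABLE_2))         # every DNA template in Table 2
```

Under this reading, all preparations marked as NGS-ready clear the cutoff, and the only template with a main peak below 20 kb is a repository nucleic acid (824D-5™).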
High-quality NGS sequences
Because NGS has emerged as a sensitive and precise tool for microbial characterization, diagnostics, and discovery, assessing the quality of raw NGS data has become indispensable for ensuring the credibility of assemblies and the annotation of reference genomes.6,14,15 Public databases require some data quality information when raw sequence data are submitted. The Sequence Read Archive (SRA) requires per-base quality scores for all submitted sequences. For genome assemblies, whole-genome sequencing (WGS) submissions request base-level quality information, but the corresponding files are not strictly required. However, there are no standardized sequence quality thresholds that measure or regulate the quality of the genomic information deposited in public databases.15-17 For this reason, we have developed and implemented a rigorous quality control protocol that includes analysis of raw sequence quality scores, removal (trimming) of low-quality segments and undefined nucleotides, and read-based contamination screening via the One Codex database (Figure 4). For additional details on the quality control processes we have implemented, see the ATCC Genome Portal Technical Document.
Figure 4. ATCC’s bacterial genome sequencing quality control. The dashed line indicates the quality score cutoff used for each sequencing technology. (A) Quality of Illumina reads. (B) Length distribution of reads from the Oxford Nanopore Technologies (ONT) platform. ONT raw reads were evaluated by measuring read length N50 (>5000kb), quality scores (>10), and the total yield of each sequencing run, which ensures that the longest, highest-quality reads are used for assembly. (C) Sample composition, determined by aligning each individual read to a reference database. We use the One Codex microbial genomics platform to perform read-level, k-mer–based taxonomic classification and estimation of strain abundances on our processed Illumina read sets.
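To make the trimming step concrete, the following Python sketch removes low-quality and undefined (N) bases from both ends of a single read. It is not the ATCC pipeline: the function names, the Phred+33 encoding, and the `min_q=20` threshold are assumptions chosen for illustration, and production workflows would normally rely on a dedicated trimming tool.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string (Phred+33 encoding assumed) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_read(seq, quality_string, min_q=20):
    """Trim low-quality or undefined (N) bases from both ends of one read.

    min_q is an illustrative threshold, not a value taken from the text.
    Only the ends are trimmed; N bases inside the retained span are kept.
    Returns the trimmed sequence and its matching quality string.
    """
    quals = phred33(quality_string)
    # A base is acceptable if it is a defined nucleotide with sufficient quality.
    keep = [q >= min_q and base.upper() != "N" for base, q in zip(seq, quals)]
    if True not in keep:
        return "", ""  # nothing in the read survives trimming
    start = keep.index(True)
    end = len(keep) - keep[::-1].index(True)
    return seq[start:end], quality_string[start:end]


# Two leading N bases and two low-quality trailing bases are removed.
trimmed_seq, trimmed_qual = trim_read("NNACGTACGTAA", "##IIIIIIII##")
print(trimmed_seq, trimmed_qual)
```

Reads that come back empty, or shorter than some minimum length, would then be discarded before alignment or assembly.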
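The read length N50 mentioned in the caption is the length L for which reads of length L or longer contain at least half of all sequenced bases. A minimal Python sketch of that calculation follows; the function name `read_n50` is hypothetical, and the read lengths in the example are invented.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths.

    Reads are visited from longest to shortest; the N50 is the length of the
    read at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("read_n50 requires at least one read length")
    half_of_bases = sum(lengths) / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half_of_bases:
            return length


# Invented example: 54 bases in total, so half is 27.
# The longest reads add up as 10, 19, 27, so the N50 is 8.
print(read_n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because the N50 is weighted by bases rather than by read count, a run with many short reads and a few very long ones can still report a high N50, which is why it is evaluated together with quality scores and total yield.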
Reference-based analysis and variant calling
We evaluated the level of genetic variation between published sequences and NGS sequences obtained directly from ATCC cultures. For 100 sequences identified as ATCC materials in public databases, we ran a reference-based analysis tool on our short reads to identify single nucleotide variations (SNVs) and small insertions/deletions (indels). Briefly, high-accuracy, high-coverage (>100x) Illumina sequences (MiSeq PE 2x300) from ATCC DNAs corresponding to the selected strains were aligned and mapped to the published reference assemblies. The variant threshold was fixed at an average variant coverage greater than 100x. To validate our results, all of the sequences were first passed through the previously described quality control filter, and then six randomly selected strains were sequenced and analyzed in duplicate (Table 4).
Our results demonstrated that approximately 33% of the 100 strains evaluated had fewer than 50 variants (SNVs and indels); 14 strains showed low sequence variation with fewer than 5 variants, and 8 strains showed large sequence variation with more than 500 variants detected. When SNVs and indels were evaluated separately, we found that 18% of the strains exhibited more than 50 SNVs and 37% of the public genomes displayed more than 25 indels. Interestingly, 14 of the selected ATCC strains analyzed from public databases had more than one assembly record, and for three of these the two assemblies of the same strain reported different numbers of plasmids (Table 3). Overall, we found that a considerable number of sequenced ATCC strains contain significant variations compared to their public database counterparts. Without accurate metadata and sample traceability, it is difficult to identify the source of the variation. In some cases, these variations may be attributable to incorrect identification of the ATCC isolate before the sequence was submitted (e.g., sequencing a strain other than the intended ATCC strain). In other cases, the variations may have been caused by differences in strain propagation, DNA extraction, sequencing quality, or downstream assembly analysis, any of which could influence the overall quality of data in historical sequencing databases.
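The per-strain summary reported in the tables (number of SNVs, number of indels, and average variant coverage) can be sketched as follows. This is an illustration rather than the analysis tool used in the study: the `Variant` record layout and the function name are hypothetical, and the example calls are invented.

```python
from collections import namedtuple

# Hypothetical record layout for one variant call against the published reference.
Variant = namedtuple("Variant", "pos ref alt depth")


def summarize_variants(variants):
    """Count SNVs and indels and average the read depth at the variant sites.

    An SNV replaces one base with one base; an indel changes the allele length.
    Substitutions longer than one base are not counted in either class here.
    """
    snvs = sum(1 for v in variants if len(v.ref) == 1 and len(v.alt) == 1)
    indels = sum(1 for v in variants if len(v.ref) != len(v.alt))
    mean_depth = sum(v.depth for v in variants) / len(variants) if variants else 0.0
    return {
        "snvs": snvs,
        "indels": indels,
        "total": snvs + indels,
        "mean_variant_coverage": mean_depth,
    }


# Invented calls: one SNV, one deletion, one insertion.
calls = [
    Variant(pos=10, ref="A", alt="G", depth=120),
    Variant(pos=55, ref="AT", alt="A", depth=150),
    Variant(pos=90, ref="C", alt="CGG", depth=90),
]
summary = summarize_variants(calls)
print(summary)
```

The resulting average can then be compared with the 100x coverage threshold described above.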
Table 3. Summary of variant call analysis for strains with more than one database record.
Species | ATCC® no. | Existing Reference Genomes | NCBI assembly level (plasmids*) | # of SNPs | # of indels | Average coverage (variants)
Acinetobacter baumannii | 17978™ | GCA_001593425.2 / GCA_000015425.1 | Complete genome / Complete genome (2) | 14 / 118 | 5 / 656 | 210.1 / 152.7
Porphyromonas gingivalis | 33277™ | GCA_000010505.1 / GCA_002892575.1 | Complete genome | 20 / 24 | 7 / 8 | 319.5 / 323.8
Staphylococcus epidermidis | 12228™ | GCA_002215535.1 / GCA_000007645.1 | Complete genome (5) / Complete genome (6) | 56,346 / 66 | 2,328 / 35 | 181.2 / 129.5
Fusobacterium nucleatum | 25586™ | GCA_003019295.1 / GCA_000007325.1 | Complete genome | 29 / 49 | 14 / 22 | 310.4 / 289.7
Corynebacterium glutamicum | 13032™ | GCA_000011325.1 / GCA_000196335.1 | Complete genome | 18 / 88 | 2 / 62 | 216.7 / 175.0
Escherichia coli | 8739™ | GCA_000019385.1 / GCA_003591595.1 | Complete genome | 24 / 5 | 0 / 14 | 175.9 / 179.8
Bifidobacterium longum | 15697™ | GCA_000020425.1 / GCA_000269965.1 | Complete genome | 14 / 5 | 7 / 5 | 336.1 / 312.6
Vibrio campbellii | BAA-1116™ | GCA_000464435.1 / GCA_000017705.1 | Complete genome [2 chr] (1) | 198 / 26 | 336 / 47 | 143.0 / 107.3
Bacillus licheniformis | 14580™ | GCA_000008425.1 / GCA_000011645.1 | Complete genome | 17 / 14 | 4 / 5 | 174.4 / 201.7
Vibrio natriegens | 14048™ | GCA_001456255.1 / GCA_001680025.1 | Complete genome [2 chr] | 4 / 21 | 10 / 50 | 152.3 / 70.63
*Numbers in parentheses represent the number of plasmids reported in the NCBI assembly report.
To support the sequence variation observed in ATCC genome sequences from public databases and assess the quality of our sequences, we performed independent short-read sequencing in duplicate using different experimental variables (Table 4). We then measured the reproducibility of our analysis via the number of SNVs and indels detected and the level of variant coverage observed.
Table 4. Summary of the reference-based mapping analysis from multiple datasets.
Test | Species | ATCC® no. | Reference Genome | Analysis | # of SNPs | # of indels | Number of variants | Variant coverage
A | Mycoplasma hominis | 23114™ | GCA_000085865.1 | Preparation 1 | 14 | 10 | 24 | 1042.1
A | Mycoplasma hominis | 23114™ | GCA_000085865.1 | Preparation 2 | 14 | 10 | 24 | 900.0
A | Cutibacterium acnes | 11828™ | GCA_000231215.1 | Preparation 1 | 28 | 37 | 65 | 121.1
A | Cutibacterium acnes | 11828™ | GCA_000231215.1 | Preparation 2 | 28 | 39 | 67 | 128.6
B | Clostridium acetobutylicum | 824D-5™ | GCA_000008765.1 | Kit 1 | 171 | 55 | 226 | 95.9
B | Clostridium acetobutylicum | 824D-5™ | GCA_000008765.1 | Kit 2 | 170 | 55 | 225 | 202.0
B | Aeromonas hydrophila | 7966D-5™ | GCA_000014805.1 | Kit 1 | 1 | 1 | 2 | 216.8
B | Aeromonas hydrophila | 7966D-5™ | GCA_000014805.1 | Kit 2 | 1 | 1 | 2 | 203.0
C | Escherichia coli | 700926™ | GCA_000005845.2 | Extraction 1 | 0 | 1 | 1 | 137.0
C | Escherichia coli | 700926™ | GCA_000005845.2 | Extraction 2 | 0 | 1 | 1 | 186.3
C | Streptococcus pyogenes | 19615™ | GCA_000743015.1 | Extraction 1 | 2 | 44 | 46 | 314.2
C | Streptococcus pyogenes | 19615™ | GCA_000743015.1 | Extraction 2 | 2 | 41 | 43 | 460.2
Test A: Same DNA sequenced using two different DNA preparations
Test B: Same DNA sequenced with two different library kits
Test C: Same strain extracted with two different methods
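One simple way to read Table 4 is to compare the variant counts of each duplicate pair. The Python sketch below does this with the SNP and indel counts transcribed from the table; the names `TABLE_4` and `replicate_differences` are hypothetical. Note that equal counts are only a coarse indicator of reproducibility, since they do not show that the same variant positions were called in both runs.

```python
# (SNPs, indels) for the two runs of each strain, transcribed from Table 4.
TABLE_4 = {
    "Mycoplasma hominis 23114": ((14, 10), (14, 10)),
    "Cutibacterium acnes 11828": ((28, 37), (28, 39)),
    "Clostridium acetobutylicum 824D-5": ((171, 55), (170, 55)),
    "Aeromonas hydrophila 7966D-5": ((1, 1), (1, 1)),
    "Escherichia coli 700926": ((0, 1), (0, 1)),
    "Streptococcus pyogenes 19615": ((2, 44), (2, 41)),
}


def replicate_differences(table):
    """Return the absolute difference in (SNP, indel) counts between duplicate runs."""
    differences = {}
    for strain, (run_1, run_2) in table.items():
        differences[strain] = (abs(run_1[0] - run_2[0]), abs(run_1[1] - run_2[1]))
    return differences


diffs = replicate_differences(TABLE_4)
for strain, (snp_diff, indel_diff) in diffs.items():
    print(f"{strain}: SNP difference {snp_diff}, indel difference {indel_diff}")

max_snp_diff = max(d[0] for d in diffs.values())
max_indel_diff = max(d[1] for d in diffs.values())
```

For the six strains in Table 4, the duplicate runs differ by at most 1 SNP and at most 3 indels.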