Development and Evaluation of Next-generation Sequencing Standards for Virome Research

Closeup of purple virus spheres made up of long, iridescent spikes.

Optimize your virome research with the right controls

Authors: Juan Lopera, PhD;¹ Briana Benton, BS;¹ Jung-Woo Sohn, PhD;¹ Stephen King, MS;¹ Tasha M. Santiago-Rodriguez, PhD;² Emily B. Holister, PhD;² Matthew C. Wong, BS;³ Nadim Ajami, PhD;³ and Cara Wilder, PhD¹

¹ATCC, Manassas, VA 20110
²Diversigen, Houston, TX 77021
³Baylor College of Medicine–Alkek Center for Metagenomics and Microbiome Research, Houston, TX, 77030

Introduction
Development of Mock Viral Communities
Using the Whole Virus Standard
Using the Nucleic Acid Virome Standard
Conclusions
References

Abstract

ATCC has developed standardized reference materials for virome research, viral diagnostics, and next-generation sequencing assay optimization. These standards were developed as whole virus and nucleic acid preparations comprising 6 clinically relevant human viral pathogens of diverse genome type, size, organization, and envelope properties. Here, we describe the development and quality control of the ATCC virome standards and present application data demonstrating their use throughout the different stages of a typical virome analysis workflow.

Download a PDF of this application note

Download Now

Introduction

The natural human virome is the collection of eukaryotic viruses, bacteriophages, and endogenous retroviruses found in and within the human body. While it is recognized that the human virome plays an important role in human health and disease, the composition and function of this community, as well as the interactions that occur between viral species and host cells are not well known. By understanding these complex dynamics, researchers are offered a powerful tool for uncovering pathogenesis mechanisms and developing novel preventive and therapeutic applications.^1-3

Next-generation sequencing (NGS) technologies have enabled shotgun metagenome sequencing of microbial communities on a large scale at an affordable cost, and they have become the gold-standard method used for virome analyses and molecular diagnostics.^4-5 However, despite the significant advancements in these technologies and related data analysis tools, virome research is still challenged by the lack of phylogenetically conserved regions in viruses (akin to the 16S rRNA gene typically used for bacterial and archaeal taxonomic classification), low abundance of viral genome copies, viral sequence variability, and host sequence diversity and abundance.^1,^6-9 The inherent biases that may arise at each stage of the NGS virome analysis workflow further complicate these issues (Figure 1).

To address these challenges, ATCC created whole virus (ATCC MSA-2008) and nucleic acid (ATCC MSA-1008) virome standards that can be used for assay standardization and reproducibility, or as benchmarks for analysis tools. In the following study, we describe the development and validation of these standards and we present application data from several proof-of-concept studies performed by ATCC and 2 external laboratories. Our data demonstrate that virome standards provide a valuable resource for the scientific community that enable extensive benchmarking and comparative evaluation of different sampling methods, NGS approaches, and bioinformatics tools currently used in virome and diagnostic research.

Figure 1. Standard NGS virome analysis workflow highlighting some of the major challenges reported in virome studies.

Development of mock viral communities

ATCC virome standards are fully sequenced, characterized, and authenticated mock viral communities that were developed as whole virus or nucleic acid preparations. These standards comprise 6 viral families (Orthomyxoviridae, Flaviviridae, Pneumovirinae, Reoviridae, Adenoviridae, and Herpesviridae) that were selected on the basis of their clinical relevance and diverse characteristics including genome size, genome type (DNA or RNA), and virion structure (Table 1).

Human and animal cell lines were used to propagate, isolate, and purify whole viruses; nucleic acids (DNA and RNA) were isolated from a preparation of cell lysate and supernatant from specific cell lines Table 1). Following individual viral purification and DNA and RNA isolation and quantitation, whole virus and nucleic acid preparations were mixed in equal proportions based on the genome copy number; the whole virus mix contains approximately 2 × 10³genome copies/μL per virus, and the nucleic acid mix contains approximately 2 × 10⁴ genome copies/μL per virus. Genome copy number was measured via specific fluorescent probes and Droplet Digital PCR (ddPCR; Bio-Rad). After production, both virome standards were evaluated by ddPCR to confirm that all 6 viruses could be detected and that the genome copy number for each virus was close to the expected value (Figure 2). The difference between the observed and expected genome copies for the whole virus standards could be due to differences in viral structure, genome length, or the number of genome segments, which affect nucleic acid extraction and amplification efficiency (Table 1).

Table 1. Selection attributes for strains included in the ATCC Virome Standards.

Virus name	ATCC number	Genome type	Host (ATCC number)*	Virion structure	Reference GenBank ID	Genome size (Kbp)	Relevance
Human herpesvirus 5	VR-538	ds DNA	MRC-5 (CCL-171)	Enveloped	X17403.1	229.4	Ubiquitous infection in adult humans, and significant pathogen within immunocompromised populations¹⁰
Human adenovirus 40	VR-931	ds DNA	HEK-293 (CRL-1573)	Unenveloped	NC_001454.1	34.2	Human gastrointestinal infection and severe infection in children and immunocompromised patients¹¹
Influenza B virus B/Florida/4/2006	VR-1804	ss (-) RNA (8 segments)	SPF embryonated chicken eggs	Enveloped	CY018365.1-CY018372.1	14.2	Causes worldwide human epidemics of influenza with high rates of illness and death¹²
Zika virus	VR-1838	ss (+) RNA	Vero (CCL-81)	Enveloped	KX830960.1	10.8	Mosquito-borne viral infection that can cause congenital microcephaly in fetuses and infants¹³
Human respiratory syncytial virus	VR-1540	ss (-) RNA	HEp-2 (CCL-23)	Enveloped	KT992094.1	15.2	Causes severe respiratory tract infections in humans¹⁴
Reovirus 3	VR-824	ds RNA (10 segments)	LLC-MK2 Derivative (CCL-7.1)	Capsids	HM159613.1-HM159622.1	23.6	Human respiratory and gastrointestinal infection; oncolytic virus^15,16

*ATCC cell line used for viral propagation and nucleic acid isolation.

Figure 2 - Development and Evaluation of Next-Generation Sequencing Standards

Figure 2. Quantification of estimated genome copy number for each virus in the (A) nucleic acid (ATCC MSA-1008) and (B) whole virus (ATCC MSA-2008) standards. Primers and probes were designed for individual strains and ddPCR assays were performed in triplicate with 3 different dilution factors. Data represent the mean genome copies.

Using the whole virus standard to evaluate and standardize virome workflow methodologies

To demonstrate the application of virome standards in the evaluation and standardization of a full virome analysis workflow, we compared the results obtained from 2 different laboratories that processed and analyzed the whole virus virome standard (ATCC MSA-2008). To remove the biases associated with variations between data analysis platforms, we analyzed and compared the sequence data obtained from both laboratories with the same bioinformatics program: VirMAP (developed at Baylor College of Medicine).⁷ VirMAP offers an efficient and accurate solution to determine the correct taxonomy and viral abundance from a complex sample by using both nucleotide and protein information to taxonomically classify viral NGS data following individual viral genome reconstruction.⁷

The results produced from Lab 1 and Lab 2 demonstrated that both analysis workflows were able to detect all 6 viral species in the whole virus mix. However, the viral abundance significantly differed between the 2 studies (Figure 3). Lab 1 used a strategy that included independent steps for library preparation, sequencing, and analysis of DNA and RNA viruses; this strategy produced viral abundances close to the expected values (Figure 3A). In contrast, Lab 2 used the virome standard to design and standardize a different approach that included the use of a single library kit, sequencing run, and analysis to efficiently evaluate both DNA and RNA viruses in the same preparation; this strategy detected less of the RNA viruses as compared to the DNA viruses. The variability observed between the 2 laboratories could be due to the use of different nucleic acid extraction kits, library kits, or sequencing methodology. These results demonstrate the need and utility of the whole virus virome standard (ATCC MSA-2008) as a reference material for facilitating the design and optimization of each stage of a virome workflow.

Figure 3 - Development and Evaluation of Next-Generation Sequencing Standards

Figure 3. Comparison of 2 different experimental approaches for evaluating nucleic acid extraction, library preparation, sequencing, and data analysis by using the whole virus virome standard (ATCC MSA-2008). The sequence files obtained from the 2 laboratories were analyzed using VirMAP.⁷ The total viral reads that were classified as viruses were normalized by individual viral genome size (Table 1); the individual viral abundance is presented as the normalized percent of read.

Using the nucleic acid virome standard to evaluate data analysis tools for viral metagenomic studies

A critical step and major challenge in virome data analysis is the selection of the right bioinformatic tools and viral reference databases required to effectively identify viral sequences within a complex mixture of other microorganism- and host-associated sequences.^7,8 The removal of contaminating reads originating from the host and other organisms through bioinformatics is a frequently used approach that typically focuses on mapping and filtering reads based on their alignment against reference sequences. However, this approach is limited by the availability of known and precise reference genomes.^17,18 To that end, we used ATCC virome standards to benchmark and compare virome analyses by using 2 bioinformatics tools (VirMAP and Autometa) that incorporate innovative approaches that do not require known reference sequences. Briefly, using the nucleic acid virome standard (ATCC MSA-1008) and a similar approach evaluated for Lab 1 (Figure 3), we created separate DNA and RNA libraries and produced independent DNA and RNA shotgun metagenome sequences. We then analyzed the same datasets with the 2 bioinformatic tools. As described above, the VirMAP pipeline uses an approach based on information from nucleotide and protein databases for taxonomic classification of NGS reads and individual viral genome reconstruction.⁷ In contrast, Autometa is a metagenome binning pipeline that splits de novo–assembled contigs into kingdom bins, allowing for the separation of virus contigs from other microorganism and host-derived sequences by examining the lowest common ancestor (LCA) of predicted proteins in each contig to assign the taxonomic classification.¹⁹

Bioinformatic analysis using VirMAP efficiently separated host-associated reads from viral sequences and enabled the calculation of individual viral abundance for both DNA and RNA datasets obtained from the nucleic acid virome standard (ATCC MSA-1008). For shotgun DNA sequencing, 34.2% (9.0 million) of the total reads (26.4 million) were specific to the 2 DNA viral genomes, and the remaining 65.8% (17.4 million) of reads were associated with DNA from the host and other microorganisms. For RNA sequencing, 8.4% (3.3 million) of the total reads (39.2 million) were specific to the 4 RNA viruses, and the remaining 91.6% (35.9 million) were associated with RNA from the host and other microorganisms (Figure 4).

Regarding viral abundance, results from VirMAP showed individual values close to the expected percentages. Here, shotgun DNA sequencing showed a relative abundance of 46.3% for herpesvirus and 53.7% for adenovirus, and shotgun RNA sequencing showed a relative abundance of 38.3% for Zika virus, 19.7% for human respiratory syncytial virus, 20.8% for reovirus, and 21.8% for influenza B virus (Figure 4). Similar to VirMAP, bioinformatic analysis using Autometa was able to properly taxonomically classify DNA and RNA viral sequences from the nucleic acid virome standard (Figure 5). Additionally, both Automata and VirMAP were able to separate individual viral genomes, which enabled the evaluation of NGS depth, genome assembly completeness, and other important viral metagenome measures that researchers need in order to facilitate the standardization and optimization of diagnostic and functional characterization of complex virome samples (Figure 5). Together, these results exemplify and demonstrate the utility of the nucleic acid virome standard in benchmarking, comparing, and selecting an appropriate bioinformatics tool for viral microbiome analysis while also enabling the optimization of NGS parameter runs for the complete genome characterization of virome samples.

Figure 4 - Development and Evaluation of Next-Generation Sequencing Standards

Figure 4. Shotgun metagenome profiling of the nucleic acid virome standard (ATCC MSA-1008). VirMAP analysis results from (A) Illumina DNA shotgun sequencing and (B) Illumina RNA shotgun sequencing of the standard showed efficient host-associated and viral sequencing separation, and the individual viral profiling in the mix was close to the expected abundances.

Figure 5A - Development and Evaluation of Next-Generation Sequencing Standards

Figure 5B - Development and Evaluation of Next-Generation Sequencing Standards

Individual genome assembly statistics of DNA viruses
Measure	Human herpesvirus 5	Human mastadenovirus F
Genome zise¹ (kbp)	229.4	34.2
No. of genome contigs²	6	5
Assembly length (kbp)	212.7	35.0
Assembly N₅⁰ (kbp)	52.1	24.8
Assembly coverage (%GC)	2761 (57.2)	2995 (51.2)
Genome completeness³	96.4%	97.0%

Individual genome assembly statistics of RNA viruses
Measure	Influenza B virus	Zika virus	HRSV	Reovirus 3
Genome zise¹ (kbp)	14.2	10.8	15.2	23.6
No. of genome contigs²	8	1	1	9
Assembly length (kbp)	16.6	10.9	15.3	23.5
Assembly N₅⁰ (kbp)	0.2234	10.88	15.31	0.2231
Assembly coverage (%GC)	2244 (40.6)	3903 (51.0)	2001 (33.2)	2057 (47.0)
Genome completeness³	100.0%	100.0%	100.0%	90.0%

1 = Expected based on the reference GenBank ID from Table 1.
2 = contigs ≥1 kbp
3 = Estimated completeness [(number of CDS estimated in reference GenBank / number of CDS estimated in individual viral bin) × 100]. Number of CDS were estimated using Prokka software tool

Figure 5. Shotgun metagenome binning analysis of the nucleic acid virome standard (ATCC MSA-1008). Autometa analysis results and individual viral genome assembly measured from (A) Illumina DNA shotgun sequencing and (B) Illumina RNA shotgun sequencing of the virome standard. Metagenomic binning plots represents 2-dimensional clustering (%GC and coverage components) of individual de novo assembly contigs (colored spots represent kingdom bins and stars show viral bins).

Conclusions

Overall, these proof-of-concept studies highlight the utility and importance of virome standards in the optimization of sample processing, sequencing, and bioinformatics methods throughout a virome analysis workflow. Due to the complexity of virome sample processing and sequencing, significant challenges can be posed by biases introduced during sample preparation, nucleic acid extraction, library preparation, sequencing, or data interpretation. By incorporating standards in the assay development and validation processes, researchers are offered a comprehensive solution for standardizing data from a wide range of sources and generating consensus among viral microbiome studies.

Acknowledgments

We would like to thank Dr. Jason Kwan at the University of Wisconsin-Madison for his support during the installation and application of the Autometa pipeline.

Download a PDF of this application note

Download Now

Products featured in this application note

The image shows a close-up view of several DNA double helix structures. The DNA strands are depicted in a light blue color against a blurred blue background, highlighting the twisted ladder-like appearance of the double helix.

Virome Nucleic Acid Mix

This product comprises an even mixture of nucleic acids prepared from fully sequenced, characterized, and authenticated viral strains selected on the basis of genomic size, DNA/RNA genome, envelope/non-envelope, and other special features.

Order now

Round, iridescent, green-blue H1N1 (swine flu) virus with protrusions.

Virome Virus Mix

This virome standard is a mock microbial community that mimics mixed metagenomic samples. This product comprises an even mixture of fully sequenced, characterized, and authenticated viral strains.

Purchase Now

Small, round, fluorescent green and gray herpes simplex cells.

Human herpesvirus 5

Viral strain propagated in MRC-5 cells that is known to cause cell enlargement, rounding, and sloughing. This strain has applications in virucide testing.

Buy Today

Green WI-38 primary lung fibroblast cell culture.

MRC-5

This cell line was derived from normal human lung tissue. It has applications in efficacy testing and as a transfection host.

Order Now

Human mastadenovirus F

Viral strain isolated from the feces of a child with gastroenteritis. This product has applications in enteric disease research.

Buy Now

HEK-293

This cell line is derived from the human kidney. It has applications in high-throughput screening and toxicology testing.

Buy the Cells

Iridescent influenza virus spheres with protruding spikes

Influenza B virus (Yamagata Lineage)

This viral strain was isolated from a human in Florida in 2006. It is propagated in SPF embryonated chicken eggs and is known to cause the hemagglutination of red blood cells.

Order Today

Zika virus

This strain of Zika virus was isolated in the blood of a rhesus monkey infected in Uganda. This strain was tissue culture-adapted to grow in Vero cells.

Buy the Virus

Vero

This epithelial cell line was derived from the kidneys of Cercopithecus aethiops. Vero cells have numerous applications, including use as a transfection host, efficacy testing, and media testing.

Order Now

Spherical, bright, fluorescent blue, purple, green, and red nasal epithelial RSV cells.

Human respiratory syncytial virus

This viral strain was isolated from the lower respiratory tract of infant with bronchiolitis and bronchopneumonia in Melbourne.

Buy Now

Fluourescent blue, red, and green respiratory synctial virus.

HEp-2

Human epithelial cells derived from carcinoma. This cell line is susceptible to Human adenovirus 3, Human poliovirus 1, and Vesicular stomatitis virus.

Order Today

Reovirus 3

Reovirus 3 strain Dearing was isolated from a child with diarrhea. This strain is propagated in LLC-MK2 Derivative cells and results in cell sloughing and granulation.

Buy Now

Red single plaque owl monkey kidney cell.

LLC-MK2 Derivative

This cell line was derived from the kidneys of rhesus monkeys. It is susceptible to Human poliovirus 1-3 and is known to express keratin.

Buy the Cells

References

Santiago-Rodriguez TM, Hollister EB. Human Virome and Disease: High-Throughput Sequencing for Virus Discovery, Identification of Phage-Bacteria Dysbiosis and Development of Therapeutic Approaches with Emphasis on the Human Gut. Viruses 11(7): pii: E656, 2019.
Beller L, Matthijnssens J. What is (not) known about the dynamics of the human gut virome in health and disease. Curr Opin Virol 37: 52-57, 2019.
Carding SR, Davis N, Hoyles L. Review article: the human intestinal virome in health and disease. Aliment Pharmacol Ther 46(9): 800-815, 2017.
Gu W, Miller S, Chiu CY. Clinical Metagenomic Next-Generation Sequencing for Pathogen Detection. Annu Rev Pathol 14: 319-338, 2019.
Datta S, et al. Next-generation sequencing in clinical virology: Discovery of new viruses. World J Virol 4(3): 265-276, 2015.
Lambert C, et al. Considerations for Optimization of High-Throughput Sequencing Bioinformatics Pipelines for Virus Detection. Viruses 10(10): pii: E528, 2018.
Ajami NJ, et al. Maximal viral information recovery from sequence data using VirMAP. Nat Commun 9(1): 3205, 2018.
Sutton TDS, et al. Choice of assembly software has a critical impact on virome characterisation. Microbiome 7(1): 12, 2019.
Wylie KM, Weinstock GM, Storch GA. Emerging view of the human virome. Transl Res 160(4): 283-290, 2012.
Jackson JW, Sparer T. There Is Always Another Way! Cytomegalovirus’ Multifaceted Dissemination Schemes. Viruses 10(7): 383, 2018.
Chen S, Tian X. Vaccine development for human mastadenovirus. J Thorac Dis 10(Suppl 19): S2280-S2294, 2018.
Krammer F, et al. Influenza. Nat Rev Dis Primers 4(1): 3, 2018.
Musso D, Ko AI, Baud D. Zika Virus Infection–After the Pandemic. N Engl J Med 381(15): 1444-1457, 2019.
Rossey I, Saelens X. Vaccines against human respiratory syncytial virus in clinical trials, where are we now? Expert Rev Vaccines 18(10): 1053-1067, 2019.
Besozzi M, et al. Host range of mammalian orthoreovirus type 3 widening to alpine chamois. Vet Microbiol 230: 72-77, 2019.
Kemp V, et al. Characterization of a replicating expanded tropism oncolytic reovirus carrying the adenovirus E4orf4 gene. Gene Ther 25(5): 331-344, 2018.
Miller JR, et al. A host subtraction database for virus discovery in human cell line sequencing data. F1000Res 7: 98, 2018.
Daly GM, et al. Host Subtraction, Filtering and Assembly Validations for Novel Viral Discovery Using Next Generation Sequencing Data. PLoS One 10(6): e0129059, 2015.
Miller IJ, et al. Autometa: automated extraction of microbial genomes from individual shotgun metagenomes. Nucleic Acids Res 47(10): e57, 2019.

Development and Evaluation of Next-generation Sequencing Standards for Virome Research

Optimize your virome research with the right controls

Abstract

Introduction

Development of mock viral communities

Using the whole virus standard to evaluate and standardize virome workflow methodologies

Using the nucleic acid virome standard to evaluate data analysis tools for viral metagenomic studies

Conclusions

Acknowledgments

Products featured in this application note

Virome Nucleic Acid Mix

Virome Virus Mix

Human herpesvirus 5

MRC-5

Human mastadenovirus F

HEK-293

Influenza B virus (Yamagata Lineage)

Zika virus

Vero

Human respiratory syncytial virus

HEp-2

Reovirus 3

LLC-MK2 Derivative

References

Please subscribe me to the following: