Hybrid assembly is a state-of-the-art technique that combines highly accurate Illumina short reads with ultra-long ONT reads, which serve as scaffolds.13 For bacterial assemblies, this technique uses Unicycler, which begins with an optimized Illumina assembly. The longest of these Illumina-based contigs are then assembled alongside the ONT reads; this combined assembly then undergoes multiple rounds of both long-read and short-read polishing.8 Because occasional sequencing and assembly artifacts appear as small contigs in the final assembly (so-called “chaff” contigs14), non-contiguous contigs shorter than 1000 bp with low relative depth are removed to produce the final assembly. Prior to bacterial genome assembly, the genome size is estimated using Mash, and the high-quality Illumina and ONT reads are down-sampled to 150X and 30X depth, respectively, based on the estimated genome size.
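The chaff-removal rule described above can be sketched as a simple filter. This is a minimal illustration, not the pipeline's actual implementation: the `Contig` class, the `connected` flag (standing in for "non-contiguous"), the choice of the longest contig as the depth reference, and the `min_relative_depth` cutoff of 0.5 are all assumptions made for the example. Only the 1000 bp length threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int       # contig length in bp
    depth: float      # mean read depth over the contig
    connected: bool   # True if linked to other contigs in the assembly graph


def remove_chaff(contigs, min_length=1000, min_relative_depth=0.5):
    """Drop unconnected contigs that are both short and low in relative depth.

    min_relative_depth is a hypothetical cutoff; depth is taken relative to
    the longest contig, which is an assumption of this sketch.
    """
    if not contigs:
        return []
    reference_depth = max(contigs, key=lambda c: c.length).depth
    kept = []
    for c in contigs:
        relative_depth = c.depth / reference_depth if reference_depth > 0 else 0.0
        is_chaff = (
            not c.connected
            and c.length < min_length
            and relative_depth < min_relative_depth
        )
        if not is_chaff:
            kept.append(c)
    return kept


example = [
    Contig("chromosome", 5_000_000, 100.0, True),
    Contig("small_plasmid", 800, 300.0, False),   # short but high depth: kept
    Contig("artifact", 400, 5.0, False),          # short, low depth, unconnected
    Contig("short_linked", 500, 5.0, True),       # connected: kept
]
kept_names = [c.name for c in remove_chaff(example)]
```

Requiring all three conditions together means that short, high-depth replicons such as small plasmids are retained.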
For fungal assemblies, we down-sample the reads as for the bacterial assemblies, and then use the MaSuRCA pipeline with the Flye assembler.15 MaSuRCA is a hybrid assembly algorithm that combines Illumina and ONT reads to construct long and accurate mega-reads; it was chosen for its strengths with large genomes.9
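The down-sampling step amounts to simple arithmetic on the estimated genome size. The sketch below shows one way to derive the fraction of reads to keep; the function name and the example numbers are illustrative assumptions, while the 150X and 30X targets are taken from the text.

```python
ILLUMINA_TARGET_DEPTH = 150  # from the text
ONT_TARGET_DEPTH = 30        # from the text


def downsample_fraction(total_bases, genome_size, target_depth):
    """Return the fraction of reads to keep so mean depth is about target_depth.

    total_bases: sum of all read lengths in the read set
    genome_size: estimated genome size in bp (e.g. from a Mash estimate)
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    current_depth = total_bases / genome_size
    if current_depth <= target_depth:
        return 1.0  # already at or below target: keep every read
    return target_depth / current_depth


# Hypothetical 5 Mbp genome with 1.5 Gbp of Illumina data (300X) and
# 100 Mbp of ONT data (20X).
illumina_fraction = downsample_fraction(1_500_000_000, 5_000_000, ILLUMINA_TARGET_DEPTH)
ont_fraction = downsample_fraction(100_000_000, 5_000_000, ONT_TARGET_DEPTH)
```

In this example half of the Illumina reads would be retained, while the ONT set is below its target and is kept whole.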
Genome Assembly Quality Control
Illumina Read Set Coverage
Although the depth of Illumina reads required is influenced by numerous factors (including, but not limited to, microbial strains),16,17 Illumina read sets should be sufficient to cover the entire genome to obtain the most accurate base determination.18 To account for variance in distribution of coverage per base, we require a minimum of 100X average depth for Illumina reads.
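As an illustration of the coverage requirement, the following sketch computes the average depth from a list of per-base depths and applies the 100X minimum. The function names are hypothetical, and the breadth value (the fraction of positions covered at least once) is included only to show how uneven coverage can hide behind an acceptable mean; it is not a criterion stated in the text.

```python
def coverage_summary(per_base_depth):
    """Return (mean depth, breadth) for a list of per-position read depths."""
    n = len(per_base_depth)
    if n == 0:
        raise ValueError("no positions supplied")
    mean_depth = sum(per_base_depth) / n
    breadth = sum(1 for d in per_base_depth if d > 0) / n
    return mean_depth, breadth


def passes_illumina_depth(per_base_depth, min_mean_depth=100.0):
    """Apply the minimum average depth requirement (100X in the text)."""
    mean_depth, _ = coverage_summary(per_base_depth)
    return mean_depth >= min_mean_depth


# Toy four-position "genome": the mean is 100X even though one base is uncovered.
toy_depths = [120, 80, 0, 200]
toy_mean, toy_breadth = coverage_summary(toy_depths)
```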
Bacterial Completeness and Contamination
To ensure our bacterial assembly process has correctly captured the entirety of a given strain’s genome, and to confirm the absence of contamination in the assembly, we pass finalized assemblies through CheckM.19 Briefly, CheckM uses a set of Hidden Markov Models (HMMs) from phylogenetically close bacterial and archaeal reference genomes to determine whether the query assembly contains all expected HMMs as predicted by the reference genomes (a percentage called “CheckM completeness”), and it evaluates what percentage of the query’s HMMs differ in copy number or come from reference genomes that are phylogenetically distant (called “CheckM contamination”). We require final assemblies to have completeness values ≥ 95% and contamination values ≤ 5% (i.e., within the margin of error for 100% completeness and 0% contamination, which marks them as excellent reference sequences according to the authors of CheckM).
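To make the two CheckM percentages concrete, the sketch below computes simplified versions of them from marker copy counts and applies the thresholds above. This is a deliberately reduced illustration and not CheckM's actual algorithm, which works with lineage-specific, collocated marker sets; the function names and the input dictionary are assumptions.

```python
def simplified_marker_stats(marker_copies):
    """Estimate completeness and contamination from marker copy counts.

    marker_copies maps each expected marker to the number of copies found.
    Simplification (assumption): every marker is weighted equally.
    """
    total = len(marker_copies)
    if total == 0:
        raise ValueError("no markers supplied")
    present = sum(1 for n in marker_copies.values() if n >= 1)
    extra_copies = sum(n - 1 for n in marker_copies.values() if n > 1)
    completeness = 100.0 * present / total
    contamination = 100.0 * extra_copies / total
    return completeness, contamination


def passes_bacterial_qc(completeness, contamination,
                        min_completeness=95.0, max_contamination=5.0):
    """Apply the acceptance thresholds stated in the text."""
    return completeness >= min_completeness and contamination <= max_contamination


# 100 hypothetical markers: 98 single-copy, one duplicated, one missing.
markers = {f"marker_{i}": 1 for i in range(98)}
markers["marker_dup"] = 2
markers["marker_missing"] = 0
good_stats = simplified_marker_stats(markers)
```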
Mycology Completeness and Contamination
For mycology genomes, we estimate completeness using BUSCO.20 BUSCO is a tool and accompanying database, widely used in the mycology field, that quantifies completeness by checking for the presence of a selection of universal single-copy orthologs. We use fungi-specific databases in which each ortholog must be identified in at least 90% of the fungal species, and no single-copy ortholog can be entirely missing from any sub-clade in the databases. Unlike CheckM, BUSCO does not calculate % contamination. We require fungal assemblies to have a completeness value of ≥ 80%.
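BUSCO reports orthologs as complete single-copy, complete duplicated, fragmented, or missing, and its completeness figure is the share of complete orthologs (single-copy plus duplicated). The sketch below applies that calculation and the 80% threshold; the function names and the example counts are hypothetical.

```python
def busco_completeness(single_copy, duplicated, fragmented, missing):
    """Return percent completeness: complete orthologs over all searched."""
    total = single_copy + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no orthologs searched")
    return 100.0 * (single_copy + duplicated) / total


def passes_fungal_qc(completeness, min_completeness=80.0):
    """Apply the fungal completeness threshold stated in the text."""
    return completeness >= min_completeness


# Hypothetical result: 700 single-copy, 8 duplicated, 20 fragmented, 30 missing.
example_completeness = busco_completeness(700, 8, 20, 30)
example_pass = passes_fungal_qc(example_completeness)
```

Fragmented orthologs do not count toward completeness in this calculation, so a heavily fragmented assembly can fail the threshold even when few orthologs are entirely missing.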
Viral Genome Assembly and Quality Assessment
As viruses are co-cultured with their host, viral DNA or RNA sequencing data may contain reads from both the host and the virus, and de novo assemblies may contain contigs from both the host and viral genomes.21 To produce an assembly containing contigs of a single virus, host reads can be removed or contigs can be binned taxonomically.22 Taxonomic binning is performed by aligning reads or contigs to One Codex’s curated NCBI Reference Sequence Database. Reads or contigs that align to “cellular organisms” are binned as nonviral, while those that do not are binned as viral. In our approach, high-quality viral reads are assembled de novo using SPAdes,23 and the contig binning approach is then used to obtain a complete assembly for a single virus. Contigs that align to the Escherichia coli bacteriophage Phi-X 174 genome are excluded, as this genome is used as a DNA spike-in for Illumina sequencing.
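The contig binning rule can be sketched as follows, assuming each contig has already been assigned a taxonomic lineage by an upstream alignment step. The lineage lists, the `PHIX_TAXON` label, and the bin names are hypothetical placeholders; the actual labels returned by the reference database may differ.

```python
CELLULAR = "cellular organisms"
PHIX_TAXON = "Escherichia virus phiX174"  # hypothetical label for the spike-in


def bin_contigs(contig_lineages):
    """Sort contigs into viral, nonviral, and spike-in bins.

    contig_lineages maps a contig name to its lineage, given as a list of
    taxon names from root to leaf (assumed input format).
    """
    bins = {"viral": [], "nonviral": [], "spike_in": []}
    for name, lineage in contig_lineages.items():
        if PHIX_TAXON in lineage:
            bins["spike_in"].append(name)   # excluded from the final assembly
        elif CELLULAR in lineage:
            bins["nonviral"].append(name)   # host or other cellular sequence
        else:
            bins["viral"].append(name)
    return bins


example_lineages = {
    "contig_1": ["Viruses", "Orthomyxoviridae", "Influenza A virus"],
    "contig_2": ["cellular organisms", "Eukaryota", "Gallus gallus"],
    "contig_3": ["Viruses", "Microviridae", "Escherichia virus phiX174"],
}
example_bins = bin_contigs(example_lineages)
```

The spike-in check runs first because Phi-X 174 would otherwise fall into the viral bin.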
In addition to the problem of taxonomic binning, viral genomes are diverse in structure, with many viruses having multipartite (segmented) genomes; the genome of Influenza A virus, for example, consists of eight separate RNA segments.24 To determine whether an assembly contains all the necessary segments, a curated database of complete viral genomes and segment information was constructed. After taxonomic binning, contigs are aligned to the Viral Genomes-NCBI-NIH database to assign segment labels, segment depth, and percent identity to the closest reference.
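A segment completeness check of this kind can be sketched as a comparison between the segments expected for a virus and the segment labels assigned to the contigs. The `SegmentHit` class, the `EXPECTED_SEGMENTS` table, and the example values are illustrative assumptions and do not reflect the structure of the curated database; the eight Influenza A segment names are the standard ones.

```python
from dataclasses import dataclass


@dataclass
class SegmentHit:
    contig: str
    segment: str             # segment label of the closest reference
    percent_identity: float  # identity to the closest reference
    depth: float             # mean read depth over the contig


# Hypothetical lookup table standing in for the curated segment database.
EXPECTED_SEGMENTS = {
    "Influenza A virus": ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"],
}


def missing_segments(virus, hits):
    """Return the expected segments that no contig was labeled with."""
    found = {hit.segment for hit in hits}
    return [seg for seg in EXPECTED_SEGMENTS[virus] if seg not in found]


# Example assembly in which the NS segment was not recovered.
example_hits = [
    SegmentHit(f"contig_{i}", seg, 99.0, 250.0)
    for i, seg in enumerate(["PB2", "PB1", "PA", "HA", "NP", "NA", "M"], start=1)
]
example_missing = missing_segments("Influenza A virus", example_hits)
```

An empty result indicates that every expected segment is represented by at least one contig.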