Whole-Genome Genotyping

Bioinformatic pipeline overview

Whole genome sequencing (WGS) is a comprehensive genomic technique that entails deciphering an organism's complete DNA sequence, offering unparalleled insights into genetic variation and genotyping. By identifying single nucleotide polymorphisms (SNPs), insertions, and deletions across the entire genome, WGS enables precise variant calling and genotyping. This provides an all-encompassing view of the genomic landscape.

bioinformatic-workflow

Report overview

Upon project completion, two interactive summary reports will be shared. The main report contains a high-level overview and the key results from your project, including total high-quality variant counts and the correlation between samples. The supplementary report contains in-depth details on the quality metrics and findings from individual steps in the bioinformatic pipeline.

The bioinformatics reports for your project are generated using MultiQC and custom GIFS analytical modules.

The report offers user-friendly navigation on the left for accessing different sections, while the right side provides customization options for report appearance, figure/data export, and sample filtering. The interactive figures enhance data exploration by offering additional details when you hover your mouse over specific data points, including sample names and values. Furthermore, hovering over column names provides comprehensive descriptions of the measurements.

You can also switch from 'OPAL_ID' labels to the 'Sample Names' you provided. The button for this is located just above the 'Key Project Results' summary table and is shown below:

sample-button

Key Project Results

This table summarizes the key metrics from your project, including total SNP and indel counts, sequencing coverage, percentage of mapped reads, and total number of sequencing reads. As you evaluate your project, here are a few items to keep in mind:

- Variant Count Variability: The number of identified variants can vary based on factors such as the genomic reference used, sample origin, and sequencing read depth. The reported numbers represent high-quality variants that have met filtering criteria.
- Mapping Percentage: A substantial proportion of reads should align to your reference genome, although this can be influenced by the genetic divergence of your organism. Generally, a mapping rate above 90% is expected for whole-genome alignment.

key-metrics-table
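
As a rough illustration of the mapping percentage described above, the sketch below computes the rate from mapped and total read counts and lists samples that fall under the 90% guideline. The function names, sample names, and read counts are hypothetical and are not taken from the pipeline itself.

```python
def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Return the percentage of reads that aligned to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def flag_low_mapping(samples: dict[str, tuple[int, int]],
                     threshold: float = 90.0) -> list[str]:
    """List sample names whose mapping rate falls below the threshold."""
    return [
        name
        for name, (mapped, total) in samples.items()
        if mapping_rate(mapped, total) < threshold
    ]


# Hypothetical read counts per sample: (mapped reads, total reads)
example = {
    "sample_A": (9_600_000, 10_000_000),
    "sample_B": (8_200_000, 10_000_000),
}
low = flag_low_mapping(example)
print(low)
```

A sample flagged this way is not necessarily a failed sample; as noted above, a divergent organism can lower the rate even when the data are sound.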

Read Alignment

Preprocessing Sequencing Reads

Short-read paired-end sequencing data is produced using the Illumina NovaSeq 6000 platform. To ensure data quality, the raw sequencing reads undergo a quality assessment using fastp. During this process, low-quality reads are excluded, and any undesired sequences are trimmed before alignment.

In your report, you will find comprehensive filtering metrics that detail both the number and relative percentages of individual sequencing reads that have undergone this quality control process.

fastp-figure
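
The relative percentages in the filtering metrics can be reproduced from the raw read counts. The sketch below assumes a simple mapping from filtering category to read count; the category names and the numbers are illustrative placeholders rather than values from a real fastp run.

```python
def filtering_summary(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-category read counts into percentages of all reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to summarize")
    return {
        category: round(100.0 * n / total, 2)
        for category, n in counts.items()
    }


# Hypothetical read counts per filtering outcome
example_counts = {
    "passed_filter": 9_500_000,
    "low_quality": 300_000,
    "too_many_N": 50_000,
    "too_short": 150_000,
}
summary = filtering_summary(example_counts)
for category, percent in summary.items():
    print(f"{category}: {percent}%")
```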

Mapping Sequencing Reads to Reference

Sequencing reads that meet our filtering criteria are aligned to a reference genomic sequence using the BWA aligner. In a comparative study (Kumaran et al., 2019), BWA demonstrated superior sensitivity and precision compared to other alignment tools, particularly in the accurate identification of single nucleotide variants.

The specific reference genomic sequence used in your analysis will be clearly specified in your report.

samtools-figure
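
For readers who want to reproduce a comparable alignment step, the sketch below assembles a typical paired-end BWA command piped into a coordinate sort with samtools. The use of the `bwa mem` subcommand, the thread count, and all file names are assumptions made for illustration; the exact parameters used for your project are not stated here. The commands are only assembled and printed, not executed.

```python
import shlex


def bwa_mem_command(reference: str, reads_1: str, reads_2: str,
                    threads: int = 4) -> list[str]:
    """Build the argument list for a paired-end `bwa mem` alignment."""
    return ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2]


def samtools_sort_command(output_bam: str, threads: int = 4) -> list[str]:
    """Build the argument list for sorting alignments read from stdin."""
    return ["samtools", "sort", "-@", str(threads), "-o", output_bam, "-"]


# Hypothetical file names
align = bwa_mem_command("reference.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz")
sort = samtools_sort_command("sample.sorted.bam")

# Shell-style view of the two steps joined by a pipe
pipeline = f"{shlex.join(align)} | {shlex.join(sort)}"
print(pipeline)
```

Keeping each command as an argument list makes it straightforward to pass to `subprocess` later without shell quoting problems.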

Post Alignment Processing

Remove low-quality aligned reads

Sequencing reads that have been successfully aligned to the reference genome are further subjected to quality filtering using samtools. This filtering process identifies and removes reads with low-quality alignment scores.

Specifically, reads that map to multiple locations in the genome, causing ambiguity in determining their true origin, are excluded. This step is crucial to avoid erroneous variant calls, as it eliminates the uncertainty associated with variants that could be attributed to multiple genomic locations.
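
The idea behind this filter can be sketched on plain SAM records. The example below keeps a read only if it is mapped, is not a secondary alignment, and has a mapping quality (MAPQ) at or above a cutoff. The cutoff of 20 and the choice of flags are assumptions for illustration, not the pipeline's documented settings; with samtools, a comparable filter would be `samtools view -q 20 -F 0x104`.

```python
def passes_alignment_filter(sam_line: str, min_mapq: int = 20) -> bool:
    """Return True if a SAM record is a confidently placed primary alignment."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    is_unmapped = bool(flag & 0x4)
    is_secondary = bool(flag & 0x100)
    return not is_unmapped and not is_secondary and mapq >= min_mapq


# Hypothetical SAM records (11 mandatory tab-separated columns)
records = [
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",    # unique, high MAPQ
    "read2\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",     # ambiguous placement, MAPQ 0
    "read3\t256\tchr2\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",  # secondary alignment
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",           # unmapped
]
kept = [r.split("\t")[0] for r in records if passes_alignment_filter(r)]
print(kept)
```

Aligners such as BWA typically assign a low MAPQ to reads that fit several locations equally well, which is why a MAPQ cutoff removes them.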

Deduplication

During library preparation, PCR amplification is used to generate enough DNA copies for sequencing. However, amplification can introduce bias by over-amplifying specific DNA fragments, leading to an overrepresentation of identical sequences in the resulting sequencing data. This overrepresentation risks false-positive variant calls, because the same variant may appear to be supported many times by duplicate reads rather than by independent fragments of the genome. To address this issue, we employ MarkDuplicates (Picard), a tool that identifies and eliminates these duplicated mapped reads, so that variant calling relies only on unique sequencing data.

dedup-figure
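
The principle of duplicate removal can be shown with a deliberately simplified sketch: reads that share the same chromosome, start position, and strand are treated as copies of one fragment, and only the copy with the highest summed base quality is kept. This is an illustration of the idea only and is not Picard's actual algorithm; the `Read` type and all values are hypothetical.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int
    strand: str
    quality_sum: int  # sum of base qualities, used to pick the best copy


def remove_duplicates(reads: list[Read]) -> list[Read]:
    """Keep one read per (chrom, start, strand), preferring higher quality."""
    best: dict[tuple[str, int, str], Read] = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        current = best.get(key)
        if current is None or read.quality_sum > current.quality_sum:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.strand))


reads = [
    Read("r1", "chr1", 100, "+", 1200),
    Read("r2", "chr1", 100, "+", 1500),  # same position and strand as r1
    Read("r3", "chr1", 100, "-", 1100),  # same position, opposite strand
    Read("r4", "chr1", 250, "+", 1300),
]
unique = remove_duplicates(reads)
print([r.name for r in unique])
```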

Depth of Coverage

Depth of coverage in genomic variant calling comprises two crucial metrics: sequencing depth and sequencing coverage, both evaluated using Qualimap.

Sequencing depth quantifies how frequently a particular genomic position is sequenced, with higher depth aiding in the accurate identification of genuine variants and the detection of rare variations. Conversely, sequencing coverage represents the proportion of the reference genome encompassed by sequencing reads, ensuring a wider spectrum of variants can be detected.

It's important to note that excessively high depth of coverage can yield diminishing returns in variant detection.

The 'QualiMap - Cumulative genome coverage' chart illustrates the percentage of the reference genome that has been assessed with a minimum depth of coverage.

depth-figure
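
The two metrics, and the cumulative coverage chart built from them, can be computed from per-position depth values. The sketch below uses a tiny hypothetical list of depths in place of a real genome; the function names are illustrative and do not correspond to Qualimap's interface.

```python
def mean_depth(depths: list[int]) -> float:
    """Average number of reads covering each position."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(depths) / len(depths)


def breadth_of_coverage(depths: list[int], min_depth: int = 1) -> float:
    """Percentage of positions covered by at least min_depth reads."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


def cumulative_coverage(depths: list[int],
                        thresholds: list[int]) -> dict[int, float]:
    """Breadth of coverage at each minimum-depth threshold."""
    return {t: breadth_of_coverage(depths, t) for t in thresholds}


# Hypothetical depth at ten reference positions
depths = [0, 5, 10, 10, 20, 30, 0, 15, 25, 35]
average = mean_depth(depths)
curve = cumulative_coverage(depths, [1, 10, 30])
print(average, curve)
```

The curve always decreases as the threshold rises, which is the shape seen in the cumulative genome coverage chart.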

Joint-Variant Calling

Whole genome joint-variant calling is used to identify genetic variants across multiple individual samples/genomes simultaneously. By jointly analyzing the data, it leverages the shared information to improve variant calling accuracy, particularly in regions with lower coverage or higher complexity.

Filtering High-Quality Variants

High-quality variants are selected using pre-defined filtering thresholds on read count support (allelic depth, AD, and total depth, DP) and on minor allele frequency (MAF). These criteria ensure the reliability of variant assessment in whole genome sequencing projects.

Your report includes the unfiltered and filtered variant statistics.

variant-filter-figure
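
A threshold-based filter of this kind can be sketched as follows. The cutoffs (total depth of at least 10, alternate-allele depth of at least 3, MAF of at least 0.05) are hypothetical, and the direction of the MAF cutoff (discarding variants below a minimum) is also an assumption; the thresholds applied to your project may differ. The `Variant` type is defined here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    total_depth: int  # DP: reads covering the site
    alt_depth: int    # AD: reads supporting the alternate allele
    maf: float        # minor allele frequency across samples


def is_high_quality(v: Variant, min_dp: int = 10, min_alt_ad: int = 3,
                    min_maf: float = 0.05) -> bool:
    """Apply the three illustrative thresholds to one variant."""
    return (
        v.total_depth >= min_dp
        and v.alt_depth >= min_alt_ad
        and v.maf >= min_maf
    )


variants = [
    Variant("chr1", 1042, 35, 16, 0.25),  # passes all thresholds
    Variant("chr1", 2210, 6, 4, 0.30),    # total depth too low
    Variant("chr2", 512, 40, 1, 0.20),    # alternate allele depth too low
    Variant("chr2", 9931, 28, 12, 0.01),  # MAF below cutoff
]
kept = [v for v in variants if is_high_quality(v)]
print(f"{len(kept)} of {len(variants)} variants kept")
```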

Correlation between Samples

Pair-wise identification of matching SNPs

After filtering low-quality variant calls, pairs of samples are compared by counting shared SNP locations. The ratio of identical SNP calls to common SNP locations indicates similarity between samples. A value of 1.0 signifies identical calls, while lower values signify greater divergence or variant count variation.

snpCorr-figure
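
The similarity ratio described above can be written out directly: find the SNP locations called in both samples, count how many carry the same call, and divide. In the sketch below, genotype calls are stored as strings keyed by (chromosome, position); that representation and the example data are assumptions made for illustration.

```python
Site = tuple[str, int]  # (chromosome, position)


def snp_concordance(calls_a: dict[Site, str], calls_b: dict[Site, str]) -> float:
    """Ratio of identical SNP calls to SNP locations shared by both samples."""
    common = calls_a.keys() & calls_b.keys()
    if not common:
        raise ValueError("samples share no SNP locations")
    identical = sum(1 for site in common if calls_a[site] == calls_b[site])
    return identical / len(common)


# Hypothetical genotype calls for two samples
sample_a = {
    ("chr1", 100): "A/A",
    ("chr1", 200): "A/G",
    ("chr1", 300): "G/G",
    ("chr2", 50): "C/T",
}
sample_b = {
    ("chr1", 100): "A/A",
    ("chr1", 200): "G/G",  # differs from sample_a
    ("chr1", 300): "G/G",
    ("chr2", 50): "C/T",
    ("chr2", 75): "T/T",   # not called in sample_a, so ignored
}
similarity = snp_concordance(sample_a, sample_b)
print(similarity)
```

Because only shared locations enter the denominator, two samples with very different variant counts can still be compared, which is the behaviour the text describes.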

Software versions

Tool          Reference                                          Version
seqkit        https://github.com/shenwei356/seqkit               2.3.1
fastp         https://github.com/OpenGene/fastp                  0.23.1
bwa           https://github.com/lh3/bwa                         0.7.17
samtools      https://github.com/samtools/samtools               1.16.1
picard-tools  https://broadinstitute.github.io/picard/           2.26.3
qualimap      http://qualimap.conesalab.org/                     2.2.1
R             https://www.r-project.org/                         4.2.1
perl          https://www.perl.org/                              5.30.2
python        https://www.python.org/                            3.10.2
bcftools      https://samtools.github.io/bcftools/bcftools.html  1.16
multiqc       https://multiqc.info/                              1.14