GeNeViSTA
File formats and terminology | Brief description |
FASTA | The FASTA format, generally indicated by the suffix .fa or .fasta, is a straightforward, human-readable text format. Normally, each file consists of a set of sequences, where each sequence is represented by a one-line header starting with the ‘>’ character, followed by the corresponding nucleotide sequence. |
FASTQ | FASTQ is a human-readable text format that stores raw sequence reads together with their quality information. It normally provides four lines per read: a sequence identifier (starting with ‘@’), the raw sequence, a separator line starting with ‘+’ that may carry an optional comment, and the quality values for the sequence. The FASTQ format is commonly used to store sequencing reads, in particular from Illumina and Ion Torrent platforms. Paired-end reads may be stored either in one FASTQ file (alternating) or in two separate FASTQ files. |
SAM | SAM stands for Sequence Alignment/Map format. It is a tab-delimited text format that stores sequences mapped or aligned to a reference (here, the human reference genome). It contains an optional header section (lines starting with @) followed by an alignment section in which each line has 11 mandatory columns carrying essential alignment information such as the reference sequence name, mapping position, mapping quality, and the read sequence with its base qualities, followed by optional aligner-specific fields. |
BAM | BAM stands for Binary Alignment/Map format, in which aligned sequences are stored in a compressed, indexable binary form. It holds the same data as a SAM file in binary form and is designed to compress reasonably well, so a BAM file is much smaller than the corresponding SAM file and easier to store. |
VCF | The Variant Call Format (VCF) specifies a text file format in which the variant information for the sequenced samples is stored compactly. A VCF file starts with a header section of metadata lines describing the body of the file (each beginning with ##), followed by a column header line beginning with #CHROM. Each data line then has eight fixed tab-separated columns describing the variant and its position in the genome (CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO); when genotype data are present, a ninth FORMAT column is followed by one column of genotype information per sample. |
BED | BED stands for Browser Extensible Data format. This format is used to describe genes and other features of DNA sequences as genomic intervals. For exome sequencing, the genomic regions covered by the different commercially available exome capture kits are usually provided as BED files. These files are freely available on the manufacturers' websites (e.g. https://earray.chem.agilent.com/suredesign/), and CROs also provide them on request. When supplied at the variant calling step, a BED file restricts the exome data analysis to the specific regions covered in exome sequencing. |
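The four-line FASTQ layout described in the table can be illustrated with a short Python sketch. It assumes strictly four lines per read (no wrapped sequences) and a Phred quality offset of 33, which is common for recent Illumina data but is an assumption here; the names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical, and the input read is invented for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # raw base calls
    comment: str     # optional text after the '+' separator
    quality: str     # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one FastqRecord for every four lines of a FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        header = header.rstrip("\r\n")
        if not header:  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, separator[1:], quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters to Phred scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# Toy read invented for illustration; not taken from a real sequencing run.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
print(records[0].identifier, records[0].sequence, phred_scores(records[0].quality))
```

A real file would be opened with `open(path)` (or `gzip.open(path, "rt")` for compressed reads) and passed to `read_fastq` in place of the in-memory example.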
Exome data analysis requires high-end computer workstations and informatics tools, along with technical skills such as managing and storing large volumes of NGS data and databases. Moreover, streamlined and automated pipelines are very important for generating, annotating and analyzing sequence variants (D’Antonio et al., 2013). Several bioinformatics workflows exist, each tailored to particular NGS applications depending on the type of variation of interest and the technology employed. One widely recommended and widely used workflow for variant discovery is the GATK (Genome Analysis Toolkit) best practices guidelines, developed by the Broad Institute, USA (DePristo et al., 2011; Van der Auwera et al., 2013). However, these guidelines focus largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so working with other NGS platforms or experimental designs requires adapting certain branches of the workflow, as well as certain parameter selections and values. Further details of these guidelines are available at https://software.broadinstitute.org/gatk/best-practices/. A list of widely used, freely available analytical tools for the different steps of exome data analysis is given in Table 2 (Pabinger et al., 2014). A general exome analysis workflow, from raw reads (FASTA/FASTQ) to an annotated list of variants (text/Excel), is given in Figure 1.
Purpose | Software/Programs | Source |
Data quality checking and trimming | FastQC (raw fastq & BAM files) | |
| Trimmomatic | |
| NGSrich (BAM files only) | |
Sequence alignment (mapping) | BWA | |
| Novoalign | |
| Stampy | |
| Bowtie2 | |
| mrsFAST | |
| NextGenMap | |
Data processing | Picardtools | |
| SAMtools | |
Sequence visualization | IGV | |
Variant calling (SNV/Indel) | GATK | |
| SAMtools | |
| Platypus | |
| Freebayes | |
| SNVer | |
| VarScan 2 | |
Variant calling (CNV) | Conifer | |
| ExomeCNV | https://secure.genome.ucla.edu/index.php/ExomeCNV_User_Guide |
| CNVnator | |
VCF & BED file processing (optional) | BCFtools | |
| VCFtools | |
| BEDtools | |
Variant Annotation | Annovar | |
| wANNOVAR | |
| SeattleSeq Annotation | |
| AnnTools | |
| NGS-SNP | |
| SnpEff | |
| VARIANT | |
| VEP | |
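To show how a few of the tools in Table 2 fit together, the following Python sketch assembles the command lines for a heavily simplified alignment and variant calling run with BWA, SAMtools and GATK. It is not the full GATK best practices workflow: quality trimming, base quality score recalibration and variant filtering are omitted, and the reference is assumed to be indexed already. The function name `build_pipeline`, the sample name and all file names are hypothetical, and the commands are only printed, not executed.

```python
from typing import List


def build_pipeline(sample: str, reference: str, fastq_1: str, fastq_2: str,
                   targets_bed: str) -> List[List[str]]:
    """Return the command lines, in order, for a minimal exome workflow."""
    # bwa converts the literal "\t" sequences in the read-group string to tabs.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        # 1. Align paired-end reads to the reference genome.
        ["bwa", "mem", "-R", read_group, "-o", sam, reference, fastq_1, fastq_2],
        # 2. Sort alignments by coordinate and write a BAM file.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # 3. Mark duplicate reads (Picard tool bundled with GATK4).
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt"],
        # 4. Index the BAM so that callers and viewers can access it randomly.
        ["samtools", "index", dedup_bam],
        # 5. Call SNVs and indels, restricted to the capture regions in the BED file.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", dedup_bam,
         "-O", vcf, "-L", targets_bed],
    ]


# Hypothetical sample and file names, used only for illustration.
commands = build_pipeline("patient1", "ref.fa", "patient1_R1.fastq.gz",
                          "patient1_R2.fastq.gz", "capture_targets.bed")
for command in commands:
    print(" ".join(command))
```

Keeping each command as a list of arguments means it could later be passed to `subprocess.run(command, check=True)` without shell quoting issues, should one wish to execute the steps.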
A typical annotated variant list looks like the sample file shown in Figure 2 (usually provided by the CROs as the final output file). The main challenge in analyzing these variants in human diseases is to identify the disease-related alleles (which may be known or novel) among the large number of non-pathogenic polymorphisms in the genome. Identification of disease-causing variants in rare Mendelian disorders through exome sequencing relies on successive filtering steps that reduce the number of candidate variants and genes. Initial filtering is usually done against public databases such as the International HapMap Consortium, 1000 Genomes Project, Exome Variant Server (EVS), Exome Aggregation Consortium (ExAC), Complete Genomics 69 (CG69) and dbSNP, and against in-house population-specific databases (if available). Any variant present in these databases with a minor allele frequency (MAF) greater than 0.01 can be excluded from further consideration for rare diseases. Only missense, nonsense and splice-site variants, and indels that affect coding regions, are used for clinical interpretation. Clinically relevant mutations are then annotated using variants published in the literature and a set of variant databases including ClinVar, OMIM and the Human Gene Mutation Database. The list of candidate variants can then be reduced further on the basis of the mode of inheritance, for example a recessive or dominant model. An example of the different filtering steps is given in Figure 3 (Das Bhowmik et al., 2015).
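The filtering logic described above can be sketched in Python as follows. This is a minimal illustration, not the procedure of any particular pipeline: the dictionary keys (`maf`, `effect`, `zygosity`), the effect labels and the example variants are all hypothetical, and a variant missing from a database is assumed to be rare.

```python
from typing import Any, Dict, Iterable, List, Optional, Set

MAF_CUTOFF = 0.01  # variants with MAF above this value are excluded
RETAINED_EFFECTS = {"missense", "nonsense", "splice_site", "coding_indel"}

Variant = Dict[str, Any]


def is_rare(variant: Variant, cutoff: float = MAF_CUTOFF) -> bool:
    """True if no database reports a MAF above the cutoff (absent = rare)."""
    frequencies = [f for f in variant.get("maf", {}).values() if f is not None]
    return all(f <= cutoff for f in frequencies)


def filter_candidates(variants: Iterable[Variant],
                      zygosities: Optional[Set[str]] = None) -> List[Variant]:
    """Keep rare variants with a retained effect and, optionally, a given zygosity."""
    kept = []
    for variant in variants:
        if not is_rare(variant):
            continue  # common polymorphism
        if variant["effect"] not in RETAINED_EFFECTS:
            continue  # e.g. intronic, intergenic or synonymous
        if zygosities is not None and variant["zygosity"] not in zygosities:
            continue  # does not fit the assumed mode of inheritance
        kept.append(variant)
    return kept


# Toy variants invented for illustration only.
variants = [
    {"gene": "GENE_A", "effect": "missense", "zygosity": "heterozygous",
     "maf": {"1000G": 0.20, "ExAC": 0.18}},
    {"gene": "GENE_B", "effect": "synonymous", "zygosity": "homozygous",
     "maf": {}},
    {"gene": "GENE_C", "effect": "coding_indel", "zygosity": "hemizygous",
     "maf": {"1000G": None, "ExAC": 0.0004}},
    {"gene": "GENE_D", "effect": "missense", "zygosity": "heterozygous",
     "maf": {"ExAC": 0.002}},
]

rare_coding = filter_candidates(variants)
recessive_model = filter_candidates(variants, {"homozygous", "hemizygous"})
print([v["gene"] for v in rare_coding])
print([v["gene"] for v in recessive_model])
```

Applying the frequency and effect filters first and the inheritance model last mirrors the order of the steps in the text, so the intermediate counts can be reported at each stage.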
Filtering criteria | Number of variants |
Total number of variants | 26,612 |
Variants after filtering for 1000 Genomes (≤0.01) | 3,026 |
Variants after filtering for Exome Variant Server (≤0.01) | 2,692 |
Variants after filtering for Exome Aggregation Consortium (≤0.01) | 1,760 |
Variants remaining after filtering for intergenic, intronic and synonymous variants (retained exonic nonsynonymous, splice-site variants and indels causing frameshift) | 896 |
Homozygous and hemizygous variants | 34 |
Variants remaining after excluding Indian polymorphisms from in house data | 25 |
CROs will also generally provide a BAM (Binary Alignment Map) file, accompanied by a .bai index file, which holds the complete mapped read data of the exome and can be viewed in a high-performance visualization tool such as the Integrative Genomics Viewer (IGV), and a VCF (Variant Call Format) file, which contains the exome sequence variants. The VCF file can be used to annotate the variants with online annotation tools such as wANNOVAR and SeattleSeq.
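For readers who want to inspect a VCF file before uploading it to an annotation tool, the Python sketch below reads the basic structure of an uncompressed VCF: metadata lines starting with ##, the #CHROM column header, and tab-separated data lines. It is a simplified illustration rather than a complete parser (established libraries handle many more cases); the function name `parse_vcf` and the example record are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def parse_vcf(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse VCF text lines into one dictionary per variant record."""
    samples: List[str] = []
    records: List[Dict[str, Any]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata lines
        fields = line.split("\t")
        if line.startswith("#"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
        info_map: Dict[str, Any] = {}
        for item in info.split(";"):
            if item in ("", "."):
                continue
            key, has_value, value = item.partition("=")
            info_map[key] = value if has_value else True  # flags carry no value
        genotypes: Dict[str, Dict[str, str]] = {}
        if len(fields) > 9:
            keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
            for sample, column in zip(samples, fields[9:]):
                genotypes[sample] = dict(zip(keys, column.split(":")))
        records.append({
            "CHROM": chrom, "POS": int(pos), "ID": variant_id, "REF": ref,
            "ALT": alt.split(","), "QUAL": qual, "FILTER": filt,
            "INFO": info_map, "genotypes": genotypes,
        })
    return records


# Toy VCF content invented for illustration; coordinates are not real findings.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "patient1"]),
    "\t".join(["chr1", "12345", ".", "GA", "G", "50", "PASS", "DP=30;DB",
               "GT:DP", "1/1:30"]),
]
records = parse_vcf(toy_vcf)
print(records[0]["CHROM"], records[0]["POS"], records[0]["genotypes"])
```

A file on disk can be passed directly as `parse_vcf(open(path))`, since an open text file iterates over its lines.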
A six-year-old male child, born to non-consanguineous parents, was diagnosed with an unexplained overgrowth syndrome. Clinical features were suggestive of Beckwith–Wiedemann syndrome (BWS). The patient was investigated extensively for BWS, including karyotyping, array comparative genomic hybridization, methylation analysis at the IC1 locus and Sanger sequencing of the CDKN1C gene. Since all the results were normal, the patient was taken up for WES.
The sequences obtained after WES were analyzed following GATK best practices guidelines (D’Antonio et al., 2013; DePristo et al., 2011). Variant annotation was performed using Annovar for location and predicted function (Wang et al., 2010). Gross filtering was done using 1000 Genomes (≤0.01 MAF), EVS (≤0.01 MAF), ExAC (≤0.01 MAF) and dbSNP databases. Clinically relevant mutations were annotated using published variants in literature and a set of variant databases including ClinVar, OMIM and HGMD. Only non-synonymous, splice site, nonsense and frameshift variants found in the coding regions were used for clinical interpretation. Silent variations that do not result in any change in amino acid in the coding region were excluded.
Exome sequencing yielded a total of 26,612 variants. Figure 3 illustrates the filtering strategy used, which resulted in a total of 896 exonic nonsynonymous, splice-site and frameshift variants, of which 34 were homozygous/hemizygous. Since none of the heterozygous variants were related to the phenotype of the patient, only the homozygous/hemizygous variants were considered for further analysis. Among these, 9 variants were present in our in-house exome database and were excluded from the study. Of the 25 variants that remained, 24 were reported in dbSNP with no clinical significance and were therefore also excluded. Finally, only one variant was left, which was also relevant to the clinical indication: an unreported hemizygous single base pair deletion in exon 8 of the GPC3 gene (chrX:132670203delA). Mutations in GPC3 are known to cause Simpson-Golabi-Behmel syndrome. Further in silico analysis revealed that this mutation results in a frameshift and is likely to create a new stop codon 62 amino acids downstream of codon 564 (c.1692delT; p.Leu565SerfsTer63) of the protein. Thus, WES helped in this case to establish the diagnosis of Simpson-Golabi-Behmel syndrome in a patient with an unexplained overgrowth syndrome.
With the steadily decreasing cost of exome sequencing, the technology has become imperative in the molecular diagnosis of rare Mendelian disorders. In this era of NGS, it is important for everyone working in this field to have at least a basic knowledge of exome analysis in order to interpret results correctly, which will ultimately help in carrying out appropriate pre-test and post-test counseling of patients. It is also advisable to keep up with the continuous flow of updates in exome analysis, since both the technology and the analytical methods are still evolving.
It is believed that, because of our poor understanding of non-coding genetic variation, the analytical components of most whole genome studies have inconsistently depended on variation within the exome. If the cost of sequencing continues to fall at this pace, it is possible that the field will gradually move from whole exome to whole genome sequencing. However, taking advantage of the more compact data of the exome for disease gene discovery and molecular diagnostics in patients crucially depends on the development of analytical strategies for our understanding of non-coding variation. This is as much an opportunity as it is a challenge.
1. Bamshad MJ, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 2011; 12: 745-755.
2. Ng SB, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009; 461: 272-276.
3. Ng SB, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 2010; 42: 30-35.
4. D’Antonio M, et al. WEP: a high-performance analysis pipeline for whole-exome data. BMC Bioinformatics 2013; 14 Suppl 7: S11.
5. DePristo M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011; 43: 491-498.
6. Van der Auwera GA, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 2013; 43: 11.10.1-33.
7. Pabinger S, et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2014; 15: 256-278.
8. Das Bhowmik A, et al. Whole exome sequencing identifies a novel frameshift mutation in GPC3 gene in a patient with overgrowth syndrome. Gene 2015; 572: 303-306.
9. Wang K, et al. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010; 38: e164.