VCF

Latest revision as of 16:05, 22 October 2017

Introduction

VCF (Variant Call Format) is a text file format that records the variants shown by reads relative to the reference they are aligned to.

  • Holds information for multiple samples by devoting one column to each sample's variant information (relative to the reference).
  • Consequently, the number of columns in a VCF file varies with the number of samples it contains.
  • The meta-information header lines at the top of the file (starting with ##) keep track of this variability by describing in detail what the columns and subcolumns contain.
  • There is also a general header line (starting with #CHROM) giving the column names.
  • The first seven columns are fixed: CHROM, POS, ID, REF, ALT, QUAL, FILTER. They describe the variant itself.
  • The next two columns, headed INFO and FORMAT, hold subcolumns: in INFO they are separated by semicolons, in FORMAT by colons.
  • The subcolumns in INFO and FORMAT are also variable. INFO gives extra information on the variant, and FORMAT gives the labels for the information held in the subsequent sample columns.
  • Each sample has one column, headed (labelled) by the sample name; its colon-separated values follow the order of the labels given in the FORMAT column.
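The column layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function name `parse_vcf_line`, the sample names and the example values are made up for this page, and it assumes tab-separated columns, semicolon-separated INFO entries and colon-separated FORMAT and sample values, as laid out in the VCF specification.

```python
def parse_vcf_line(header_line, data_line):
    """Split one VCF data line into fixed fields, INFO entries and per-sample values."""
    names = header_line.lstrip("#").rstrip("\n").split("\t")
    values = data_line.rstrip("\n").split("\t")

    # CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, keyed by column name.
    record = dict(zip(names[:8], values[:8]))

    # INFO holds semicolon-separated entries, either KEY=VALUE or a bare flag.
    info = {}
    if record["INFO"] != ".":
        for entry in record["INFO"].split(";"):
            key, sep, value = entry.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info

    # FORMAT gives the colon-separated labels for every sample column after it.
    samples = {}
    if len(names) > 9:
        labels = values[8].split(":")
        for sample_name, sample_value in zip(names[9:], values[9:]):
            samples[sample_name] = dict(zip(labels, sample_value.split(":")))
    record["samples"] = samples
    return record


# Hypothetical header and data line with two samples.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB"
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;DP4=10,8,7,5\tGT:DP\t0/1:18\t1/1:12"

record = parse_vcf_line(header, line)
print(record["POS"], record["INFO"]["DP4"], record["samples"]["sampleA"]["GT"])
```

All values are kept as strings here, so a dot (missing information) passes through unchanged and can be checked for by the caller.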

Important note: A dot appearing in a subcolumn represents missing information.

BCF is simply the binary (and therefore compressed) version of the file format.

The VCF specification used to be maintained by the 1000 Genomes Project. The specification for version 4.2, the latest at the time of writing, is hosted at the samtools website (http://samtools.github.io/hts-specs/VCFv4.2.pdf).

Common FORMAT subcolumns

It's worth re-iterating that all these are per-sample attributes and values.

  • GT: Genotype for the sample at that variant position. Possibly the most important piece of data for the sample. For a SNP this will give "0/0" for homozygous on the Ref (i.e. the sample is not variant at this position), and "1/0" or "0/1" for heterozygous. "1/1" is homozygous on the Alt, i.e. the alternative, the variant SNP.
  • DP refers to the overall depth at the locus/position without taking base quality into account.
  • DP4 refers to 4 depth readings separated by commas. In contrast to DP, these are filtered for base quality. The first pair refers to the depth of reads conforming to the reference allele, first on the forward strand, second on the reverse strand. The second pair refers to the alternate allele depth, again with the forward strand coming first and the reverse strand second. A simple example is shown at https://www.biostars.org/p/13313
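As a rough sketch of how the GT and DP4 values might be interpreted in Python: the helper names `classify_genotype` and `parse_dp4` are invented for this example, the genotype is assumed to be diploid with alleles separated by "/" (unphased) or "|" (phased), and DP4 is assumed to hold four comma-separated integers in the order reference-forward, reference-reverse, alternate-forward, alternate-reverse.

```python
import re


def classify_genotype(gt):
    """Describe a GT value such as "0/1" in words."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        # A dot means the genotype could not be called for this sample.
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


def parse_dp4(dp4):
    """Split a DP4 value into its four strand-specific depths."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    return {
        "ref_forward": ref_fwd,
        "ref_reverse": ref_rev,
        "alt_forward": alt_fwd,
        "alt_reverse": alt_rev,
        # Combined quality-filtered depth supporting each allele.
        "ref_total": ref_fwd + ref_rev,
        "alt_total": alt_fwd + alt_rev,
    }


for gt in ("0/0", "0/1", "1/0", "1/1", "./."):
    print(gt, "->", classify_genotype(gt))

depths = parse_dp4("10,8,7,5")
print(depths["ref_total"], depths["alt_total"])
```

Because DP4 is filtered for base quality and DP is not, the four DP4 readings can add up to less than DP at the same position.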

Allele Frequencies

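Please look at the following GATK forum link for instructions on how to obtain allele frequencies from a VCF file: https://gatkforums.broadinstitute.org/gatk/discussion/6202/vcf-file-and-allele-frequency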
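For a quick impression of what an allele frequency is, the sketch below simply counts alleles in the GT values of the samples at one position. It is not the procedure described at the GATK link, only an illustration under stated assumptions: the function name `alt_allele_frequency` is made up for this page, any non-zero allele is treated as alternate, and missing alleles (dots) are left out of the count.

```python
import re


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference, or None if nothing was called."""
    called = [
        allele
        for gt in genotypes
        for allele in re.split(r"[/|]", gt)
        if allele != "."
    ]
    if not called:
        return None
    alternate = sum(1 for allele in called if allele != "0")
    return alternate / len(called)


# Four hypothetical samples at one position; the last has no genotype call.
print(alt_allele_frequency(["0/1", "1/1", "0/0", "./."]))
```

In this example three of the six called alleles are alternate, so the function returns 0.5.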