VCF

Introduction

VCF (Variant Call Format) is a text file format that records the variants that aligned reads show relative to the reference they are aligned to.

  • Supports multiple samples by devoting one column to each sample's variant information (relative to the reference).
  • As a result, the number of columns in a VCF file varies with the number of samples included.
  • Metadata header lines at the top of the file (starting with ##) keep track of this variability and describe what the columns and their subcolumns contain.
  • A single general header line (starting with #CHROM) then gives the column names.
  • The first seven columns are fixed: CHROM, POS, ID, REF, ALT, QUAL, FILTER. They describe the variant itself.
  • The next two columns, headed INFO and FORMAT, hold subcolumns. INFO entries are separated by semicolons (usually as KEY=VALUE pairs), while FORMAT keys are separated by colons.
  • The set of subcolumns in INFO and FORMAT varies from file to file. INFO gives extra information on the variant, and FORMAT names the fields held in the sample columns that follow.
  • There is one column per sample, headed by the sample name. The values in each sample column are colon-separated and follow the order of the labels given in the FORMAT column.
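The column layout described above can be sketched in a few lines of Python. This is only a minimal illustration, not a full parser: the header line, the record, the sample names (sampleA, sampleB) and the helper name parse_record are all made up for the example. It assumes tab-separated columns, semicolon-separated INFO entries and colon-separated FORMAT and sample values, as laid out in the VCF specification.

```python
# Illustrative header and record; the sample names and values are invented.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB"
record = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;AF=0.5\tGT:DP\t0/1:14\t1/1:16"


def parse_record(header_line, record_line):
    """Split one VCF data line into fixed fields, INFO, and per-sample dicts."""
    names = header_line.lstrip("#").split("\t")
    values = record_line.rstrip("\n").split("\t")
    row = dict(zip(names, values))

    # INFO: semicolon-separated entries, either KEY=VALUE or a bare flag.
    info = {}
    if row["INFO"] != ".":
        for entry in row["INFO"].split(";"):
            key, sep, value = entry.partition("=")
            info[key] = value if sep else True

    # FORMAT: colon-separated keys that label the colon-separated
    # values found in every sample column.
    format_keys = row["FORMAT"].split(":")
    samples = {}
    for sample_name in names[9:]:
        sample_values = row[sample_name].split(":")
        samples[sample_name] = dict(zip(format_keys, sample_values))

    # The first seven columns describe the variant itself.
    fixed = {name: row[name] for name in names[:7]}
    return fixed, info, samples


fixed, info, samples = parse_record(header, record)
print(fixed["CHROM"], fixed["POS"], fixed["REF"], fixed["ALT"])
print(info)
print(samples)
```

All values are kept as strings here, so a missing value stays as a literal dot and is easy to test for. For real work an established library is likely a better choice than hand-rolled splitting.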

Important note: A dot appearing in a subcolumn represents missing information.

BCF is the binary, compressed counterpart of the VCF format.

The VCF format used to be maintained by the 1000 Genomes Project. The specification for version 4.2, the latest at the time of writing, is hosted at the samtools website.

Common FORMAT subcolumns

It's worth re-iterating that all these are per-sample attributes and values.

  • GT: the genotype of the sample at the variant position, possibly the most important piece of data for the sample. For a SNP, "0/0" means homozygous for the reference allele (i.e. the sample is not variant at this position), "0/1" and "1/0" mean heterozygous, and "1/1" means homozygous for the alternate allele, i.e. the variant SNP.
  • DP: the overall read depth at the locus/position, without taking base quality into account.
  • DP4: four depth values separated by commas. In contrast to DP, these are filtered for base quality. The first pair gives the depth of reads supporting the reference allele, first on the forward strand and then on the reverse strand. The second pair gives the depth of reads supporting the alternate allele, again with the forward strand first and the reverse strand second. For illustration, a DP4 of 10,8,3,2 would mean 10 forward and 8 reverse reads supporting the reference, and 3 forward and 2 reverse reads supporting the alternate.
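As a rough sketch of how GT and DP4 values might be interpreted in code, the two helpers below classify a diploid genotype string and split a DP4 value into named strand counts. The function names classify_genotype and parse_dp4 and the input values are hypothetical, chosen only for illustration. The sketch assumes "/" or "|" as the allele separator in GT and commas between the DP4 values.

```python
def classify_genotype(gt):
    """Classify a GT string such as '0/1' or '1|1'."""
    # '/' separates unphased alleles and '|' phased ones; treat them alike.
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous reference"
    return "homozygous alternate"


def parse_dp4(dp4):
    """Split a DP4 string into integer counts keyed by allele and strand."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    return {
        "ref_fwd": ref_fwd,
        "ref_rev": ref_rev,
        "alt_fwd": alt_fwd,
        "alt_rev": alt_rev,
    }


for gt in ("0/0", "0/1", "1/0", "1/1", "./."):
    print(gt, "->", classify_genotype(gt))

counts = parse_dp4("10,8,3,2")
alt_reads = counts["alt_fwd"] + counts["alt_rev"]
print(counts, "alt-supporting reads:", alt_reads)
```

Comparing the forward and reverse counts for the alternate allele is one quick way to spot a variant supported by only one strand.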

Allele Frequencies

Please see the following GATK forum link for instructions on how to calculate allele frequencies.
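Independently of the GATK instructions, one simple way to estimate an alternate allele frequency at a site is to count alleles across the samples' GT values. The sketch below is an assumption for illustration and not necessarily the method described in the GATK forum. The function name alt_allele_frequency and the genotype list are made up, and every non-reference allele is counted as alternate.

```python
def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference across samples.

    Returns None when no allele is called at the site.
    """
    alt = 0
    called = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # a dot marks missing information; skip it
            called += 1
            if allele != "0":
                alt += 1
    return alt / called if called else None


# Hypothetical GT values for four samples at one site.
site_genotypes = ["0/0", "0/1", "1/1", "./."]
print(alt_allele_frequency(site_genotypes))
```

Missing genotypes are left out of the denominator, so the frequency reflects only the alleles that were actually called.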