Difference between revisions of "CallSNPs.py"

From wiki
Jump to: navigation, search
Line 5: Line 5:
 
= Stages =
 
= Stages =
  
1. Calculate likelihoods of each possible variant, and apply prior and call variants.
+
==1. Calculate likelihoods of each possible variant, and apply prior and call variants==
 
  samtool mpileup -E -d ''NUM'' -DSugB -Q 1 -f ''ref_file'' ''input_bam_file'' | bcftools view -cgbu - > ''output_bcf_file''
 
  samtool mpileup -E -d ''NUM'' -DSugB -Q 1 -f ''ref_file'' ''input_bam_file'' | bcftools view -cgbu - > ''output_bcf_file''
  
Line 11: Line 11:
  
  
2. Filter variants through a maximum read depth
+
==2. Filter variants through a maximum read depth ==
 
  bcftools view ''bcf_file'' | vcfutils.pl varFilter -D''NUM''
 
  bcftools view ''bcf_file'' | vcfutils.pl varFilter -D''NUM''
 
'''bcftools view''' output the variants in VCF format and '''vcfutils.pl''' perl script , which is part of samtools platform, filters for variants via the ''-D'' option that sets the maximum  value
 
'''bcftools view''' output the variants in VCF format and '''vcfutils.pl''' perl script , which is part of samtools platform, filters for variants via the ''-D'' option that sets the maximum  value
  
3. Extract features with minimum coverage
+
== 3. Extract features with minimum coverage ==
 
  bedtools genomecov -bga -ibam ''input_bam_file'' | awk -v t=''NUM'' '{if($4>=t)print}' | bedtools merge -i - > ''output_bed_file''
 
  bedtools genomecov -bga -ibam ''input_bam_file'' | awk -v t=''NUM'' '{if($4>=t)print}' | bedtools merge -i - > ''output_bed_file''
  
 
'''bedtools genomecov''' outputs the number of aligned short-read bases in the genome in a format (enabled by the '''-bga''' option) in the fourth column. This is picked up by the '''awk''' which ignores all those less than '''NUM'''. If '''NUM''' is set to zero, the entire variant callset will be returned in raw form becuase nothing will be filtered and the '''-bga''' (with ''a''' presumably standing for ''all'')  includes areas with zero coverage (this sounds uninteresting but it could be useful to have, so it's often kept).
 
'''bedtools genomecov''' outputs the number of aligned short-read bases in the genome in a format (enabled by the '''-bga''' option) in the fourth column. This is picked up by the '''awk''' which ignores all those less than '''NUM'''. If '''NUM''' is set to zero, the entire variant callset will be returned in raw form becuase nothing will be filtered and the '''-bga''' (with ''a''' presumably standing for ''all'')  includes areas with zero coverage (this sounds uninteresting but it could be useful to have, so it's often kept).

Revision as of 21:07, 2 April 2016

The call SNPS.py script is part of "script_tools" modules.

It requires an alignment in the shape of the base reference sequence and the bam files of the short reads aligned to that sequence (run previously, maybe with bwa or one of the bowtie programs).

Stages

1. Calculate likelihoods of each possible variant, and apply prior and call variants

samtool mpileup -E -d NUM -DSugB -Q 1 -f ref_file input_bam_file | bcftools view -cgbu - > output_bcf_file

mpileup collects summary information from the input BAMs and then computes and stores the likelihood of data given each possible genotype. bcftools applies to prior and then calls the variants. Can be quite raw so often a filter is needed after this.


2. Filter variants through a maximum read depth

bcftools view bcf_file | vcfutils.pl varFilter -DNUM

bcftools view output the variants in VCF format and vcfutils.pl perl script , which is part of samtools platform, filters for variants via the -D option that sets the maximum value

3. Extract features with minimum coverage

bedtools genomecov -bga -ibam input_bam_file | awk -v t=NUM '{if($4>=t)print}' | bedtools merge -i - > output_bed_file

'bedtools genomecov outputs the number of aligned short-read bases in the genome in a format (enabled by the -bga option) in the fourth column. This is picked up by the awk which ignores all those less than NUM. If NUM is set to zero, the entire variant callset will be returned in raw form becuase nothing will be filtered and the -bga (with a presumably standing for all) includes areas with zero coverage (this sounds uninteresting but it could be useful to have, so it's often kept).