Difference between revisions of "CallSNPs.py"
(→5. Intersect samples' features to find common features set) |
|||
Line 14: | Line 14: | ||
'''bcftools view''' outputs the variants in VCF format and '''vcfutils.pl''' perl script , which is part of samtools platform, filters for variants on their maximum value via the ''-D'' option. | '''bcftools view''' outputs the variants in VCF format and '''vcfutils.pl''' perl script , which is part of samtools platform, filters for variants on their maximum value via the ''-D'' option. | ||
− | == 3. | + | == 3. Housekeeping: compress vcf files== |
Not strictly part of the sequential run of the pipeline, but at this point the vcf files are compressed with '''samtools'''' bgzip and then indexed with '''tabix''' (also from '''samtools''', a generic tab-file delimited file indexer). | Not strictly part of the sequential run of the pipeline, but at this point the vcf files are compressed with '''samtools'''' bgzip and then indexed with '''tabix''' (also from '''samtools''', a generic tab-file delimited file indexer). | ||
bgzip -c ''vcf_file'' > ''vcf_file.gz'' | bgzip -c ''vcf_file'' > ''vcf_file.gz'' | ||
Line 23: | Line 23: | ||
bedtools genomecov -bga -ibam ''input_bam_file'' | awk -v t=''NUM'' "{if($4>=t)print}" | bedtools merge -i - > ''output_bed_file'' | bedtools genomecov -bga -ibam ''input_bam_file'' | awk -v t=''NUM'' "{if($4>=t)print}" | bedtools merge -i - > ''output_bed_file'' | ||
− | '''bedtools genomecov''' outputs the number of aligned short-read bases in the genome in a format (enabled by the '''-bga''' option) in the fourth column. This is picked up by the '''awk''' program which ignores all those less than '''NUM'''. If '''NUM''' is set to zero, the entire variant callset will be returned in raw form because nothing will be filtered and the '''-bga''' (with '''-a''' presumably standing for ''all'') will include areas with zero coverage (this sounds uninteresting but it could be useful to have, so it's often kept). The final merge, as well as merging the filtered output, also makes sure that split and overlapping features are rolled into one. Note that at this point our variant format VCF turns into a BED file. | + | '''bedtools genomecov''' outputs the number of aligned short-read bases in the genome in a format (enabled by the '''-bga''' option) in the fourth column. This is picked up by the '''awk''' program which ignores all those less than '''NUM'''. If '''NUM''' is set to zero, the entire variant callset will be returned in raw form because nothing will be filtered and the '''-bga''' (with '''-a''' presumably standing for ''all'') will include areas with zero coverage (this sounds uninteresting but it could be useful to have, so it's often kept). The final merge, as well as merging the filtered output, also makes sure that split and overlapping features are rolled into one. Note that at this point our variant format VCF turns into a BED file, and we move from talking about variants to talking about features, which is what bed files record. |
== 5. Intersect samples' features to find common feature-set == | == 5. Intersect samples' features to find common feature-set == | ||
With every short read alignment (i.e. ''BAM'' file) now having its set of features in a corresponding ''BED'' file, we look to find the features that all samples share by using '''bedtools'''' intersect subcommand. So for each sample's ''BED'' file, the following is executed> | With every short read alignment (i.e. ''BAM'' file) now having its set of features in a corresponding ''BED'' file, we look to find the features that all samples share by using '''bedtools'''' intersect subcommand. So for each sample's ''BED'' file, the following is executed> | ||
− | bedtools intersect -a '' | + | bedtools intersect -a ''cumulative_intersect_bed_file'' -b ''sample_bed_file'' > ''single_intersect_bed_file'' |
mv ''single_intersect_file'' ''cumulative_intersect_file'' | mv ''single_intersect_file'' ''cumulative_intersect_file'' | ||
+ | |||
+ | At the end, the cumulative intersect file is renamed so we can know its work is finished. | ||
+ | mv ''cumulative_intersect_bed_file'' ''final_intersect_bed_file'' | ||
+ | |||
+ | == 6. Merge vcf files== | ||
+ | vcftools --gzvcf ''sample_vc_file'' --bed ''final_intersect_bed_file'' --out ''merged_vcf_file'' --recode --keep-INFO-all |
Revision as of 13:12, 3 April 2016
The call SNPS.py script is part of "script_tools" modules.
It requires an alignment in the shape of the base reference sequence and the bam files of the short reads aligned to that sequence (run previously, maybe with bwa or one of the bowtie programs).
Contents
- 1 Stages
- 1.1 1. Calculate likelihoods of each possible variant, and apply prior and call variants
- 1.2 2. Filter variants through a maximum read depth
- 1.3 3. Housekeeping: compress vcf files
- 1.4 4. Extract features with minimum coverage
- 1.5 5. Intersect samples' features to find common feature-set
- 1.6 6. Merge vcf files
Stages
1. Calculate likelihoods of each possible variant, and apply prior and call variants
samtool mpileup -E -d NUM -DSugB -Q 1 -f ref_file input_bam_file | bcftools view -cgbu - > output_bcf_file
mpileup collects summary information from the input BAMs and then computes and stores the likelihood of data given each possible genotype. bcftools applies to prior and then calls the variants. Can be quite raw so often a filter is needed after this.
2. Filter variants through a maximum read depth
bcftools view bcf_file | vcfutils.pl varFilter -DNUM
bcftools view outputs the variants in VCF format and vcfutils.pl perl script , which is part of samtools platform, filters for variants on their maximum value via the -D option.
3. Housekeeping: compress vcf files
Not strictly part of the sequential run of the pipeline, but at this point the vcf files are compressed with samtools' bgzip and then indexed with tabix (also from samtools, a generic tab-file delimited file indexer).
bgzip -c vcf_file > vcf_file.gz tabix -p vcf vcf_file.gz
The latter command creates an extra vcf_file.gz.tbi file in the directory.
4. Extract features with minimum coverage
bedtools genomecov -bga -ibam input_bam_file | awk -v t=NUM "{if($4>=t)print}" | bedtools merge -i - > output_bed_file
bedtools genomecov outputs the number of aligned short-read bases in the genome in a format (enabled by the -bga option) in the fourth column. This is picked up by the awk program which ignores all those less than NUM. If NUM is set to zero, the entire variant callset will be returned in raw form because nothing will be filtered and the -bga (with -a presumably standing for all) will include areas with zero coverage (this sounds uninteresting but it could be useful to have, so it's often kept). The final merge, as well as merging the filtered output, also makes sure that split and overlapping features are rolled into one. Note that at this point our variant format VCF turns into a BED file, and we move from talking about variants to talking about features, which is what bed files record.
5. Intersect samples' features to find common feature-set
With every short read alignment (i.e. BAM file) now having its set of features in a corresponding BED file, we look to find the features that all samples share by using bedtools' intersect subcommand. So for each sample's BED file, the following is executed>
bedtools intersect -a cumulative_intersect_bed_file -b sample_bed_file > single_intersect_bed_file mv single_intersect_file cumulative_intersect_file
At the end, the cumulative intersect file is renamed so we can know its work is finished.
mv cumulative_intersect_bed_file final_intersect_bed_file
6. Merge vcf files
vcftools --gzvcf sample_vc_file --bed final_intersect_bed_file --out merged_vcf_file --recode --keep-INFO-all