Difference between revisions of "Calculating coverage"

From wiki
Jump to: navigation, search
(Created page with "= Introduction = Coverage can be calculated in several different ways and this pages aims to describe a few of them = Bedtools genomecov = This is one of the most commonly...")
 
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
= Introduction =
 
= Introduction =
  
Coverage can be calculated in several different ways and this pages aims to describe a few of them
+
Coverage can be calculated in several different ways and this pages aims to describe a few of them.
  
= Bedtools genomecov =
+
The most prominent of the coverage tools is probably the bedtools suite.
  
This is one of the most commonly used methods and requires loading of the bedtools module. The base of the command is
+
= BAM files =
 +
 
 +
This is one of the most commonly used methods and requires loading of the bedtools module, also used in the callSNPs.py tool. The base of the command is
 
  bedtools genomecov -bga -ibam <bamfile>
 
  bedtools genomecov -bga -ibam <bamfile>
 +
 +
<u>Explanation</u>:
 +
* the '''-bg''' option is for bedgraph
 +
 +
Note how only the BAM file is required, because though BAM files do not have the sequences of the reference, they do contain the coordinate system.
 +
 +
== in a script with AWK ==
 +
 +
Here is an example script for using '''genomecov''' with AWK to calculate coverage
 +
 +
#!/bin/bash
 +
EXPECTED_ARGS=2
 +
if [ $# -ne $EXPECTED_ARGS ]; then
 +
        echo "This script uses awk to calculate average coverage in a bamfile. It needs $EXPECTED_ARGS arguments"
 +
        echo "Correct usage: $0 <subfolder_with_all_the_bamfiles>> <minimum_coverage_value>"
 +
        exit
 +
fi
 +
module load bedtools
 +
O=$(find $1 -iname "*.bam")
 +
ARRAY=( $O )
 +
for BF in ${ARRAY[@]}; do
 +
        echo -n "$BF: "
 +
        bedtools genomecov -bga -ibam $BF |awk -v t=$2 '{if($4>=t) {j=$3-$2;i=i+j;SUM=SUM+$4*j;}} END {print SUM/(i+0.0);}'
 +
done
 +
 +
= VCF/GFF/BED files =
 +
 +
If you have say a GFF, by itself it is not enough to calculate coverage. A genome file is also needed. This is not exactly the reference file, so some prior operations are required. With he refer you can obtain an index:
 +
 +
samtools faidx <reference.fasta>

Latest revision as of 22:42, 4 March 2017

Introduction

Coverage can be calculated in several different ways and this pages aims to describe a few of them.

The most prominent of the coverage tools is probably the bedtools suite.

BAM files

This is one of the most commonly used methods and requires loading of the bedtools module, also used in the callSNPs.py tool. The base of the command is

bedtools genomecov -bga -ibam <bamfile>

Explanation:

  • the -bg option is for bedgraph

Note how only the BAM file is required, because though BAM files do not have the sequences of the reference, they do contain the coordinate system.

in a script with AWK

Here is an example script for using genomecov with AWK to calculate coverage

#!/bin/bash
EXPECTED_ARGS=2
if [ $# -ne $EXPECTED_ARGS ]; then
        echo "This script uses awk to calculate average coverage in a bamfile. It needs $EXPECTED_ARGS arguments"
        echo "Correct usage: $0 <subfolder_with_all_the_bamfiles>> <minimum_coverage_value>"
        exit
fi
module load bedtools
O=$(find $1 -iname "*.bam")
ARRAY=( $O )
for BF in ${ARRAY[@]}; do
        echo -n "$BF: "
        bedtools genomecov -bga -ibam $BF |awk -v t=$2 '{if($4>=t) {j=$3-$2;i=i+j;SUM=SUM+$4*j;}} END {print SUM/(i+0.0);}'
done

VCF/GFF/BED files

If you have say a GFF, by itself it is not enough to calculate coverage. A genome file is also needed. This is not exactly the reference file, so some prior operations are required. With he refer you can obtain an index:

samtools faidx <reference.fasta>