Difference between revisions of "Bedtools"

From wiki
Jump to: navigation, search
Line 12: Line 12:
  
 
The example in the documentation ([http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html bedtools genomecov docs]) isn't very well explained, let's go over it. Note that the input is not a bam file as is more common, but a bed file, in this case.
 
The example in the documentation ([http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html bedtools genomecov docs]) isn't very well explained, let's go over it. Note that the input is not a bam file as is more common, but a bed file, in this case.
 +
 +
NOTE: bedtools is fussy about having TABs to separate columns. You need to ensure that they do. If using vim, you will find that the '''set expandtab''' is your enemy here.
  
 
  $ cat A.bed
 
  $ cat A.bed
Line 17: Line 19:
 
  chr1  20  30
 
  chr1  20  30
 
  chr2  0  500
 
  chr2  0  500
 +
 +
Bed files at a minimum give start and end positions on chromosomes or contigs. Bed files often have more columns, but for this example we are dealing with the minimum. This file merely states that a feature, here an arbitrary one, exists for positions 10 to 19 of chromosome 1, and 20 to 29. It's possible that they are different features because otherwise
 +
 +
chr1 10 30
 +
 +
would have also sufficed. Knowing that, chr2 needs no explanation.
 +
 +
For analysing the coverage a bed file has with respect to the genome is purports to describe, we need a genome file which really is just a list of the chromosomes and contigs and their lengths.
 
   
 
   
 
  $ cat my.genome
 
  $ cat my.genome
Line 22: Line 32:
 
  chr2  500
 
  chr2  500
 
   
 
   
 +
With this simple example we can calculate coverage withour eyes and it's clear that the longer chr1 in this case has very low coverage.
 
  $ bedtools genomecov -i A.bed -g my.genome
 
  $ bedtools genomecov -i A.bed -g my.genome
 
  chr1  0  980  1000  0.98
 
  chr1  0  980  1000  0.98
Line 29: Line 40:
 
  genome 1  520  1500  0.346667
 
  genome 1  520  1500  0.346667
  
Bed files at a minimum give start and end positions on chromosomes or contigs. Here we are dealing with the minimum. The '''A.bed''' file merely states that a feature, here an arbitrary one, exists for positions 10 to 19 of chromosome 1, and 20 to 29. It's possible that they are different features because otherwise
+
The '''my.genome''' merely gives the total length of both chromosomes. From this, genomecov can get to work, though it focuses on giving aggregate values, not positional values, for each chromosome. So the resulting output says that therw is zero coverage of a feature in 98% of chromsome 1, and then lumps the separately lined features on chromosome 1 together, so that they represent 2% of chromosome 1. The 5th column pretty much says the same as the 3th and 4th columns.
 
 
chr1 10 30
 
 
 
would have also sufficed. chr2 is self-explanatory. The '''my.genome''' merely gives the total length of both chromosomes. From this, genomecov can get to work, though it focuses on giving aggregate values, not positional values, for each chromosome. So the resulting output says that therw is zero coverage of a feature in 98% of chromsome 1, and then lumps the separately lined features on chromosome 1 together, so that they represent 2% of chromosome 1. The 5th column pretty much says the same as the 3th and 4th columns.
 
  
 
=== well-known "one-liner" ===
 
=== well-known "one-liner" ===

Revision as of 15:50, 5 March 2017

Introduction

General bioinformatics tool bult by the active Aaron Quinlan at University of Utah.

The focus of bedtools is genomic features, which can be variously defined.

Usage

bedtools is made out of various subcommands, and these may be dealt with separately

genomecov

The example in the documentation (bedtools genomecov docs) isn't very well explained, let's go over it. Note that the input is not a bam file as is more common, but a bed file, in this case.

NOTE: bedtools is fussy about having TABs to separate columns. You need to ensure that they do. If using vim, you will find that the set expandtab is your enemy here.

$ cat A.bed
chr1  10  20
chr1  20  30
chr2  0   500

Bed files at a minimum give start and end positions on chromosomes or contigs. Bed files often have more columns, but for this example we are dealing with the minimum. This file merely states that a feature, here an arbitrary one, exists for positions 10 to 19 of chromosome 1, and 20 to 29. It's possible that they are different features because otherwise

chr1 10 30

would have also sufficed. Knowing that, chr2 needs no explanation.

For analysing the coverage a bed file has with respect to the genome is purports to describe, we need a genome file which really is just a list of the chromosomes and contigs and their lengths.

$ cat my.genome
chr1  1000
chr2  500

With this simple example we can calculate coverage withour eyes and it's clear that the longer chr1 in this case has very low coverage.

$ bedtools genomecov -i A.bed -g my.genome
chr1   0  980  1000  0.98
chr1   1  20   1000  0.02
chr2   1  500  500   1
genome 0  980  1500  0.653333
genome 1  520  1500  0.346667

The my.genome merely gives the total length of both chromosomes. From this, genomecov can get to work, though it focuses on giving aggregate values, not positional values, for each chromosome. So the resulting output says that therw is zero coverage of a feature in 98% of chromsome 1, and then lumps the separately lined features on chromosome 1 together, so that they represent 2% of chromosome 1. The 5th column pretty much says the same as the 3th and 4th columns.

well-known "one-liner"

Due to being a very common task, his combination of genomecov piped to awk is very popular (ref. twitter link)

bedtools genomescov -ibam aln.bam -bga | awk '$4==0' | bedtools intersect -a someregions.bed -b - > out.txt


Explanation:

  • Background: A file called someregions.bed holds the chromosome ranges where features are present, but the bam file many not necessarily have the alignment
  • the "$4==0" is short form awk, which means "only print if Column 4 is zero"

genome file

If you're not dealing with a bam, and are using the -i option, then a genome file is required. In essence this gives the coordinate space to the genomecov tool, which the BAM file in the previous example already had, and therefore was not required. If you try to run genomecov without it you will get something like

Less than the req'd two fields were encountered in the genome file ...

The good news is that you can get a genome file from your referende in the following manner. First use samtools faidx to get an index file from your reference which in this case is lyrata_genome.fa:

samtools faidx lyrata_genome.fa

This will give a fai index file. We just need hte first and second columns of this for our genome file, separated by a tab:

awk -v OFS='\t' {'print $1,$2'} lyrata_genome.fa.fai > lyrata_genomeFile.txt

Links