Difference between revisions of "Bedtools"
(6 intermediate revisions by the same user not shown) | |||
Line 10: | Line 10: | ||
== genomecov == | == genomecov == | ||
+ | |||
+ | The easiest output format is obtained with the '''-bga''' option. Otherwise the default output requires the following explanation. | ||
The example in the documentation ([http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html bedtools genomecov docs]) isn't very well explained, let's go over it. Note that the input is not a bam file as is more common, but a bed file, in this case. | The example in the documentation ([http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html bedtools genomecov docs]) isn't very well explained, let's go over it. Note that the input is not a bam file as is more common, but a bed file, in this case. | ||
+ | |||
+ | NOTE: bedtools is fussy about having TABs to separate columns. You need to ensure that they do. If using vim, you will find that the '''set expandtab''' is your enemy here. | ||
$ cat A.bed | $ cat A.bed | ||
Line 17: | Line 21: | ||
chr1 20 30 | chr1 20 30 | ||
chr2 0 500 | chr2 0 500 | ||
+ | |||
+ | Bed files at a minimum give start and end positions on chromosomes or contigs. Bed files often have more columns, but for this example we are dealing with the minimum. This file merely states that a feature, here an arbitrary one, exists for positions 10 to 19 of chromosome 1, and 20 to 29. It's possible that they are different features because otherwise | ||
+ | |||
+ | chr1 10 30 | ||
+ | |||
+ | would have also sufficed. Knowing that, chr2 needs no explanation. | ||
+ | |||
+ | For analysing the coverage a bed file has with respect to the genome is purports to describe, we need a genome file which really is just a list of the chromosomes and contigs and their lengths. | ||
$ cat my.genome | $ cat my.genome | ||
Line 22: | Line 34: | ||
chr2 500 | chr2 500 | ||
+ | The command and output is as follows: | ||
+ | |||
$ bedtools genomecov -i A.bed -g my.genome | $ bedtools genomecov -i A.bed -g my.genome | ||
chr1 0 980 1000 0.98 | chr1 0 980 1000 0.98 | ||
Line 29: | Line 43: | ||
genome 1 520 1500 0.346667 | genome 1 520 1500 0.346667 | ||
− | + | The '''my.genome''' merely gives the total length of both chromosomes. From this, genomecov can get to work, bear in mind that it discards any positional information the bed files contains. It concentrates solely on the coverage measure and gives a line for each coverage category. Here is an explanatio of the output format | |
+ | * first column is the chromosome or contig name. | ||
+ | * the second column is the coverage category for a certain section. | ||
+ | * the third column gives the size of the section corresponding to the current coverage category. | ||
+ | * the fourth column is a bit redundant, ti just gives the length of the current chromosome. | ||
+ | * the fifth column is the coverage calculation for the current line. | ||
− | + | So the above output output says that there is zero coverage of a feature in 98% of chromsome 1, and then lumps the separately lined features on chromosome 1 together, so that they represent 2% of chromosome 1. The 5th column pretty much says the same as the 3th and 4th columns. | |
− | + | This is the raw output and is somewhat cryptic. By adding the the '''-bga''' option we get a more intuitive output which gives the covergae category on the fourth column but in contiguous sections as one travels along the chromosomes/contigs. | |
− | == well-known "one-liner" == | + | === well-known "one-liner" === |
Due to being a very common task, his combination of genomecov piped to awk is very popular (ref. [https://twitter.com/aaronquinlan/status/421786507511205888 twitter link]) | Due to being a very common task, his combination of genomecov piped to awk is very popular (ref. [https://twitter.com/aaronquinlan/status/421786507511205888 twitter link]) | ||
Line 45: | Line 64: | ||
* Background: A file called someregions.bed holds the chromosome ranges where features are present, but the bam file many not necessarily have the alignment | * Background: A file called someregions.bed holds the chromosome ranges where features are present, but the bam file many not necessarily have the alignment | ||
* the "$4==0" is short form awk, which means "only print if Column 4 is zero" | * the "$4==0" is short form awk, which means "only print if Column 4 is zero" | ||
+ | |||
+ | === genome file=== | ||
+ | |||
+ | If you're not dealing with a bam, and are using the '''-i''' option, then a genome file is required. In essence this gives the coordinate space to the genomecov tool, which the BAM file in the previous example already had, and therefore was not required. If you try to run genomecov without it you will get something like | ||
+ | |||
+ | Less than the req'd two fields were encountered in the genome file ... | ||
+ | |||
+ | The good news is that you can get a genome file from your referende in the following manner. First use samtools faidx to get an index file from your reference which in this case is lyrata_genome.fa: | ||
+ | |||
+ | samtools faidx lyrata_genome.fa | ||
+ | |||
+ | This will give a '''fai''' index file. We just need hte first and second columns of this for our genome file, separated by a tab: | ||
+ | |||
+ | awk -v OFS='\t' {'print $1,$2'} lyrata_genome.fa.fai > lyrata_genomeFile.txt | ||
+ | |||
+ | = Links = | ||
+ | * [https://www.biostars.org/p/70795/ generating a genome file] |
Latest revision as of 17:05, 13 March 2017
Contents
Introduction
General bioinformatics tool bult by the active Aaron Quinlan at University of Utah.
The focus of bedtools is genomic features, which can be variously defined.
Usage
bedtools is made out of various subcommands, and these may be dealt with separately
genomecov
The easiest output format is obtained with the -bga option. Otherwise the default output requires the following explanation.
The example in the documentation (bedtools genomecov docs) isn't very well explained, let's go over it. Note that the input is not a bam file as is more common, but a bed file, in this case.
NOTE: bedtools is fussy about having TABs to separate columns. You need to ensure that they do. If using vim, you will find that the set expandtab is your enemy here.
$ cat A.bed chr1 10 20 chr1 20 30 chr2 0 500
Bed files at a minimum give start and end positions on chromosomes or contigs. Bed files often have more columns, but for this example we are dealing with the minimum. This file merely states that a feature, here an arbitrary one, exists for positions 10 to 19 of chromosome 1, and 20 to 29. It's possible that they are different features because otherwise
chr1 10 30
would have also sufficed. Knowing that, chr2 needs no explanation.
For analysing the coverage a bed file has with respect to the genome is purports to describe, we need a genome file which really is just a list of the chromosomes and contigs and their lengths.
$ cat my.genome chr1 1000 chr2 500
The command and output is as follows:
$ bedtools genomecov -i A.bed -g my.genome chr1 0 980 1000 0.98 chr1 1 20 1000 0.02 chr2 1 500 500 1 genome 0 980 1500 0.653333 genome 1 520 1500 0.346667
The my.genome merely gives the total length of both chromosomes. From this, genomecov can get to work, bear in mind that it discards any positional information the bed files contains. It concentrates solely on the coverage measure and gives a line for each coverage category. Here is an explanatio of the output format
- first column is the chromosome or contig name.
- the second column is the coverage category for a certain section.
- the third column gives the size of the section corresponding to the current coverage category.
- the fourth column is a bit redundant, ti just gives the length of the current chromosome.
- the fifth column is the coverage calculation for the current line.
So the above output output says that there is zero coverage of a feature in 98% of chromsome 1, and then lumps the separately lined features on chromosome 1 together, so that they represent 2% of chromosome 1. The 5th column pretty much says the same as the 3th and 4th columns.
This is the raw output and is somewhat cryptic. By adding the the -bga option we get a more intuitive output which gives the covergae category on the fourth column but in contiguous sections as one travels along the chromosomes/contigs.
well-known "one-liner"
Due to being a very common task, his combination of genomecov piped to awk is very popular (ref. twitter link)
bedtools genomescov -ibam aln.bam -bga | awk '$4==0' | bedtools intersect -a someregions.bed -b - > out.txt
Explanation:
- Background: A file called someregions.bed holds the chromosome ranges where features are present, but the bam file many not necessarily have the alignment
- the "$4==0" is short form awk, which means "only print if Column 4 is zero"
genome file
If you're not dealing with a bam, and are using the -i option, then a genome file is required. In essence this gives the coordinate space to the genomecov tool, which the BAM file in the previous example already had, and therefore was not required. If you try to run genomecov without it you will get something like
Less than the req'd two fields were encountered in the genome file ...
The good news is that you can get a genome file from your referende in the following manner. First use samtools faidx to get an index file from your reference which in this case is lyrata_genome.fa:
samtools faidx lyrata_genome.fa
This will give a fai index file. We just need hte first and second columns of this for our genome file, separated by a tab:
awk -v OFS='\t' {'print $1,$2'} lyrata_genome.fa.fai > lyrata_genomeFile.txt