Difference between revisions of "Bwa"

From wiki
Jump to: navigation, search
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
= Introduction =
 
= Introduction =
  
Heng Li's aligner.
+
Samtools originator Heng Li coded this aligner in the c programming language.
  
= Usage =
+
= Overall Usage =
  
 
As with samtools, bwa also went through some re-structuring, so that it has an old-style tw-step ('''aln''' and '''sam{s,p}e''') usage, characterised by the following typical sequence of commands:
 
As with samtools, bwa also went through some re-structuring, so that it has an old-style tw-step ('''aln''' and '''sam{s,p}e''') usage, characterised by the following typical sequence of commands:
Line 12: Line 12:
 
  samtools view -bSu out.sam | samtools sort -  out.sorted
 
  samtools view -bSu out.sam | samtools sort -  out.sorted
  
And then a more modern usage which consists of just one step: '''bwa mem'''.
+
There is a more modern usage which consists of just one step: '''bwa mem'''. Again the output must be directed to a sam file name of your choosing. However the '''aln''' and '''samse''' or '''sampe''' methods are still useful for certain conditions. Please consult documentation.
 +
 
 +
Note how there are no option switches. The prefix for the reference must come first, and the reads must come second.
 +
 
 +
Also note that if there are several readsets, it is best to align and obtain a sam/bam file for each individually (of course, not pairs, the pairs count as one readset). Canonical usage
 +
 
 +
bwa mem index_prefix [input_reads.fastq|input_reads_pair_1.fastq input_reads_pair_2.fastq] [options] > <name>.sam
  
 
== Indexing ==
 
== Indexing ==
  
Before alignment, indexing the reference is necessary. When bwa indexes a reference, it will use the whole filename and generate output index files with extensions added onto this name.
+
Before alignment, indexing the reference is necessary. It is a relatively simple task:
 +
 
 +
bwa index <name_of_reference_file>
 +
 
 +
 
 +
In the case of of reference filename being "unplaced.scaf.fa", the output files will be:
 +
 
 +
* unplaced.scaf.fa.amb, a text file
 +
* unplaced.scaf.fa.ann, a text file
 +
* unplaced.scaf.fa.bwt, a binary file
 +
* unplaced.scaf.fa.pac, a binary file
 +
* unplaced.scaf.fa.sa, a binary file
 +
 
 +
There is only one argument in this case, so when bwa indexes the reference, it will use the whole filename and generate output index files with extensions added onto this name.
 +
 
 +
 
 +
If the reference fasta is referred to by it path and filename, the index files will be placed in the same directory as the reference FASTA, a natural place for them. A second argument is also allowed which allows a custom prefix to be used for these output index files.
  
 
  bwa index input_reference.fasta index_prefix
 
  bwa index input_reference.fasta index_prefix
  
This index_prefix is then used for the actual alignment step.  
+
This index_prefix name can then used for the actual alignment step. Otherwise the filename of the reference file will be used not as a full names but in a name prefix context.
  
 
Indexing may take a long time with large reference files. Here is some example output  
 
Indexing may take a long time with large reference files. Here is some example output  
Line 43: Line 65:
 
  [main] Real time: 5398.987 sec; CPU: 4273.064 sec
 
  [main] Real time: 5398.987 sec; CPU: 4273.064 sec
  
The output files will be, in this case:
 
 
* unplaced.scaf.fa.amb, a text file
 
* unplaced.scaf.fa.ann, a text file
 
* unplaced.scaf.fa.bwt, a binary file
 
* unplaced.scaf.fa.pac, a binary file
 
* unplaced.scaf.fa.sa, a binary file
 
  
 
== Actual alignment step ==
 
== Actual alignment step ==
  
  bwa mem index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq
+
  bwa mem index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq > output.sam
  
 
In this case we have paired read-files. With single reads of course only one name would be required.
 
In this case we have paired read-files. With single reads of course only one name would be required.
 +
 +
Note you will probably want to take advantage of threads, so for 8 threads you would run:
 +
 +
bwa mem -t 8 index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq > output.sam
 +
 +
== Usual post-processing ==
 +
 +
Though you can view these steps as optional, it's highy probably they will be needed in later steps:
 +
 +
* conversion of sam to bam with samtools
 +
* sorting of bam with samtools
 +
* indexing of the sorted file with samtools
 +
* MarkDuplicates with picard-tools
 +
 +
After the those four steps. you then have a clean (so-to-speak) bam file.

Latest revision as of 15:12, 30 May 2017

Introduction

Samtools originator Heng Li coded this aligner in the c programming language.

Overall Usage

As with samtools, bwa also went through some re-structuring, so that it has an old-style tw-step (aln and sam{s,p}e) usage, characterised by the following typical sequence of commands:

bwa index reference.fa
bwa aln -I -t 8 reference.fa s_1.txt > out.sai
bwa samse reference.fa out.sai s_1.txt > out.sam
samtools view -bSu out.sam | samtools sort -  out.sorted

There is a more modern usage which consists of just one step: bwa mem. Again the output must be directed to a sam file name of your choosing. However the aln and samse or sampe methods are still useful for certain conditions. Please consult documentation.

Note how there are no option switches. The prefix for the reference must come first, and the reads must come second.

Also note that if there are several readsets, it is best to align and obtain a sam/bam file for each individually (of course, not pairs, the pairs count as one readset). Canonical usage

bwa mem index_prefix [input_reads.fastq|input_reads_pair_1.fastq input_reads_pair_2.fastq] [options] > <name>.sam

Indexing

Before alignment, indexing the reference is necessary. It is a relatively simple task:

bwa index <name_of_reference_file>


In the case of of reference filename being "unplaced.scaf.fa", the output files will be:

  • unplaced.scaf.fa.amb, a text file
  • unplaced.scaf.fa.ann, a text file
  • unplaced.scaf.fa.bwt, a binary file
  • unplaced.scaf.fa.pac, a binary file
  • unplaced.scaf.fa.sa, a binary file

There is only one argument in this case, so when bwa indexes the reference, it will use the whole filename and generate output index files with extensions added onto this name.


If the reference fasta is referred to by it path and filename, the index files will be placed in the same directory as the reference FASTA, a natural place for them. A second argument is also allowed which allows a custom prefix to be used for these output index files.

bwa index input_reference.fasta index_prefix

This index_prefix name can then used for the actual alignment step. Otherwise the filename of the reference file will be used not as a full names but in a name prefix context.

Indexing may take a long time with large reference files. Here is some example output

[bwa_index] Pack FASTA... 150.69 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=4745806978, availableWord=345932252
[BWTIncConstructFromPacked] 10 iterations done. 99999986 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999986 characters processed.
.
.
.
[BWTIncConstructFromPacked] 530 iterations done. 4719199682 characters processed.
[BWTIncConstructFromPacked] 540 iterations done. 4741303618 characters processed.
[bwt_gen] Finished constructing BWT in 543 iterations.
[bwa_index] 2470.36 seconds elapse.
[bwa_index] Update BWT... 129.37 sec
[bwa_index] Pack forward-only FASTA... 302.45 sec
[bwa_index] Construct SA from BWT and Occ... 1220.18 sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa index unplaced.scaf.fa
[main] Real time: 5398.987 sec; CPU: 4273.064 sec


Actual alignment step

bwa mem index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq > output.sam

In this case we have paired read-files. With single reads of course only one name would be required.

Note you will probably want to take advantage of threads, so for 8 threads you would run:

bwa mem -t 8 index_prefix input_reads_pair_1.fastq input_reads_pair_2.fastq > output.sam

Usual post-processing

Though you can view these steps as optional, it's highy probably they will be needed in later steps:

  • conversion of sam to bam with samtools
  • sorting of bam with samtools
  • indexing of the sorted file with samtools
  • MarkDuplicates with picard-tools

After the those four steps. you then have a clean (so-to-speak) bam file.