Difference between revisions of "Bottlenose dolphin population genomic analysis"
(13 intermediate revisions by the same user not shown) | |||
Line 7: | Line 7: | ||
== Quality trimming == | == Quality trimming == | ||
− | Using trimmomatic | + | Using trimmomatic, which allows various quality trimming options |
+ | |||
+ | java -jar $TRIMMOJARFILE PE -threads 6 -phred33 <path/to/forward_read> <path/to/reverse_read> <forward_paired_name> <forward_unpaired_name><reverse_paired_name><reverse_paired_name> ILLUMINACLIP:/path/to/adapter/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75 | ||
+ | |||
+ | <ins>Explanation</ins>: | ||
+ | * There are six arguments unattached to a particular option, consisting of the 2 inputs and the four outputs (because we will have paired and unpaired trimmed datasets). They are unattached because their exact position and order is assumed to be as above. | ||
+ | * ILLUMINACLIP | ||
+ | * LEADING | ||
+ | * TRAILING | ||
+ | * SLIDINGWINDOW | ||
+ | * MINLEN | ||
+ | |||
+ | == Map to mitochondrial reference genome == | ||
MITOREFLOC=/storage/home/users/ml228/Genomics/NEA_BGI_data_ALL/NEA_BGI_clean_raw_data/Refs/mitogenome/gi_557468684_gb_KF570351.1_.fasta | MITOREFLOC=/storage/home/users/ml228/Genomics/NEA_BGI_data_ALL/NEA_BGI_clean_raw_data/Refs/mitogenome/gi_557468684_gb_KF570351.1_.fasta | ||
Line 13: | Line 25: | ||
* FQN1, the forward paired output from Trimmomatic | * FQN1, the forward paired output from Trimmomatic | ||
* FQN2, the reverse paired output from Trimmomatic | * FQN2, the reverse paired output from Trimmomatic | ||
− | * | + | * MTSAM, the raw sam alignment to mito, unsorted |
− | MTBAM | + | * MTBAM, the alignment to mito, sorted and now in bam format. |
− | + | ||
+ | * UBAM, the unmapped bam reflecting exclusion of mitochondrial mappings | ||
TFQ1=${SSD1}/${SDN}/${SSD1}_${SRN1}_${SLN1}_forward_paired.fq.gz # trimmed fq data file, forward paired reads | TFQ1=${SSD1}/${SDN}/${SSD1}_${SRN1}_${SLN1}_forward_paired.fq.gz # trimmed fq data file, forward paired reads | ||
TFQ2=${SSD2}/${SDN}/${SSD2}_${SRN2}_${SLN2}_reverse_paired.fq.gz # trimmed fq data file, reverse paired reads | TFQ2=${SSD2}/${SDN}/${SSD2}_${SRN2}_${SLN2}_reverse_paired.fq.gz # trimmed fq data file, reverse paired reads | ||
Line 25: | Line 38: | ||
First we align to the mitochondrial reference: | First we align to the mitochondrial reference: | ||
− | bwa mem -t $NSLOTS $MITOREFLOC | + | bwa mem -t $NSLOTS $MITOREFLOC <quality_trimmed_forward_paired> <quality_trimmed_reverse_paired> > <alignedtomitochondrial_samfile> |
The resulting sam is sorted and converted to bam format: | The resulting sam is sorted and converted to bam format: | ||
− | samtools view -bSh | + | samtools view -bSh <alignedtomitochondrial_samfile> | samtools sort - <alignedtomitochondrial_bamfile_withoutextension> -@ $NSLOTS |
+ | |||
+ | Note that the ''''.bam''' extension is not required in the output of samtools sort when its input and output are bam. | ||
+ | |||
+ | == Re-generate readsets which will no longer map to mitochondrial genome == | ||
− | Actually we're only interested in the reads that did not map to the mitochondrial, and we can extract them into the their own bam with: | + | Actually, we're only interested in the reads that did not map to the mitochondrial, and we can extract them into the their own bam with: |
− | samtools view -f4 -bh | + | samtools view -f4 -bh <alignedtomitochondrial_bamfile> > <unmappedotmito_bamfile> -@ $NSLOTS |
<ins>Explanation</ins>: | <ins>Explanation</ins>: | ||
Line 48: | Line 65: | ||
bwa mem -t $NSLOTS $REFLOC $FQN1 $FQN2 > $REASAM | bwa mem -t $NSLOTS $REFLOC $FQN1 $FQN2 > $REASAM | ||
− | The raw sam can be sorted and converted to bam to produce the alignment with the reads mapping to | + | The raw sam can be sorted and converted to bam to produce the alignment with the reads mapping to the mitochondrial reference excluded as follows: |
samtools view -bSh ${REASAM} | samtools sort - ${REABAM%.*} -@ $NSLOTS | samtools view -bSh ${REASAM} | samtools sort - ${REABAM%.*} -@ $NSLOTS | ||
+ | |||
+ | == Marking Duplicates == | ||
+ | |||
+ | As usual, the picard-tools function will be used for this. It can operate on multiple BAM files in one run, so we may apply it on a per-sample basis because we have several datasets for each sample in this pipeline. | ||
+ | |||
+ | java -Xmx32G -jar $PICARDJARFILE MarkDuplicates REMOVE_DUPLICATES=true I=<bamfile1> I=<bamfile2> I=... OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 O=<de-duplicated_bamfile> M=<marked_dup_metrics.txt> | ||
+ | |||
+ | Both the '''O''' (output) and '''M''' (metrics) options (filenames with relative directory paths) are chosen by the user. | ||
+ | |||
+ | We also generate an index for this bamfile which helps tools in downstream analysis. | ||
+ | |||
+ | samtools index <de-duplicated_bamfile> <de-duplicated_baifile> | ||
+ | |||
+ | == Realigning after duplicate identification == | ||
+ | |||
+ | An initial "target creator" step is required first because the '''IndelRealigner''' has a '''-targetIntervals''' option. | ||
+ | |||
+ | java -Xmx16G -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ${REFSEQLOC} -I <de-duplicated_bamfile> -o <chosen_targetcreator_intervals_filename> | ||
+ | |||
+ | Now the '''IndelRealigner''' tool can be invoked: | ||
+ | |||
+ | java -Xmx16G -Djava.io.tmpdir=/tmp -jar /path/to/GenomeAnalysisTK.jar -T IndelRealigner -R ${REFSEQLOC} -targetIntervals <chosen_targetcreator_intervals_filename> -I <de-duplicated_bamfile> -o <chosen_realigned_bamfile> | ||
+ | |||
+ | An extra bit of quality filtering is also included here: | ||
+ | samtools view -bh -F4 -q 30 <chosen_realigned_bamfile> <realigned_qualityfiltered_bamfile> -@ $NSLOTS | ||
+ | |||
+ | == Repeatmasking using a bed file == | ||
+ | |||
+ | In this case, we have already run a repeatmasking exercise on the reference genome so we may safely apply it to the quality filtered bamfile produced in the previous step using bedtools: | ||
+ | |||
+ | intersectBed -abam <realigned_qualityfiltered_bamfile> -b <bedfile> -v | samtools view -h - | samtools view -bSh - > <no_repeats_bamfile> | ||
+ | |||
+ | where the bedfile represents the repeatmasked regions. |
Latest revision as of 16:07, 14 March 2017
Contents
Introduction
Used for population genetics studies.
Stages
Quality trimming
Using trimmomatic, which allows various quality trimming options
java -jar $TRIMMOJARFILE PE -threads 6 -phred33 <path/to/forward_read> <path/to/reverse_read> <forward_paired_name> <forward_unpaired_name><reverse_paired_name><reverse_paired_name> ILLUMINACLIP:/path/to/adapter/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75
Explanation:
- There are six arguments unattached to a particular option, consisting of the 2 inputs and the four outputs (because we will have paired and unpaired trimmed datasets). They are unattached because their exact position and order is assumed to be as above.
- ILLUMINACLIP
- LEADING
- TRAILING
- SLIDINGWINDOW
- MINLEN
Map to mitochondrial reference genome
MITOREFLOC=/storage/home/users/ml228/Genomics/NEA_BGI_data_ALL/NEA_BGI_clean_raw_data/Refs/mitogenome/gi_557468684_gb_KF570351.1_.fasta
- FQN1, the forward paired output from Trimmomatic
- FQN2, the reverse paired output from Trimmomatic
- MTSAM, the raw sam alignment to mito, unsorted
- MTBAM, the alignment to mito, sorted and now in bam format.
- UBAM, the unmapped bam reflecting exclusion of mitochondrial mappings
TFQ1=${SSD1}/${SDN}/${SSD1}_${SRN1}_${SLN1}_forward_paired.fq.gz # trimmed fq data file, forward paired reads TFQ2=${SSD2}/${SDN}/${SSD2}_${SRN2}_${SLN2}_reverse_paired.fq.gz # trimmed fq data file, reverse paired reads ACPTSZ=1000000 # acceptable size of a bam, a somewhat (only) risky way of seeing that it's OK.
REABAM=${SSD1}/${MOD}/aln-pe${SSD1}_${SRN1}_${SLN1}_sorted.bam REASAM=${SSD1}/${MOD}/aln-pe${SSD1}_${SRN1}_${SLN1}.sam
First we align to the mitochondrial reference:
bwa mem -t $NSLOTS $MITOREFLOC <quality_trimmed_forward_paired> <quality_trimmed_reverse_paired> > <alignedtomitochondrial_samfile>
The resulting sam is sorted and converted to bam format:
samtools view -bSh <alignedtomitochondrial_samfile> | samtools sort - <alignedtomitochondrial_bamfile_withoutextension> -@ $NSLOTS
Note that the '.bam extension is not required in the output of samtools sort when its input and output are bam.
Re-generate readsets which will no longer map to mitochondrial genome
Actually, we're only interested in the reads that did not map to the mitochondrial, and we can extract them into the their own bam with:
samtools view -f4 -bh <alignedtomitochondrial_bamfile> > <unmappedotmito_bamfile> -@ $NSLOTS
Explanation:
- -f4 specifies that only reads with a SAM FLAG setting of 1 on the third bit will be output, This is the setting that indicates the map did not not align. Essentially, this is the key part of this whole stage, it enables us obtain the reads which did not map to the mitichondrial reference, and also avoid those that did.
We can convert this extracted alignment back into FASTQ files with picard-tools and then compress as follows:
java -Xmx16g -jar $PICARDJARFILE SamToFastq I=${UBAM} F=${FQN1%.*} F2=${FQN2%.*} pigz -f -p $NSLOTS ${FQN1%.*} pigz -f -p $NSLOTS ${FQN2%.*}
These new reads can be aligned to the true reference
bwa mem -t $NSLOTS $REFLOC $FQN1 $FQN2 > $REASAM
The raw sam can be sorted and converted to bam to produce the alignment with the reads mapping to the mitochondrial reference excluded as follows:
samtools view -bSh ${REASAM} | samtools sort - ${REABAM%.*} -@ $NSLOTS
Marking Duplicates
As usual, the picard-tools function will be used for this. It can operate on multiple BAM files in one run, so we may apply it on a per-sample basis because we have several datasets for each sample in this pipeline.
java -Xmx32G -jar $PICARDJARFILE MarkDuplicates REMOVE_DUPLICATES=true I=<bamfile1> I=<bamfile2> I=... OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 O=<de-duplicated_bamfile> M=<marked_dup_metrics.txt>
Both the O (output) and M (metrics) options (filenames with relative directory paths) are chosen by the user.
We also generate an index for this bamfile which helps tools in downstream analysis.
samtools index <de-duplicated_bamfile> <de-duplicated_baifile>
Realigning after duplicate identification
An initial "target creator" step is required first because the IndelRealigner has a -targetIntervals option.
java -Xmx16G -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ${REFSEQLOC} -I <de-duplicated_bamfile> -o <chosen_targetcreator_intervals_filename>
Now the IndelRealigner tool can be invoked:
java -Xmx16G -Djava.io.tmpdir=/tmp -jar /path/to/GenomeAnalysisTK.jar -T IndelRealigner -R ${REFSEQLOC} -targetIntervals <chosen_targetcreator_intervals_filename> -I <de-duplicated_bamfile> -o <chosen_realigned_bamfile>
An extra bit of quality filtering is also included here:
samtools view -bh -F4 -q 30 <chosen_realigned_bamfile> <realigned_qualityfiltered_bamfile> -@ $NSLOTS
Repeatmasking using a bed file
In this case, we have already run a repeatmasking exercise on the reference genome so we may safely apply it to the quality filtered bamfile produced in the previous step using bedtools:
intersectBed -abam <realigned_qualityfiltered_bamfile> -b <bedfile> -v | samtools view -h - | samtools view -bSh - > <no_repeats_bamfile>
where the bedfile represents the repeatmasked regions.