MinION Coverage sensitivity analysis

Latest revision as of 12:23, 9 April 2017

Introduction

MinION is a new (at time of writing, 2017) sequencing technology capable of very long reads (sometimes as long as 30 kbp).

Due to its lower maturity compared to other technologies, such as Illumina, it is also less robust, leading to scenarios where read quality can be too low. In Illumina, such reads are usually discarded, but in MinION that decision is less easy to make, because:

  • longer reads are inherently more valuable due to their ability to span long genomic regions
  • too high a quality threshold may discard the majority of reads

While retaining low quality reads is not usually worthwhile in an experiment where high quality reads are in abundance, here it is beneficial to at least accompany them with a sensitivity analysis, which may help show how useful such reads are.

Conversion of base-called fast5 to fastq

This is done with poretools. Filters for minimum read length can be employed. The average read length in this experiment was 4000 bp, so it was decided to use two filters, one at 1 kbp and another at 2 kbp. Using two values helps show how valuable stringency on read length could be.

poretools fastq --min-length 1000 --type fwd read_directory/ > allr_1k.fastq

and

poretools fastq --min-length 2000 --type fwd read_directory/ > allr_2k.fastq

where:

  • --min-length 2000 is the read length filter, so that no reads under 2 kbp are included in the fastq
  • --type fwd only includes forward reads. The vast majority were forward reads. There were very few reverse reads, so these were not included.
  • note how a directory is presented to the command: by default poretools will process all the fast5 files held in that directory.
  • the output is in fastq format and is directed to a file
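
To gauge how many reads each length threshold would keep before committing to one, the read lengths can be tallied directly from a fastq file. The following is a minimal Python sketch, not part of poretools: the function names and the toy reads are hypothetical, and it assumes a plain four-lines-per-record fastq in which a read is kept when its length is at least the threshold.

```python
from io import StringIO


def read_lengths(fastq_handle):
    """Yield the sequence length of each record in a 4-line-per-record fastq."""
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # the second line of every record is the sequence
            yield len(line.strip())


def surviving_reads(lengths, thresholds):
    """Map each minimum-length threshold to the number of reads it would keep."""
    lengths = list(lengths)
    return {t: sum(1 for n in lengths if n >= t) for t in thresholds}


def toy_record(name, base, length):
    """Build one made-up fastq record (hypothetical data, for illustration only)."""
    return "@{}\n{}\n+\n{}\n".format(name, base * length, "!" * length)


# Three made-up reads of 500 bp, 1500 bp and 4000 bp.
example = StringIO(
    toy_record("r1", "A", 500)
    + toy_record("r2", "C", 1500)
    + toy_record("r3", "G", 4000)
)

kept = surviving_reads(read_lengths(example), thresholds=[1000, 2000])
print(kept)
```

With a real file, the StringIO object would be replaced by an open file handle on the fastq of interest.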

Subtract Operations

In the spirit of discovering new sequences, this involves subtracting what is currently annotated from our alignment dataset.

This is a highly untargeted approach because all currently annotated genes and interesting areas are subtracted, so that the coverage can only describe currently unknown areas of the genome, including non-coding areas.

Though this might sound unfruitful to start with, it has some benefits:

  • useful to gain a feeling of general coverage
  • reduces reliance on currently known areas, which may have been chosen because of high coverage in previous studies.
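
The subtraction itself can be illustrated on plain intervals. This is a small Python sketch of the idea only, under the assumption that both alignments and annotations are given as (chromosome, start, end) tuples with 0-based, half-open coordinates as in bed files; the function name and the example coordinates are hypothetical.

```python
def subtract_intervals(a, b):
    """Return the parts of each interval in `a` not covered by any interval in `b`.

    Intervals are (chrom, start, end) tuples, 0-based and half-open.
    """
    result = []
    for chrom, start, end in a:
        pieces = [(start, end)]
        for b_chrom, b_start, b_end in b:
            if b_chrom != chrom:
                continue
            next_pieces = []
            for p_start, p_end in pieces:
                if b_end <= p_start or b_start >= p_end:
                    next_pieces.append((p_start, p_end))  # no overlap, keep whole
                    continue
                if p_start < b_start:
                    next_pieces.append((p_start, b_start))  # part left of the annotation
                if b_end < p_end:
                    next_pieces.append((b_end, p_end))  # part right of the annotation
            pieces = next_pieces
        result.extend((chrom, s, e) for s, e in pieces)
    return result


# Made-up alignment ranges and annotated ranges.
alignments = [("chr1", 100, 500), ("chr2", 0, 50)]
annotated = [("chr1", 200, 300), ("chr1", 450, 600)]

novel = subtract_intervals(alignments, annotated)
for chrom, start, end in novel:
    print(chrom, start, end, sep="\t")
```

What remains after the subtraction is the aligned territory that no current annotation accounts for.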

Intersect Operations

This is useful to see how much of the current annotations the sequencing reads were able to recover.
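
One way to put a number on "how much" is the fraction of annotated bases covered by at least one read. The Python sketch below illustrates that measure under stated assumptions: intervals are (chromosome, start, end) tuples in 0-based, half-open bed coordinates, annotations do not overlap one another, and the function names and example coordinates are hypothetical.

```python
def covered_bases(annotation, reads):
    """Count the bases of one annotation covered by at least one read interval."""
    chrom, start, end = annotation
    # Clip every overlapping read to the annotation and sort by start position.
    clipped = sorted(
        (max(start, r_start), min(end, r_end))
        for r_chrom, r_start, r_end in reads
        if r_chrom == chrom and r_start < end and r_end > start
    )
    total, cursor = 0, start
    for c_start, c_end in clipped:
        c_start = max(c_start, cursor)  # skip bases already counted
        if c_end > c_start:
            total += c_end - c_start
            cursor = c_end
    return total


def recovery_fraction(annotations, reads):
    """Fraction of annotated bases covered by reads (annotations assumed disjoint)."""
    annotated = sum(end - start for _, start, end in annotations)
    recovered = sum(covered_bases(a, reads) for a in annotations)
    return recovered / annotated if annotated else 0.0


# Made-up annotations and read alignment ranges.
genes = [("chr1", 100, 200), ("chr1", 300, 400)]
reads = [("chr1", 150, 250), ("chr1", 180, 220), ("chr2", 300, 400)]

fraction = recovery_fraction(genes, reads)
print(fraction)
```

Running the same calculation on the 1 kbp and the 2 kbp read sets would show how the length filter affects recovery of known annotations.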

Obtaining the sequence behind a range

As the bed file only holds chromosome names and ranges, it carries no sequence information. The sequence is obtained with the bedtools getfasta tool. This, however, needs a fasta file, and the correct one must be chosen: either the reference sequence can be used, or the sequences from the reads themselves, which would have to be obtained from the bam file.
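
What the extraction step does can be sketched in a few lines of Python. This is an illustration of the principle rather than a replacement for bedtools: it assumes 0-based, half-open bed coordinates and names each extracted record chrom:start-end; the function names and the toy reference are hypothetical.

```python
def parse_fasta(lines):
    """Read fasta lines into a dict mapping sequence name to sequence string."""
    sequences, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the first word of the header
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)
    return {n: "".join(parts) for n, parts in sequences.items()}


def get_fasta(sequences, bed_intervals):
    """Return (record_name, sequence) pairs for each bed interval."""
    records = []
    for chrom, start, end in bed_intervals:
        # bed ranges are 0-based and half-open, which matches Python slicing.
        records.append(("{}:{}-{}".format(chrom, start, end), sequences[chrom][start:end]))
    return records


# A made-up reference split over two lines, and two made-up ranges.
reference = parse_fasta([">chr1 toy reference", "ACGTACGTAC", "GGGTTT"])
extracted = get_fasta(reference, [("chr1", 2, 6), ("chr1", 10, 13)])

for name, sequence in extracted:
    print(">" + name)
    print(sequence)
```

Whichever fasta is supplied determines what comes out: the reference yields the expected sequence for a range, while read-derived sequences would reflect what was actually sequenced.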