

CRAM is a more highly compressed form of a BAM file. It is needed because of the large number of BAM files being generated in NGS.

Samtools is able to handle CRAM files, but the EBI's ENA department has also produced a Java-based toolset called cramtools, available on marvin with:

module load cramtools

There is also a wrapper script which may be used; it is simply a shortcut to the main program.


The main help is as follows:

Version 3.0-b203 

Usage: cramtools [options] [command] [command options]
 Options:
   -h, --help  Print help and quit (default: false)
   bam         CRAM to BAM conversion. 
   cram        BAM to CRAM converter. 
   index       BAM/CRAM indexer. 
   merge       Tool to merge CRAM or BAM files. 
   fastq       CRAM to FastQ dump conversion. 
   fixheader   A tool to fix CRAM header without re-writing the whole file.
   getref      Download reference sequences.
   qstat       Quality score statistics.

Cramtools does more than just convert between CRAM and BAM. In particular, it can download the reference sequence associated with a CRAM file, via its subcommand getref.

Usage: <main class> [options]
A list of MD5 checksums for which the sequences should be downloaded.
 Options:
   --destination-file, -F  Destination file.
   --fasta-line-length     Wrap fasta lines according to this value. (default: 80)
   --gzip, -z              Compress fasta with gzip. (default: false)
   --ignore-not-found      Don't fail on not found sequences, just issue a warning. (default: false)
   --input-file, -I        The path to the CRAM or BAM file to extract sequence MD5 checksums.
   -h, --help              Print help and quit (default: false)
   -l, --log-level         Change log level: DEBUG, INFO, WARNING, ERROR. (default: ERROR)

So we can invoke it as follows:

java -jar $CRAMJARFILE getref -I 17261_1#45.cram >_45.fa
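
The same call can be repeated over several CRAM files. The following is only a minimal sketch: the function names and the `.ref.fa` suffix are made up for illustration, and it assumes that `-F` (listed above as "Destination file") receives the FASTA output that the command above captures with `>`. Quoting the file names also keeps characters such as `#` from being misread by the shell.

```shell
# Derive an output FASTA name from a CRAM path: dir/sample.cram -> sample.ref.fa
# (the .ref.fa suffix is an arbitrary choice for this sketch).
ref_name_for() {
    base=${1##*/}
    printf '%s.ref.fa\n' "${base%.cram}"
}

# Run getref on each CRAM file given as an argument.
# Assumes $CRAMJARFILE is set, e.g. by "module load cramtools".
getref_all() {
    for cram in "$@"; do
        java -jar "$CRAMJARFILE" getref -I "$cram" -F "$(ref_name_for "$cram")" || return 1
    done
}
```

It could then be called as `getref_all *.cram` from the directory holding the CRAM files.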

The start of the resulting file is as follows (only the ID line is shown here):

>ENA|CU458896|CU458896.1 ffb37b1f4cfc02b01cc8f3cafebf1e8e

So we can see it is a standard FASTA file, with what looks like an MD5 checksum on the ID line. In fact, this checksum can also be used to obtain the same reference sequence via another method, as detailed on the ENA website.


That method does not download the ID line, just the raw sequence, with none of the line wrapping that makes a FASTA file easy to read.
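
As a rough illustration of looking a sequence up by checksum, the sketch below pulls the checksum from each FASTA ID line and requests the matching sequence. It rests on assumptions not stated on this page: that the last field of the ID line is the MD5 checksum, and that the ENA method meant here is the CRAM reference registry at `https://www.ebi.ac.uk/ena/cram/md5/<checksum>`. The function names and the `.seq` suffix are hypothetical.

```shell
# Print the last whitespace-separated field of every FASTA ID line.
# In the getref output shown above, this field looks like an MD5 checksum.
fasta_md5s() {
    awk '/^>/ { print $NF }' "$@"
}

# Build the (assumed) ENA CRAM reference registry URL for one checksum.
ena_md5_url() {
    printf 'https://www.ebi.ac.uk/ena/cram/md5/%s\n' "$1"
}

# Save the raw sequence for every checksum found in a FASTA file
# as <checksum>.seq. Needs curl and network access.
fetch_refs_by_md5() {
    fasta_md5s "$1" | while read -r md5; do
        curl -fsS "$(ena_md5_url "$md5")" > "$md5.seq"
    done
}
```

For example, `fetch_refs_by_md5 _45.fa` would request one sequence per ID line in the file produced earlier.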