CRAM is a further-compressed form of the BAM format. It is needed because of the large number of BAM files being generated in NGS.
Samtools is able to handle CRAM files, but the EBI's ENA department has produced a Java-based toolset called cramtools, available on marvin with:
module load cramtools
There is a wrapper script called cramtools.sh which may be used; it is simply a shortcut to the main program.
The main help is as follows:
Version 3.0-b203
Usage: cramtools [options] [command] [command options]
  Options:
    -h, --help    Print help and quit (default: false)
  Commands:
    bam           CRAM to BAM conversion.
    cram          BAM to CRAM converter.
    index         BAM/CRAM indexer.
    merge         Tool to merge CRAM or BAM files.
    fastq         CRAM to FastQ dump conversion.
    fixheader     A tool to fix CRAM header without re-writing the whole file.
    getref        Download reference sequences.
    qstat         Quality score statistics.
Cramtools does more than convert between CRAM and BAM. In particular, its getref subcommand can download the reference sequence associated with a CRAM file.
Usage: <main class> [options] A list of MD5 checksums for which the sequences should be downloaded.
  Options:
    --destination-file, -F    Destination file.
    --fasta-line-length       Wrap fasta lines accroding to this value. (default: 80)
    --gzip, -z                Compress fasta with gzip. (default: false)
    --ignore-not-found        Don't fail on not found sequences, just issue a warning. (default: false)
    --input-file, -I          The path to the CRAM or BAM file to extract sequence MD5 checksums.
    -h, --help                Print help and quit (default: false)
    -l, --log-level           Change log level: DEBUG, INFO, WARNING, ERROR. (default: ERROR)
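Several of these options can be combined in a single call. The following is a minimal sketch, not a tested recipe: fetch_refs is a hypothetical helper name, the output naming scheme is an arbitrary choice, and it assumes that $CRAMJARFILE is set by the module and that --destination-file and --gzip work together as the help text suggests.

```shell
# Hypothetical helper: download the references for one CRAM file into a
# gzipped FASTA named after it (e.g. sample.cram -> sample.fa.gz).
# Assumes $CRAMJARFILE points at the cramtools jar.
fetch_refs() {
    cram="$1"
    out="${cram%.cram}.fa.gz"   # strip the .cram suffix, add .fa.gz
    java -jar "$CRAMJARFILE" getref \
        --input-file "$cram" \
        --destination-file "$out" \
        --gzip \
        --ignore-not-found
}
```

It would be called as fetch_refs '17261_1#45.cram'. With --ignore-not-found, a sequence missing from the remote source should only produce a warning rather than stopping the run.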
The simplest invocation reads a CRAM file and redirects the FASTA output to a file:
java -jar $CRAMJARFILE getref -I 17261_1#45.cram >_45.fa
The first three lines of the output are as follows:
>ENA|CU458896|CU458896.1 ffb37b1f4cfc02b01cc8f3cafebf1e8e
TTGACTGACGAACTGAATTCCCAGTTCACGGCGGTATGGAATACCGTCGTCGCAGAGCTCAACGGTGACGACAATCAATA
TCTGTCGAGCTTCCCGCCGCTGACCCCGCAACAGCGCGCCTGGCTTACCCTCGTCAAACCACTCACCATGGCCGAGGGTT
So the output is a standard FASTA file, with what looks like an MD5 checksum on the ID line. This checksum can in fact also be used to obtain the same reference sequence via another method, as detailed on the ENA website.
That method does not download the ID line, only the raw sequence, and without the line wrapping that makes a FASTA file easy to view.
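To check that the code on the ID line really is an MD5 checksum of the sequence, it can be recomputed locally. The sketch below rests on an assumption: that the checksum is taken over the upper-cased sequence with line breaks removed, which is how the SAM specification defines its M5 tag. fasta_md5_check is a hypothetical helper name, it looks only at the first record in the file, and it relies on the GNU md5sum command being available.

```shell
# Hypothetical helper: compare the MD5 printed on the first ID line of a
# FASTA file with the MD5 recomputed from that record's sequence.
# Prints "<claimed> <computed>" and returns 0 only if the two match.
fasta_md5_check() {
    fa="$1"
    # Second whitespace-separated field of the ID line.
    claimed=$(head -n 1 "$fa" | awk '{print $2}')
    # Sequence lines of the first record only, joined and upper-cased.
    computed=$(awk 'NR > 1 && /^>/ { exit } NR > 1 { printf "%s", toupper($0) }' "$fa" \
        | md5sum | awk '{print $1}')
    echo "$claimed $computed"
    [ "$claimed" = "$computed" ]
}
```

For the file produced earlier this would be run as fasta_md5_check _45.fa. If the two values differ, the assumption about how the checksum is computed may simply not hold for that record.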