Hdi2u rendertotsv exercise

From wiki
Jump to: navigation, search


  • Rendering output data to a spreadsheet can be very useful, as it seems to make the data more digestible.
  • Spreadsheet programs like Excel, do have alot of calculating and graphics functionality, but can only cope with reduced datasets.
  • In this exercise we'll examine how to get Linux data into a layout and format that Excel can read.

Part 1: Combine many to 1

  • Although we get many large monolithic sequences, we also get very small ones.
  • many tools will out many small files, which some how need combining
  • In this exercise we have very many files which all must go into one spreadsheet

What do we have?

  • A compressed zip file, utr.zip. Can we look without decompressing?
  • Decompress and take a look at one file. How many columns?
  • How uniform are the files?

= Exercise:

Notes: TSV, means tab seperated columns and is very close to being a spreasheet.

Surely, we can concatenate them all into one file and give it a *.tsv extension

We can try it:

cat *.txt > ourss.tsv

Let's have a look at it. WHat do you think?

We'll get rid of the blank lines amd also include the name of the file as a header.

for i in `cat f.lst`; do echo $i >> all.txt; sed -e /^$/d $i >> all.txt; echo >> all.txt; done


  • the empty lines in the files were a it annoying, but empty lines and also blank columns

are common in old bioinformatics files, it makes visual inspection easier.

  • Can we use the meta data inherent in the file names

Part 2: Background

  • From a reference sequence annotation of an organism with 14 chromosomes, coordinates for a gene family KIR have been isolated.
  • A mapping of reads to this region, and a bedtools intersect operation was carried out, after which coordinates were converted to lengths.
  • File: a list of reference sequence genes and their sizes: kir_ref_sizes.tsv
  • 3 files: Read intersects filtered for 3 different coverage levels:

Actually we can refer to these three almost as if they were one file:


Key problem

  • We want to view this in a spreadsheet, with reference gene lengths and read coverage length in columns, so we can visually calculate percentage coverage.
  • Some genes in the higher coverage level reads will have been filtered, i.e. each gene has row: absent rows
  • Of course we could manually re-arrange in a spreadsheet, or mentally build up some command lines
  • We can lean on grep, it will print out occurences in all files of a certain string (gene name)

Look at our data

head kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv


grep chr01_0290 kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv
grep chr02_1410 kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv
wc -l kirreads_covlev{1,2,3}.tsv
  • Notice how the results come out in separate row, and often not the same number of rows.
  • The decreasing coverage sizes show us the filter at work, so we know we are working with the right data at least.

Start with a very basic for-loop

  • We can start by picking out the gene names from the reference
  • But it more than just a list of names, it has values too which must be cut out by using cut
for i in $(cut -f1 geneskir_ref_sizes.tsv); do echo $i;done
  • We prefer a cleaner gene name, and perhaps we not interested in seeing the whole list
for i in $(cut -f1 kir_ref_sizes.tsv); do echo ${i%.*};done | head

${i%.*} is a brace expansion: it strips the variable from the last dot onwards.

  • Fine, so we're ready now to use grep instead of echo here.
  • We need to feed it the reference file and 3 read coverage files:
for i in $(cut -f1 kir_ref_sizes.tsv); \
   do grep ${i%.*} kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv; \
   echo;done | head
  • Note above how we inserted an echo which gives us a blank line, this will be useful
  • Remove | head and add > kirgenecov.txt to capture into a file.
  • Have a look at kirgenecov.txt: how well it looking?
  • It's not great on the eyes, but we can indeed verify that all the genes are matched up properly, which is useful.

Crank up the complexity: conditionals

  • We only want the gene names for the first column
  • We need to identify which column we are on and choose an output depending on which column
  • We use a counter called C which we increment for each column (C++) and set to zero on a blank line ((/^$))
  • We clean up the gene name by choosing 15 characters starting from character 19: substr($1,19,15)
  • printf is formatted print, it is practially available in all languages and gives us high control over how our output looks
awk 'BEGIN{C=0}{if(/^$/) {printf "\n"; C=0} \
   else {if(C==0) {N=substr($1,19,15);printf "%s\t%i\t",N,$2; C++} \
   else {printf "%i\t",$2;C++}}}' \
   kirgenecov.txt > kirgenecov.tsv
  • Have a look at this file. It shoulld be ready for opening in a spreasheet.
  • Sure, we have to add a header row. That's not hard.
  • What about blank cells? Cn do it manuall, but there are other much bigger gene families .. we need the zeros!

Inputting zeros automatically

  • Adding zeros so in columns 3 and 4 where no values exist
  • We can use awk again, it has a variable NF, number of fields, essentially the number of columns in each row
  • We can see that when there is a row missing the final column (NF==4), we need an extra zero
  • When there are two rows missing the final column (NF==3), we need two extra zeros
  • printf "%s",$0 allows us print the whole line without the final newline
awk '{printf "%s",$0; \
   if(NF==5) printf "\n"; else if(NF==4) printf "0\n"; else if(NF==3) printf "0\t0\n"}' \
   kirgenecov.tsv > kirgenecov0.tsv
  • And that's it! All the spreadsheet needs now is a header file, adn we can play around with the number in traditional Excel fashion.
  • Final question if there's time: Let's say we wanted comma-separated values, what easy step could we take (in Command-line Navigation section).