Hdi2u rendertotsv exercise
Contents
Introduction
- Rendering output data to a spreadsheet can be very useful, as it seems to make the data more digestible.
- Spreadsheet programs like Excel, do have alot of calculating and graphics functionality, but can only cope with reduced datasets.
- In this exercise we'll examine how to get Linux data into a layout and format that Excel can read.
Part 1: Combine many to 1
- Although we get many large monolithic sequences, we also get very small ones.
- many tools will out many small files, which some how need combining
- In this exercise we have very many files which all must go into one spreadsheet
What do we have?
- A compressed zip file,
utr.zip
. Can we look without decompressing? - Decompress and take a look at one file. How many columns?
- How uniform are the files?
= Exercise:
Notes: TSV, means tab seperated columns and is very close to being a spreasheet.
Surely, we can concatenate them all into one file and give it a *.tsv extension
We can try it:
cat *.txt > ourss.tsv
Let's have a look at it. WHat do you think?
We'll get rid of the blank lines amd also include the name of the file as a header.
for i in `cat f.lst`; do echo $i >> all.txt; sed -e /^$/d $i >> all.txt; echo >> all.txt; done
Learnings:
- the empty lines in the files were a it annoying, but empty lines and also blank columns
are common in old bioinformatics files, it makes visual inspection easier.
- Can we use the meta data inherent in the file names
Part 2: Background
- From a reference sequence annotation of an organism with 14 chromosomes, coordinates for a gene family KIR have been isolated.
- A mapping of reads to this region, and a bedtools intersect operation was carried out, after which coordinates were converted to lengths.
- File: a list of reference sequence genes and their sizes:
kir_ref_sizes.tsv
- 3 files: Read intersects filtered for 3 different coverage levels:
kirreads_covlev1.tsv kirreads_covlev2.tsv kirreads_covlev3.tsv
Actually we can refer to these three almost as if they were one file:
kirreads_covlev{1,2,3}.tsv
Key problem
- We want to view this in a spreadsheet, with reference gene lengths and read coverage length in columns, so we can visually calculate percentage coverage.
- Some genes in the higher coverage level reads will have been filtered, i.e. each gene has row: absent rows
- Of course we could manually re-arrange in a spreadsheet, or mentally build up some command lines
- We can lean on grep, it will print out occurences in all files of a certain string (gene name)
Look at our data
head kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv
Try:
grep chr01_0290 kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv grep chr02_1410 kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv wc -l kirreads_covlev{1,2,3}.tsv
- Notice how the results come out in separate row, and often not the same number of rows.
- The decreasing coverage sizes show us the filter at work, so we know we are working with the right data at least.
Start with a very basic for-loop
- We can start by picking out the gene names from the reference
- But it more than just a list of names, it has values too which must be cut out by using
cut
for i in $(cut -f1 geneskir_ref_sizes.tsv); do echo $i;done
- We prefer a cleaner gene name, and perhaps we not interested in seeing the whole list
for i in $(cut -f1 kir_ref_sizes.tsv); do echo ${i%.*};done | head
${i%.*}
is a brace expansion: it strips the variable from the last dot onwards.
- Fine, so we're ready now to use grep instead of echo here.
- We need to feed it the reference file and 3 read coverage files:
for i in $(cut -f1 kir_ref_sizes.tsv); \ do grep ${i%.*} kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv; \ echo;done | head
- Note above how we inserted an echo which gives us a blank line, this will be useful
- Remove
| head
and add> kirgenecov.txt
to capture into a file. - Have a look at
kirgenecov.txt
: how well it looking? - It's not great on the eyes, but we can indeed verify that all the genes are matched up properly, which is useful.
Crank up the complexity: conditionals
- We only want the gene names for the first column
- We need to identify which column we are on and choose an output depending on which column
- We use a counter called C which we increment for each column (
C++
) and set to zero on a blank line ((/^$)
) - We clean up the gene name by choosing 15 characters starting from character 19:
substr($1,19,15)
-
printf
is formatted print, it is practially available in all languages and gives us high control over how our output looks
awk 'BEGIN{C=0}{if(/^$/) {printf "\n"; C=0} \ else {if(C==0) {N=substr($1,19,15);printf "%s\t%i\t",N,$2; C++} \ else {printf "%i\t",$2;C++}}}' \ kirgenecov.txt > kirgenecov.tsv
- Have a look at this file. It shoulld be ready for opening in a spreasheet.
- Sure, we have to add a header row. That's not hard.
- What about blank cells? Cn do it manuall, but there are other much bigger gene families .. we need the zeros!
Inputting zeros automatically
- Adding zeros so in columns 3 and 4 where no values exist
- We can use awk again, it has a variable NF, number of fields, essentially the number of columns in each row
- We can see that when there is a row missing the final column (NF==4), we need an extra zero
- When there are two rows missing the final column (NF==3), we need two extra zeros
-
printf "%s",$0
allows us print the whole line without the final newline
awk '{printf "%s",$0; \ if(NF==5) printf "\n"; else if(NF==4) printf "0\n"; else if(NF==3) printf "0\t0\n"}' \ kirgenecov.tsv > kirgenecov0.tsv
- And that's it! All the spreadsheet needs now is a header file, adn we can play around with the number in traditional Excel fashion.
- Final question if there's time: Let's say we wanted comma-separated values, what easy step could we take (in Command-line Navigation section).