Latest revision as of 13:45, 19 April 2017

Introduction

Rendering output data to a spreadsheet can be very useful, as it seems to make the data more digestible.
Spreadsheet programs like Excel, do have alot of calculating and graphics functionality, but can only cope with reduced datasets.
In this exercise we'll examine how to get Linux data into a layout and format that Excel can read.

Part 1: Combine many to 1

Although we get many large monolithic sequences, we also get very small ones.
many tools will out many small files, which some how need combining
In this exercise we have very many files which all must go into one spreadsheet

What do we have?

A compressed zip file, utr.zip. Can we look without decompressing?
Decompress and take a look at one file. How many columns?
How uniform are the files?

= Exercise:

Notes: TSV, means tab seperated columns and is very close to being a spreasheet.

Surely, we can concatenate them all into one file and give it a *.tsv extension

We can try it:

cat *.txt > ourss.tsv

Let's have a look at it. WHat do you think?

We'll get rid of the blank lines amd also include the name of the file as a header.

for i in `cat f.lst`; do echo $i >> all.txt; sed -e /^$/d $i >> all.txt; echo >> all.txt; done

Learnings:

the empty lines in the files were a it annoying, but empty lines and also blank columns

are common in old bioinformatics files, it makes visual inspection easier.

Can we use the meta data inherent in the file names

Part 2: Background

From a reference sequence annotation of an organism with 14 chromosomes, coordinates for a gene family KIR have been isolated.
A mapping of reads to this region, and a bedtools intersect operation was carried out, after which coordinates were converted to lengths.
File: a list of reference sequence genes and their sizes: kir_ref_sizes.tsv
3 files: Read intersects filtered for 3 different coverage levels:

kirreads_covlev1.tsv
kirreads_covlev2.tsv
kirreads_covlev3.tsv

Actually we can refer to these three almost as if they were one file:

kirreads_covlev{1,2,3}.tsv

Key problem

We want to view this in a spreadsheet, with reference gene lengths and read coverage length in columns, so we can visually calculate percentage coverage.
Some genes in the higher coverage level reads will have been filtered, i.e. each gene has row: absent rows
Of course we could manually re-arrange in a spreadsheet, or mentally build up some command lines
We can lean on grep, it will print out occurences in all files of a certain string (gene name)

Look at our data

head kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv

Try:

grep chr01_0290 kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv
grep chr02_1410 kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv
wc -l kirreads_covlev{1,2,3}.tsv

Notice how the results come out in separate row, and often not the same number of rows.
The decreasing coverage sizes show us the filter at work, so we know we are working with the right data at least.

Start with a very basic for-loop

We can start by picking out the gene names from the reference
But it more than just a list of names, it has values too which must be cut out by using cut

for i in $(cut -f1 geneskir_ref_sizes.tsv); do echo $i;done

We prefer a cleaner gene name, and perhaps we not interested in seeing the whole list

for i in $(cut -f1 kir_ref_sizes.tsv); do echo ${i%.*};done | head

${i%.*} is a brace expansion: it strips the variable from the last dot onwards.

Fine, so we're ready now to use grep instead of echo here.
We need to feed it the reference file and 3 read coverage files:

for i in $(cut -f1 kir_ref_sizes.tsv); \
   do grep ${i%.*} kir_ref_sizes.tsv kirreads_covlev{1,2,3}.tsv; \
   echo;done | head

Note above how we inserted an echo which gives us a blank line, this will be useful

Remove | head and add > kirgenecov.txt to capture into a file.
Have a look at kirgenecov.txt: how well it looking?
It's not great on the eyes, but we can indeed verify that all the genes are matched up properly, which is useful.

Crank up the complexity: conditionals

We only want the gene names for the first column
We need to identify which column we are on and choose an output depending on which column
We use a counter called C which we increment for each column (C++) and set to zero on a blank line ((/^$))
We clean up the gene name by choosing 15 characters starting from character 19: substr($1,19,15)
printf is formatted print, it is practially available in all languages and gives us high control over how our output looks

awk 'BEGIN{C=0}{if(/^$/) {printf "\n"; C=0} \
   else {if(C==0) {N=substr($1,19,15);printf "%s\t%i\t",N,$2; C++} \
   else {printf "%i\t",$2;C++}}}' \
   kirgenecov.txt > kirgenecov.tsv

Have a look at this file. It shoulld be ready for opening in a spreasheet.
Sure, we have to add a header row. That's not hard.
What about blank cells? Cn do it manuall, but there are other much bigger gene families .. we need the zeros!

Inputting zeros automatically

Adding zeros so in columns 3 and 4 where no values exist
We can use awk again, it has a variable NF, number of fields, essentially the number of columns in each row
We can see that when there is a row missing the final column (NF==4), we need an extra zero
When there are two rows missing the final column (NF==3), we need two extra zeros
printf "%s",$0 allows us print the whole line without the final newline

awk '{printf "%s",$0; \
   if(NF==5) printf "\n"; else if(NF==4) printf "0\n"; else if(NF==3) printf "0\t0\n"}' \
   kirgenecov.tsv > kirgenecov0.tsv

And that's it! All the spreadsheet needs now is a header file, adn we can play around with the number in traditional Excel fashion.
Final question if there's time: Let's say we wanted comma-separated values, what easy step could we take (in Command-line Navigation section).

@@ Line 1: / Line 1: @@
 = Introduction =
+* Rendering output data to a spreadsheet can be very useful, as it seems to make the data more digestible.
+* Spreadsheet programs like Excel, do have alot of calculating and graphics functionality, but can only cope with reduced datasets.
+* In this exercise we'll examine how to get Linux data into a layout and format that Excel can read.
+= Part 1: Combine many to 1 =
+* Although we get many large monolithic sequences, we also get very small ones.
+* many tools will out many small files, which some how need combining
+* In this exercise we have very many files which all must go into one spreadsheet
+= What do we have? =
+* A compressed zip file, <code>utr.zip</code>. Can we look without decompressing?
+* Decompress and take a look at one file. How many columns?
+* How uniform are the files?
+= Exercise:
+Notes: TSV, means tab seperated columns and is very close to being a spreasheet.
+Surely, we can concatenate them all into one file and give it a *.tsv extension
+We can try it:
+ cat *.txt > ourss.tsv
+Let's have a look at it. WHat do you think?
+We'll get rid of the blank lines amd also include the name of the file as a header.
+ for i in `cat f.lst`; do echo $i >> all.txt; sed -e /^$/d $i >> all.txt; echo >> all.txt; done
+Learnings:
+* the empty lines in the files were a it annoying, but empty lines and also blank columns
+are common in old bioinformatics files, it makes visual inspection easier.
+* Can we use the meta data inherent in the file names
+= Part 2: Background =
 * From a reference sequence annotation of an organism with 14 chromosomes, coordinates for a gene family KIR have been isolated.

Difference between revisions of "Hdi2u rendertotsv exercise"

Latest revision as of 13:45, 19 April 2017

Contents

Introduction

Part 1: Combine many to 1

What do we have?

Part 2: Background

Key problem

Look at our data

Start with a very basic for-loop

Crank up the complexity: conditionals

Inputting zeros automatically

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools