Difference between revisions of "One-liners"
m (Rf moved page Awk one-liners and scripts to One-liners)
(5 intermediate revisions by the same user not shown)
Latest revision as of 21:02, 17 April 2017
Introduction
On the command line, it is common for a single command to grow quite complicated and to take a while to build up.
In bioinformatics, one is often forced into this situation because many of the tools have complicated options and parameter settings.
One-liners are also the usual way of building simple for loops, which can turn out to be quite powerful.
Examples
Renaming multiple files
There are various ways of doing this. Here we have a complicated situation: we have made a file listing, copied it, and manicured the names in the copy, and now want to rename each file in the original to its corresponding manicured version. The key difficulty is finding the right corresponding name in the manicured list. Here follows one such example:
for i in $(cat lst0); do j=${i#*_}; R=$(printf '%s_\\w\\+_R%s' "${i%%_*}" "${j%.*}"); mv "$i" "$(grep "$R" lst)"; done
Explanation
- lst0 is the original filelisting
- lst is the manicured filelisting
- here we lean heavily on bash's formatted print statement, printf, to generate a regular expression which can be submitted to grep.
- the regular expression should be designed to guarantee a unique entry in the manicured filelisting.
- printf -v R is another way of assigning the formatted regex to the variable $R.
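Because the rename depends on each regular expression matching exactly one manicured name, it can be worth previewing the result first. The following is a minimal sketch of a dry run, not part of the original one-liner: the function name plan_renames and the sample file names are hypothetical, and it assumes a grep that understands \w and \+ (GNU grep does).

```shell
# Dry run: print the mv commands instead of executing them.
# plan_renames and the sample names below are hypothetical.
plan_renames() {
    orig=$1; manicured=$2
    while read -r i; do
        j=${i#*_}                                    # drop everything up to the first "_"
        R=$(printf '%s_\\w\\+_R%s' "${i%%_*}" "${j%.*}")
        n=$(grep -c "$R" "$manicured")               # how many manicured names match?
        if [ "$n" -eq 1 ]; then
            echo mv "$i" "$(grep "$R" "$manicured")"
        else
            echo "skipping $i: $n matches for $R" >&2
        fi
    done < "$orig"
}

# Made-up listings standing in for lst0 and lst.
tmp=$(mktemp -d)
printf '%s\n' S1_1.fq S1_2.fq > "$tmp/lst0"
printf '%s\n' S1_liver_R1.fq S1_liver_R2.fq > "$tmp/lst"
plan_renames "$tmp/lst0" "$tmp/lst"
```

Once the printed commands look right, removing the echo in front of mv performs the renames.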
One-liners featuring awk
You have a genome coverage file in bedgraph format (with the final, fourth column giving the coverage for a particular section) and would like to find the max coverage value across all the ranges:
awk '{if($4>mxc) mxc=$4} END {print mxc}' v30chronly_s.cov
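Given a genome coverage file, what is the average coverage per base? Each interval's coverage is weighted by its length ($3-$2) and the total is divided by the final endpoint:
awk '{tot=tot+$4*($3-$2); la=$3} END {print tot/la}' v30chronly_s.cov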
Note how in the above, for la (last), we want the third column of the last line, which is the endpoint, but we do not know when that line will occur, so the third column of every line gets assigned to this variable. This will not do if there is more than one chromosome.
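For a file with several chromosomes, one possible approach is to keep the running total and the last endpoint in awk arrays keyed on the first column. This is a sketch under assumptions: the function name per_chrom_mean and the demo data are made up, each interval's coverage is weighted by its length ($3-$2), and each chromosome's intervals are taken to start at 0 and appear in coordinate order.

```shell
# Mean coverage per base for each chromosome (hypothetical helper).
per_chrom_mean() {
    awk '{tot[$1] += $4*($3-$2); la[$1] = $3}
         END {for (c in tot) print c, tot[c]/la[c]}' "$1" | sort
}

# Made-up bedgraph with two chromosomes.
demo=$(mktemp)
printf 'chr1\t0\t10\t2\nchr1\t10\t20\t4\nchr2\t0\t5\t6\n' > "$demo"
per_chrom_mean "$demo"
# chr1 3
# chr2 6
```

The sort is only there because awk's for-in loop visits array keys in no guaranteed order.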
Given a multi-FASTA file, you can already list the ID lines. Now you want a particular sequence. The following prints the fourth sequence:
awk '{if(/^>/) c++; if(c==4) print}' hexaseqs.fasta
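The same idea can be parametrised so the record number is not hard-coded, and awk can stop reading as soon as the wanted record has been printed, which matters on large files. A minimal sketch, where the function name nth_seq and the demo sequences are hypothetical:

```shell
# Print the n-th record of a multi-FASTA file, then stop reading.
nth_seq() {
    awk -v n="$1" '/^>/ {c++} c==n {print} c>n {exit}' "$2"
}

# Made-up three-record file.
fa=$(mktemp)
printf '>a\nAC\n>b\nGG\nTT\n>c\nCA\n' > "$fa"
nth_seq 2 "$fa"
# >b
# GG
# TT
```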
With an even better idea of the lines of interest, it is possible to use a head and tail combination. These two tools are highly optimised and may run quickly on very big files. However, you do need the line numbers as given by grep -n ">". If the indicated lines are n1 and n2, then:
head -n $((n2-1)) <file> | tail -n +$n1
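Putting the pieces together, the line numbers n1 and n2 can be taken from the grep -n output and fed to head and tail. This is a sketch on made-up data; it assumes the wanted record is followed by another record, so that n2 exists.

```shell
# Made-up three-record file; we extract the second record.
fa=$(mktemp)
printf '>a\nAC\n>b\nGG\nTT\n>c\nCA\n' > "$fa"

# grep -n prints "line:text"; pick the 2nd and 3rd header lines.
n1=$(grep -n '^>' "$fa" | sed -n '2p' | cut -d: -f1)   # header of the wanted record
n2=$(grep -n '^>' "$fa" | sed -n '3p' | cut -d: -f1)   # header of the next record

rec=$(head -n $((n2-1)) "$fa" | tail -n +"$n1")
echo "$rec"
# >b
# GG
# TT
```

For the last record in a file there is no n2, and tail -n +"$n1" on its own would do.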