Directory Organization Exercise

Aims

Directory organisation has become more necessary because of the many intermediate and output files that genomics pipelines produce. Which result files correspond to which samples? And to which replicate? It is very easy to lose track, especially when you return to the files after a month or two. Directory and file organisation is therefore well worth the effort. Luckily Unix, more than other systems, helps with this.

This exercise bundles 16 projects, packed, in their turn, into 16 zip files.

Data taken from the excellent book "Computational Genomics" by Nello Cristianini and Matthew Hahn (ref. http://www.computational-genomics.net).

Commands

(There are several ways to undertake this task, but this one aims to make use of TAB-COMPLETION and HISTORY.)

This exercise uses shell "for" loops, whose parts are separated by semicolons. The semicolons delimit the individual commands and do not need a space character next to them.
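As a small illustration of the point about semicolons, the sketch below writes the same loop three ways. The items a, b and c are placeholders chosen for this example only.

```shell
# One-line form: semicolons separate the parts of the loop.
for i in a b c; do echo "item $i"; done

# Equivalent multi-line form: newlines take the place of the semicolons.
for i in a b c
do
    echo "item $i"
done

# The spaces after the semicolons are optional; this is the same loop again,
# with its output captured in a variable.
one_line=$(for i in a b c;do echo "item $i";done)
printf '%s\n' "$one_line"
```

The one-line form is the one that is convenient to recall and edit from the shell history.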

Streamlined solution

This is the streamlined version, not the one followed during the course, as it does not require the "heavy lifting" tools of "find" and "vim".

  • cd to ensure we are in our home directory
  • cp $TCH/allprojs.tar . to copy the bundle with all the projects to our home directory.
  • tar -tf allprojs.tar to look inside the tar ("tape archive") file, which contains all the zip files.
  • tar -xf allprojs.tar to extract them all
  • ls to make sure they have been extracted.
  • rm allprojs.tar, by which we delete the tar file, because we have extracted all its contents.
  • for i in $(ls *.zip); do mkdir ${i%.*}; done by which we create the directories into which we plan to mv the zip files
  • for i in $(ls *.zip); do mv $i ${i%.*}; done by which we move the zip files into their corresponding directories.
  • tree, a simple command that verifies that the files have moved into the directories they correspond to.
  • for i in $(ls -d */); do cd $i; unzip ${i%/*}.zip; rm ${i%/*}.zip; cd ..;done, by which we enter each directory in turn, extract the zip file inside, remove it, and then move back into the parent directory.
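The loops above rely on the shell's suffix-stripping expansions ${i%.*} and ${i%/*}. The following sketch tries them out in a throwaway directory. The names proj01.zip and proj02.zip are hypothetical stand-ins (empty files, not real archives), and the loop iterates over the *.zip pattern directly rather than over $(ls *.zip), a variant that also copes with unusual filenames.

```shell
# Work in a scratch directory so that nothing real is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Hypothetical stand-ins for the project archives (empty files, not real zips).
touch proj01.zip proj02.zip

# ${i%.*} removes the shortest trailing match of ".*", i.e. the extension.
i=proj01.zip
echo "${i%.*}"      # prints: proj01

# ${d%/*} removes the shortest trailing match of "/*", here just the final slash.
d=proj01/
echo "${d%/*}"      # prints: proj01

# Create one directory per zip file and move the zip file into it.
for i in *.zip; do
    mkdir -- "${i%.*}"
    mv -- "$i" "${i%.*}/"
done

# List the result recursively (tree gives a nicer view where it is installed).
ls -R
```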

This is the end of the "streamlined" version of this answer to the exercise.

Clunky solution (useful when not all the projects are of equal interest)

This is the method used during class. It was adopted by accident, but actually it does reflect quite a common situation, described here:

Sometimes we have a bunch of projects (read: replicates of an experiment) that we want to group together, but which are not all of equal interest, perhaps because some are contaminated.

In this case (and this is the method we ended up using during the class) we want to build up a file listing of the zip files and edit this list so that it only includes the zip files of interest. This method is useful because, on many occasions, a file listing of our samples gives us the most flexibility over which ones we process and which we don't. So we generate a list of our files, together with their locations (in the form of paths), with the find command.

The first part is exactly the same as in the streamlined version. Let's repeat the last relevant command:

  • tree by which we recognise that all the zip files are inside a directory which matches their filename
  • find . -iname "*.zip" > f.l, by which we direct the output of the find command to a file called f.l, which is our file listing.
  • rm allprojs.tar, if it has not already been deleted, by which we delete the tar file, because we have extracted all its contents.
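To show what such a listing looks like, the sketch below rebuilds a small directory layout in a scratch directory and runs find on it. The project names are hypothetical, the zip files are empty placeholders, and the grep step (with its output file f_keep.l) is only one possible way of excluding a project, added here as an assumption and not something done in class.

```shell
# Build a small stand-in layout in a scratch directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
for p in proj01 proj02 proj03; do
    mkdir "$p"
    touch "$p/$p.zip"     # empty placeholder, not a real archive
done

# List every zip file together with its path; sort makes the order predictable.
find . -iname "*.zip" | sort > f.l
cat f.l

# Suppose proj02 is of no interest: keep every line that does not mention it.
grep -v "proj02" f.l > f_keep.l
cat f_keep.l
```

Each line of f.l starts with ./ followed by the directory and the file name, which is why the listing needs editing before it can serve as a list of directories.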

We then use one of the popular power text editors to edit the file listing. We showed the use of "visual editing" and "macro recording" of vim to accelerate our editing of each line so that we reduced the file to a list of directories. We could have also decided here which directories we wanted to exclude.

Vim is highly capable of editing via regular expressions. However, when asked about regular expressions, class participants did not show any familiarity with them. Regular expressions are a highly efficient method of matching text, but they are clearly an advanced topic, so this was omitted, and we continued with the "clunky" version of rendering the output of find into a list of directories.

vim is very popular among so-called power users and there are many cheatsheets. One such can be found at http://vim.rtorr.com.

  • vim f.l to open the file in the vim editor.
  • Ctrl + V to enter visual block mode, in which we can highlight the areas we want to delete. We select the leading ./ on all lines, see it highlighted, and press x to delete it.
  • We type :wq to save and exit vim. We use history (up-arrow) to repeat the last command and so re-enter vim, loading the same file, this time with every line missing the ./ it had previously. Note that this block deletion is not the same as the highlighting and deleting of text done in Microsoft Word and other Windows applications, because it selects column-wise. This functionality is not usually available. vim can also select character by character along the line (the more normal method of selecting text) by typing v instead of Ctrl + V.
  • 0f/D to delete the remaining / and all subsequent text on a line. 0 means "go to the beginning of the line", f means "find" and / is the character it tries to find, while D means "delete the rest of the line".
  • However, that only deals with one line, so we use vim's macro recording. qq starts recording a macro into register q, and a further q stops the recording. So for the second line we type qq0f/Dq. This combination of actions is now stored in the q register and can be executed on any other line: we move to that line and type @q. After this, @@ will repeat the last executed macro. So this awkward editing of a text file uses some of the vim editor's most advanced functions. You could describe it as semi-automated.
  • That all done, :wq will save and leave vim.
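For readers who already know regular expressions, the same reduction can be sketched without an editor. This is not the method used in class: it swaps the vim edits for two sed substitutions, and the sample lines use hypothetical project names.

```shell
# Sample lines in the form that find produces (hypothetical project names).
listing=$(printf './proj01/proj01.zip\n./proj02/proj02.zip\n')

# The first expression removes the leading "./" (the visual block deletion);
# the second removes the first remaining "/" and everything after it (0f/D).
dirs=$(printf '%s\n' "$listing" | sed -e 's|^\./||' -e 's|/.*$||')
printf '%s\n' "$dirs"
```

Run against the real listing, the same two expressions would be applied to f.l and the result written to a new file, leaving the original listing intact for comparison.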

What we end up with is a file called f.l with a list of the directories in which the zip files are located. So now we can execute

  • for i in $(ls -d */); do cd $i; unzip ${i%/*}.zip; rm ${i%/*}.zip; cd ..;done
  • tree, which gives us a quick view confirming that everything is located and extracted as we wished.
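The loop above walks over every directory and does not read f.l itself. As a sketch of how the edited listing could drive the extraction instead, the loop below reads one directory name per line from f.l. This is an assumption about how the list might be used, not the command run in class. It is written as a dry run: it only prints the commands, and the directory names are hypothetical.

```shell
# Scratch directory with a hypothetical edited listing: one directory per line.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'proj01\nproj03\n' > f.l

# Read f.l line by line. The echo makes this a dry run that prints each
# command instead of running it.
planned=$(while IFS= read -r d; do
    echo "(cd $d && unzip $d.zip && rm $d.zip)"
done < f.l)
printf '%s\n' "$planned"
```

Running each printed line would do the real work. The parentheses run the commands in a subshell, so no cd .. is needed afterwards, and the && means the zip file is only removed if the extraction succeeded.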

Please note how we also delete the zip file after extracting its contents in this exercise, again to avoid duplication.

Apologies for not completing this exercise in time, but you will find that the clunky solution reveals a few tricks with the find and vim commands that are more widely useful. Real-life file and directory organisation is often more complicated than just 16 project zip files!