Hdi2u dirorg exercise

Revision as of 16:34, 19 April 2017 by Rf (talk | contribs) (n)

Aims

  • Directory organisation has become more necessary because genomics pipelines produce many intermediate files as well as many output files.
  • Which result-files correspond to which samples? And to which replicate? It is very easy to lose track, especially when you return to the files after a month or two.
  • Therefore we include a somewhat artificial directory and file organisation exercise.
  • allprojs.tar is a file that contains 16 projects in zip format; we must lay them and their files out in an orderly fashion.
  • Data taken from the excellent book "Computational Genomics" by Nello Cristianini and Matthew Hahn (ref. http://www.computational-genomics.net).

Getting into position

  • To ensure we are in our home directory
cd
  • To ensure we have the appropriate file
ls -l hdi2u_files/allprojs.tar
  • Let's look inside the tar-file or tarball, as it's sometimes called
cd hdi2u_files
tar tf allprojs.tar
  • We can go ahead and extract now
tar xf allprojs.tar
  • Type ls to make sure they have been extracted OK.
  • We delete the tar file, because we have extracted all its contents.
rm allprojs.tar
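
Before deleting a tarball it can be reassuring to confirm that every entry it lists now exists on disk. The following is a minimal sketch of that check, not part of the original exercise: the scratch directory, demo.tar and the two empty *_demo.zip files are hypothetical stand-ins for the course data, and the loop assumes entry names without spaces.

```shell
# Work in a scratch directory so nothing real is touched (hypothetical demo data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small stand-in tarball holding two dummy "zip" files.
touch a_demo.zip b_demo.zip
tar cf demo.tar a_demo.zip b_demo.zip
rm a_demo.zip b_demo.zip

# Extract, then count listed entries that are missing on disk.
tar xf demo.tar
missing=0
for f in $(tar tf demo.tar); do
    [ -e "$f" ] || missing=$((missing + 1))
done

# Only remove the tarball when nothing is missing.
if [ "$missing" -eq 0 ]; then
    rm demo.tar
fi
```

The same pattern applies to allprojs.tar: list with tar tf, test each name, and only then run rm.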

Putting files into a loop

  • Although there are not that many zip files, there are enough to discourage us from handling them manually.
  • So we use a loop, which on the command line is provided by the for i in THIS; do THAT; done idiom.
  • Let's first make a directory for each one. We will test first with the echo command.
for i in *.zip; do echo "mkdir ${i%.*}"; done
  • Note how it is sufficient to use the wildcard: ls is not required.
  • Note also that ${i%.*} allows us to create directory names without the .zip extension.
  • But, hang on, there's an extra zip in our directory which is not part of our project. We can be more exact:
for i in *_demo.zip; do echo "mkdir ${i%.*}"; done
  • OK, now we can go ahead:
for i in *_demo.zip; do mkdir "${i%.*}"; done
  • Now we can move the zip files into their respective directories with:
for i in *_demo.zip; do mv "$i" "${i%.*}"; done
  • We can check with ls that everything is OK, but there is a command called tree which gives (slightly) nicer output.
tree | less
  • This simple command verifies to us that the files have moved into the directory they correspond to.
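
The ${i%.*} expansion used in the loops above removes the shortest suffix matching .* from the value of i. As a small illustration (the file names here are hypothetical, chosen only to show the behaviour), compare it with the %% form, which removes the longest match:

```shell
# Hypothetical file names, used only to show what each expansion yields.
name="seq_demo.zip"
nested="data.tar.gz"

short="${name%.*}"      # shortest match of ".*" removed from the end
one="${nested%.*}"      # only the last extension is removed
all="${nested%%.*}"     # longest match: everything from the first dot is removed

echo "$short"   # seq_demo
echo "$one"     # data.tar
echo "$all"     # data
```

For the *_demo.zip files in this exercise the two forms give the same result, since each name contains only one dot.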

Operating on directories

  • We now want to extract the zip files, and can use a for-loop again, but it must act on directories, not files.
  • This will list all directories, but we're only interested in a few of them:
ls -d */
  • While there are a few ways to handle this, we will use the concept of pre-editing a listing to select what we want.
  • So we store the directory listing in a file, delete unwanted directories and operate on the content of the listing.
ls -d */ > dirlist.txt
vim !$
  • We will find ourselves in vim and will use dd to delete the lines we don't want. Then type ZZ to save and get out.
  • Note that this manual check slows things down, but it increases confidence in what we are doing.
  • Now we can build our for-loop with the confidence that we are going to operate on the right directories.
for i in $(cat dirlist.txt); do cd $i; unzip ${i%/*}.zip; rm ${i%/*}.zip; cd ..;done
  • Note how we enter each directory in turn, extract the zip file inside, remove it, and then move back into the parent directory.
  • We can have a look with
tree | less
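
The for i in $(cat dirlist.txt) form splits on any whitespace, and if a cd were to fail, the rm that follows would run in the wrong directory. A more defensive variant, offered here only as a sketch, reads the list line by line and runs each step inside a subshell. The directories and empty .zip files below are hypothetical stand-ins, and the unzip call is left as a comment so the sketch runs without real archives.

```shell
# Scratch area with a hypothetical layout; one name deliberately contains a space.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir "a_demo" "b demo"
touch "a_demo/a_demo.zip" "b demo/b demo.zip"
printf '%s\n' "a_demo/" "b demo/" > dirlist.txt

processed=0
while IFS= read -r d; do
    name="${d%/}"                  # drop the trailing slash
    (
        cd "$d" || exit 1          # leave the subshell if the directory is missing
        [ -f "$name.zip" ] || exit 1
        # A real run would call: unzip "$name.zip" here, before the rm.
        rm "$name.zip"
    ) && processed=$((processed + 1))
done < dirlist.txt

echo "$processed directories processed"
```

Because the cd happens inside the parentheses, the outer shell never leaves the parent directory, so no cd .. is needed.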

And we can zip the whole structure up again with

zip -r dirstruc.zip $(cat dirlist.txt)

And delete the created directory structure itself

for i in $(cat dirlist.txt); do rm -rf $i; done
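
Since rm -rf is unforgiving, the echo-first habit from the earlier loops is worth keeping here too. One possible way to package it, as a sketch only, is a small wrapper that prints commands instead of running them. The run function, the DRY_RUN variable and the two directories are hypothetical names introduced for this illustration.

```shell
# Scratch area with hypothetical directories standing in for the exercise data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir a_demo b_demo
printf '%s\n' "a_demo/" "b_demo/" > dirlist.txt

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# First pass: only show what would be deleted.
DRY_RUN=1
preview=$(while IFS= read -r d; do run rm -rf "$d"; done < dirlist.txt)
echo "$preview"
kept=no
if [ -d a_demo ]; then kept=yes; fi   # directories are untouched after the dry run

# Second pass: actually delete.
DRY_RUN=0
while IFS= read -r d; do run rm -rf "$d"; done < dirlist.txt
```

Reading the preview before switching DRY_RUN to 0 plays the same role as the manual check of dirlist.txt in vim.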
  • Though this was an artificial exercise, it's common to find zip and tar files littering your home directory unless you create directories for them.