Hdi2u dirorg exercise
Aims
- Directory organisation has become more necessary, due to the many intermediate and output files that genomics pipelines produce.
- Which result-files correspond to which samples? And to which replicate? It is very easy to lose track, especially when you return to the files after a month or two.
- Therefore we include a somewhat artificial directory and file organisation exercise.
- allprojs.tar is a tar file that contains 16 projects in zip format; we must lay them and their files out in an orderly fashion.
- Data taken from the excellent book "Computational Genomics" by Nello Cristianini and Matthew Hahn (ref. http://www.computational-genomics.net).
Getting into position
- To ensure we are in our home directory
cd
- To ensure we have the appropriate file
ls -l hdi2u_files/allprojs.tar
- Let's look inside the tar-file or tarball, as it's sometimes called
cd hdi2u_files
tar tf allprojs.tar
- We can go ahead and extract now
tar xf allprojs.tar
- Type ls to make sure they have been extracted OK.
- We delete the tar file, because we have extracted all its contents.
rm allprojs.tar
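A more cautious variant of the last two steps checks that every entry listed by tar tf exists on disk before the tarball is removed. The following is only a sketch: it runs in a scratch directory on a small stand-in tarball, and the names demo.tar, a_demo.zip and b_demo.zip are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Build a small stand-in tarball (hypothetical file names).
touch a_demo.zip b_demo.zip
tar cf demo.tar a_demo.zip b_demo.zip
rm a_demo.zip b_demo.zip

# Extract, then check that every listed entry now exists on disk.
tar xf demo.tar
missing=0
for f in $(tar tf demo.tar); do
    [ -e "$f" ] || { echo "missing: $f"; missing=$((missing + 1)); }
done

# Remove the tarball only if nothing is missing.
if [ "$missing" -eq 0 ]; then rm demo.tar; fi
```

The loop over $(tar tf ...) assumes file names without spaces, which holds for the names used here.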
Putting files into a loop
- Although there are not so many zip files, there are enough to discourage us from handling them manually.
- So we use a loop, which on the command line is served by the
for i in THIS; do THAT; done
idiom.
- Let's first make a directory for each one; we will test first with the echo command.
for i in *.zip; do echo "mkdir ${i%.*}"; done
- Note how it is sufficient to use the wildcard: ls is not required.
- Note also that ${i%.*} allows us to create directory names without the .zip extension.
- But, hang on, there's an extra zip in our directory which is not part of our project. We can be more exact:
for i in *_demo.zip; do echo "mkdir ${i%.*}"; done
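To see what the two wildcards and the ${i%.*} expansion do, they can be tried out in a scratch directory. This is a small sketch; the file names p1_demo.zip, p2_demo.zip and extra.zip are hypothetical stand-ins for the real ones.

```shell
# Work in a throwaway directory with made-up file names.
scratch=$(mktemp -d)
cd "$scratch"
touch p1_demo.zip p2_demo.zip extra.zip

# The shell expands the wildcard itself; no ls is involved.
all=$(echo *.zip)          # every zip, including extra.zip
demo=$(echo *_demo.zip)    # only the project zips
echo "all:  $all"
echo "demo: $demo"

# ${i%.*} removes the shortest trailing match of ".*", i.e. the extension.
i=p1_demo.zip
stem=${i%.*}
echo "$stem"
```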
- OK, now we can go ahead:
for i in *_demo.zip; do mkdir "${i%.*}"; done
- Now we can insert the zip files into their respective directories with:
for i in *_demo.zip; do mv "$i" "${i%.*}"; done
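The mkdir and mv steps can also be done in a single pass. The sketch below assumes the same naming pattern and uses hypothetical file names in a scratch directory; the && makes sure a file is only moved if its directory could be created.

```shell
# Work in a throwaway directory with made-up file names.
scratch=$(mktemp -d)
cd "$scratch"
touch p1_demo.zip p2_demo.zip

for i in *_demo.zip; do
    d=${i%.*}                        # directory name: file name minus .zip
    mkdir -p "$d" && mv "$i" "$d/"   # move only if mkdir succeeded
done
```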
- We can check with ls that everything is OK, but there is a command called tree which gives (slightly) nicer output.
tree | less
- This simple command confirms that each file has moved into its corresponding directory.
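If tree is not installed, a short loop can make the same check. This is a sketch under the assumption that every project directory ends in _demo and should hold a zip file of the same name; the layout is recreated here with hypothetical names.

```shell
# Recreate the expected layout in a throwaway directory (made-up names).
scratch=$(mktemp -d)
cd "$scratch"
for n in p1_demo p2_demo; do mkdir "$n"; touch "$n/$n.zip"; done

# Count the directories that contain a zip file named after them.
ok=0
for d in *_demo/; do
    name=${d%/}                      # drop the trailing slash
    if [ -f "$d$name.zip" ]; then
        ok=$((ok + 1))
    else
        echo "no $name.zip in $d"
    fi
done
echo "$ok directories hold their zip file"
```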
Operating on directories
- We now want to extract the zip files, and can use a for-loop again, but it must act on directories, not files.
- This will list all directories, but we're only interested in a few of them:
ls -d */
- While there are a few ways to handle this, we will use the concept of pre-editing a listing to select what we want.
- So we store the directory listing in a file, delete unwanted directories and operate on the content of the listing.
ls -d */ > dirlist.txt
vim !$
- We will find ourselves in vim and will use dd to delete the lines we don't want. Then type ZZ to save and get out.
- Note that this manual check slows things down, but helps confidence in what we are doing.
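When the wanted directories share a naming pattern, the listing can also be produced without manual editing. The sketch below assumes that all project directories end in _demo, as the zip file names suggest; the directory names are hypothetical.

```shell
# Work in a throwaway directory with made-up directory names.
scratch=$(mktemp -d)
cd "$scratch"
mkdir p1_demo p2_demo other_stuff

# Write only the directories ending in _demo, one per line.
printf '%s\n' *_demo/ > dirlist.txt
cat dirlist.txt
```

Looking at the listing with cat or less before using it keeps some of the confidence that the manual edit provides.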
- Now we can build our for-loop with the confidence that we are going to operate on the right directories.
for i in $(cat dirlist.txt); do cd $i; unzip ${i%/*}.zip; rm ${i%/*}.zip; cd ..;done
- Note how we enter each directory in turn, extract the zip file inside, remove it, and then move back into the parent directory.
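As with mkdir earlier, the loop can be rehearsed with echo before it is run for real. The sketch below prints a slightly more defensive form of each step, which is an assumption of this example rather than part of the exercise: the commands sit in a subshell, so no cd .. is needed, and are chained with &&, so rm only runs if cd and unzip both succeed. The listing is a hypothetical stand-in.

```shell
# Work in a throwaway directory with a made-up listing.
scratch=$(mktemp -d)
cd "$scratch"
printf '%s\n' p1_demo/ p2_demo/ > dirlist.txt

# Dry run: print the commands instead of executing them.
while read -r d; do
    name=${d%/}                      # drop the trailing slash
    echo "(cd $d && unzip $name.zip && rm $name.zip)"
done < dirlist.txt > plan.txt
cat plan.txt
```

Once the printed commands look right, the echo and its quotes can be removed to run them.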
- We can have a look with
tree | less
- And we can zip the whole structure up again with
zip -r dirstruc.zip $(cat dirlist.txt)
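Before anything is deleted, it may be worth recording which files the listed directories contain, so that the archive can be checked against that record. A minimal sketch, using a hypothetical layout in a scratch directory; manifest.txt is a name chosen for this example.

```shell
# Recreate a small layout in a throwaway directory (made-up names).
scratch=$(mktemp -d)
cd "$scratch"
mkdir p1_demo p2_demo
touch p1_demo/a.txt p2_demo/b.txt p2_demo/c.txt
printf '%s\n' p1_demo/ p2_demo/ > dirlist.txt

# Record every file under the listed directories.
find $(cat dirlist.txt) -type f | sort > manifest.txt
nfiles=$(grep -c . manifest.txt)
echo "$nfiles files recorded in manifest.txt"
```

The names in manifest.txt can then be compared by eye with the output of unzip -l dirstruc.zip.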
- And delete the created directory structure itself
for i in $(cat dirlist.txt); do rm -rf $i; done
- Though this was an artificial exercise, it's common to find zip and tar files littering your home directory unless you create directories for them.