Course itself


Revision as of 11:21, 4 October 2016

Part One: Linux

Introduction

Genomics studies produce vast amounts of data, usually in the form of very large text files.

What is a text file and how does it differ from other files

  • Basically, a text file is any file that is not binary.
  • A text file can be opened with “general” tools, rather than one specific program.
  • Binary files are only readable by machines and special software.
  • In Unix, commands are usually binary, while their targets and arguments are usually text.
  • An MS Word document might look like a text file, but it is actually binary.
  • In MS Windows and (less often) Mac OS X, files tend to be binary and need a special (often not free) program to open them.
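As a rough illustration of the difference, the sketch below creates a small text file and reads it with general-purpose tools, then peeks at the first bytes of a binary file. The file name reads.txt and its contents are made up for the example, and it assumes that ls is installed as a compiled binary, as it is on typical Linux systems.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)

# Create a two-line text file (hypothetical name and contents).
printf 'ACGT\nTTGA\n' > "$workdir/reads.txt"

# General-purpose tools can display it and count its lines.
cat "$workdir/reads.txt"
line_count=$(wc -l < "$workdir/reads.txt" | tr -d ' ')
echo "lines: $line_count"

# A binary file is different: od dumps the first 32 raw bytes of the ls
# program, most of which are not printable characters.
head -c 32 "$(command -v ls)" | od -c
```

The text file reads back exactly as it was written, while the dump of the binary is mostly escape codes that only make sense to the machine.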

Linux is particularly suited to working with such files, and is therefore arguably one of the most important tools in the bioinformatician’s toolkit. The Linux command-line enables one to view, search and manipulate large text files that are difficult or impossible to handle with applications like MS Word or MS Excel, to write pipelines that perform certain tasks, and to run bioinformatics software for which no web interface is available.

The aim of this course is to provide a basic introduction to Unix, covering the most used commands, as well as briefly touching upon a few slightly more advanced topics such as text processing using awk, writing small shell scripts and some bioinformatics tools (seqtk, bioawk, SAMtools and tabix).

Linux is used by the vast majority of bioinformaticians in the world. The term Linux refers to a large number of (very similar) operating systems, many of which are free. In this one-day course, we will use the cluster belonging to StABU, the Bioinformatics Unit, which actually runs a non-free edition of Linux called Red Hat Enterprise Linux. We shall be connecting to it remotely, via the PuTTY program hosted on your MS Windows machines. This cluster, made up of 11 powerful computers with large amounts of memory, has many of the standard bioinformatics tools and software installed.

The real power of Unix/Linux is the command-line. The command-line is an incredibly powerful way of controlling what your computer does: users submit typed commands in sequence to run different pieces of software. Users may also create a text file containing many commands; this is called a “shell script”.
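To make the idea of a shell script concrete, here is a minimal sketch. The file name hello.sh and the three commands inside it are invented for illustration. The script is an ordinary text file, and running it with sh executes its lines in order.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)

# Write three commands into a text file; the quoted 'EOF' keeps the text literal.
cat > "$workdir/hello.sh" <<'EOF'
echo "Starting"
echo "Today is $(date +%A)"
echo "Done"
EOF

# Ask the shell to run the commands in the file, one after another.
script_output=$(sh "$workdir/hello.sh")
echo "$script_output"
```

Because the script is plain text, it can be viewed and edited with the same general tools as any other text file.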

This course is primarily based on the Edinburgh Genomics “Linux for Genomics” course, which in turn borrows from the Biolinux (another “flavour” of Unix/Linux) introduction tutorial and a presentation by Andy Law (Roslin Institute). Some material is also drawn from Vince Buffalo’s “Bioinformatics Data Skills”, published by O’Reilly in 2015.

Links

Exercises

http://nebc.nerc.ac.uk/nebc_website_frozen/nebc.nerc.ac.uk//support/training/course-notes/past-notes/assembly_taster.html

Using the command-line

The real power of Unix/Linux systems is the command-line (also called the shell or terminal). Many programs and facilities are available through graphical options in Linux, but all programs and facilities can be accessed from the command-line. Sometimes it is easier to do things through graphical interfaces but, equally, sometimes it is easier to use the command-line. This is especially true when you start to work with large numbers of files or are considering automating processes.

The program we’ll use today, PuTTY, is not actually a Linux program, but a communications program that connects to remote computers using a system called SSH, “Secure Shell”. It also provides a screen where your typing appears, and sends that typing to the remote computer it is connected to. Historically, a device that did this was called a teletype or TTY, because it sent typed user input over distances, usually to a receiving computer. Essentially it was a typewriter, but with its own electronics to send the typing down a cable, as well as repeating (echoing) it back to an attached printer or (later) screen.

A teletype that can not only repeat your typing back to you, but also do something with it and send you the result almost instantaneously, is called a terminal or console. One that is not a separate machine, but is implemented in software by the computer itself, is called a terminal emulator. Nevertheless, many of these terms are used interchangeably.

The idea that your typing will be somehow “acted upon” or “processed” is an important one. It means that within the message you type, there are instructions. These instructions also take the form of words; the difference is that we want the instruction’s words to be interpreted. The convention is that the first words of a line, reading from the left, form the command, while the words at the right-hand end of the line are the targets of the command. These targets are usually filenames, and are technically called arguments.
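The convention can be seen in a short sketch. The function count_args and the FASTQ filenames below are hypothetical, introduced only to show how the shell splits a line into a command and its arguments. No files need to exist for this to run.

```shell
# "echo" is the command; "hello" and "world" are its two arguments.
echo hello world

# count_args is a hypothetical helper that reports how many arguments it was given.
count_args() {
    echo "$# argument(s)"
}

# Three words after the command name means three arguments.
count_args sample1.fastq sample2.fastq sample3.fastq

# Quoting keeps words together, so this is a single argument.
count_args "sample1.fastq sample2.fastq"
```

The shell splits the line on spaces before the command ever sees it, which is why spaces matter so much on the command-line.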

The command ls lists files and subdirectories in a directory

By default this command will list the filenames of the files in your current working directory. At the moment this is probably your home directory. In its simplest form, it’s one of the few commands that does not need any arguments, and, as such, is an exception.

If you add a space followed by -l after the ls command, this is called an option, and it alters the behaviour of the command: it will now list the files in your current directory, but with details about them such as who owns them, what their size is, and what can be done with them. These details are highly abbreviated so that they fit onto one line and will be explained later.
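The sketch below shows both forms side by side in a scratch directory, so the listing is short and predictable. The filenames notes.txt and sample1.fastq are made up for the example. The exact columns printed by ls -l vary a little between systems.

```shell
# Create a scratch directory containing two empty files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch notes.txt sample1.fastq

# No option: just the names.
plain_listing=$(ls)
echo "$plain_listing"

# With the -l option: one line per file with permissions, owner, size and date.
long_listing=$(ls -l)
echo "$long_listing"
```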

Exercise

Type the command ls

Type the command ls -l

`ls`
`ls -l`

What do you see that is different?

The command man provides help for a command

There are many options you can provide with the ls command that modify what kind of information is returned to you. By typing man ls you get access to the manual page for this command. Almost all Linux commands have a manual page, and it is worth referring to them to find out what options are available. Many jobs can be made easier by using the right command options.

Open the manual page for ls by typing man ls and look at some of the information provided. Close the man page by typing the letter q.

If you do not know the name of the command to use for a particular job, you can search using man -k. For example, man -k list gives a list of commands that have the word “list” in their description. You can scroll through these, or try making the search more specific. For example, man -k "list directory" will only return four commands. You could then look at the man pages for each command to decide which is the best for the job at hand.

Exercise

Type the command man cp. What does this tell you?

Type the command man mv. What does this tell you?

Type the command man rm. What does this tell you?

What is the difference between cp and mv?

Remember to hit “q” to exit from the man command

`man cp`
`man mv`
`man rm`

Type the command man touch. Without any options, what does it do? Do you think that is useful?

`man touch`

Basic Linux/Unix tips for filenames

Certain characters you are used to should not be used in filenames in Linux/Unix. Others are preferred: letters, numbers, hyphens, underscores and full stops.

Linux/Unix does not deal well with spaces in filenames! Make sure your filenames do not contain them. Filenames with spaces in them are a common problem when transferring files to Linux/Unix from computers running Windows or Mac operating systems. If you end up with filenames with spaces in them, you will need to enclose the entire filename in quotation marks so that Linux/Unix understands that the space is part of the name.

Alternatively, you can “escape” the space using a backslash. For example, if I have a file called my document, Linux/Unix will see this as two filenames, “my” and “document”. But you could write either of the following to make it understand you mean a single file:

  • "my document"
  • my\ document

Our general advice is to change the name of such files to remove the space. A common practice is to replace the space with an underscore. We can use the recently discovered touch and mv commands to try this out:

`touch my document`
`touch "my document"`
`ls -l my document "my document"`
`mv "my document" my_document`
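When many files arrive with spaces in their names, the same renaming can be repeated automatically. The following is only a sketch under stated assumptions. It uses a for loop and the tr command, which are not covered in this section, and the two filenames are invented for the example.

```shell
# Set up a scratch directory with two awkwardly named files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch "my document" "sample 1.txt"

# Loop over every name in this directory that contains a space.
for name in *' '*; do
    # Build the new name by swapping each space for an underscore.
    new_name=$(printf '%s' "$name" | tr ' ' '_')
    # Quotes keep each name as one argument; -- guards against names starting with a hyphen.
    mv -- "$name" "$new_name"
done

ls
```

Note that the quotation marks around "$name" are doing the same job as the quotation marks around "my document" above: without them, each name would be split at the space.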

Assume that everything is case sensitive

Linux/Unix systems consider capital letters different from lower case letters. The filename myFile is not the same as the filename Myfile or myfile. Please note that there are some common naming conventions in place for biological data that you should try to follow.
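A quick way to check this for yourself is sketched below. It assumes a case-sensitive filesystem, which is the norm on Linux. On a case-insensitive filesystem (common on Mac and Windows machines) the three names would collide into a single file.

```shell
# Create three files whose names differ only in capitalisation.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch myFile Myfile myfile

# List them and count how many distinct files now exist.
ls
file_count=$(ls | wc -l | tr -d ' ')
echo "files created: $file_count"
```

On a case-sensitive filesystem the count is 3, confirming that each spelling is a separate file.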

Changing directories

Tab completion for commands and filenames

Command history

Making and removing (empty) directories

Text editors

Reading text files

Copying files

Removing directories

Piping and outputting to files

Grep

What permissions mean

Head and tail

Redirection

Working with zipped data

Some other useful information

Stopping processes

Clearing the terminal

Copying and pasting text

Environment Variables

The FASTQ format

FASTQ on the command line

Using paste to manipulate fastq

awk for data in columns

FASTQ to FASTA conversion

Process Management

Simple shell scripts

If you finish and you are bored

Part Two: Command-line tools for Genomics

Bioawk

seqtk

samtools

tabix