Key Aspects of R Exercise
Aims
R is a powerful programming language and software environment for statistical computing and graphics. It is also open source, meaning that it is free to use and has a wide community of developers who have produced many add-on packages and libraries that can be easily accessed.
In this part you will learn to:
- use basic R commands
You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of Edinburgh:
- R: http://www.r-project.org
You will use the following files:
- genomes.csv: information about a number of animal genomes
> Type text like this in the terminal at the > command prompt, then press the [Enter] key to run the command. For most commands the output is shown below the command.
Go to the directory 08_Introduction_to_r:
$ cd ~/Data/08_Introduction_to_r
The basics
To start, type R and hit [Enter]:
$ R
First, check what the working directory of R is using the getwd() function:
> getwd()
[1] "/home/training"
The working directory can be changed using setwd().
R can be used as a glorified calculator:
> 5*5 + 10/2
[1] 30
R respects the standard order of operations: multiplication and division are carried out before addition and subtraction. Operations enclosed in round brackets are carried out first.
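As a quick illustration of these rules, the sketch below reuses the numbers from the calculator example. The exponentiation operator ^ is an extra that is not used elsewhere in this exercise.

```r
# Multiplication and division are carried out before addition
5*5 + 10/2      # 25 + 5 = 30

# Round brackets force the addition to happen first
5*(5 + 10)/2    # 5*15/2 = 37.5

# Exponentiation (^) is carried out before multiplication
2*3^2           # 2*9 = 18
```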
Values can be assigned to a variable using the <- assignment operator:
> a <- 5
Alternatively, it is also possible to write 5 -> a or a = 5, but a <- 5 is the conventional and preferred way. The distinctive <- operator is also a useful reminder that you are working in the R environment.
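The three forms can be compared side by side. This is only a small sketch; the variable names b and d are made up for the illustration.

```r
a <- 5    # conventional form
5 -> b    # rightward assignment, rarely used
d = 5     # also accepted at the command prompt

# All three variables hold the same value
identical(a, b) && identical(b, d)   # TRUE
```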
To print out a variable, just type it:
> a
[1] 5
R functions are invoked by their name, followed by parentheses that contain zero or more arguments.
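For example, the calls below use zero, one and two arguments. The functions sqrt(), round() and Sys.Date() are part of base R but are not otherwise used in this exercise; they are chosen here only to illustrate the calling syntax.

```r
sqrt(16)                     # one argument, matched by position: 4
round(3.14159, digits = 2)   # a positional and a named argument: 3.14
Sys.Date()                   # zero arguments; the parentheses are still required
```

Typing a function name without the parentheses prints the function's definition instead of running it.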
All existing variables can be listed using the ls() function:
> ls()
[1] "a"
Variables can be removed using the rm() function:
> rm(a)
> ls()
character(0)
R contains extensive documentation, which can be accessed using the ? operator or help() function:
> ?ls
> help(ls)
The documentation is also available online: http://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.
Data types
There are various data types in R. One of the most fundamental is the vector, which is an array or a list of numbers or words. That is only a description, though; in R it is best to use the term vector for these things. For example, a single number is seen by R as a vector of size 1.
R's data frames are also fundamental, though they are made up of vectors. They are used to store tables of data.
As analyses get more complicated, many packages assemble their own data types made up of multiple vectors and data frames, often of different sizes. R often calls these objects, but they can also be called data structures, which is more descriptive. By analogy, we could call vectors apartments, data frames apartment buildings, and data structures apartment building complexes.
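A rough way to picture such a data structure is a base R list, which can hold vectors and data frames of different sizes side by side. This is only a sketch: real packages often define their own classes, and the name result and its component names are made up for the illustration.

```r
# A list bundling components of different types and sizes
result <- list(
  species     = c("Human", "Mouse"),            # character vector
  chromosomes = c(46, 40),                      # numeric vector
  table       = data.frame(size = c(3.50, 3.25))  # a data frame
)

names(result)        # the names of the three components
result$chromosomes   # one component, retrieved with the $ operator
str(result)          # a compact overview of the whole structure
```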
Vectors
A vector is a sequence of data elements of the same basic type (e.g. numbers or character strings). The elements of a vector are called components.
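Because all components must share one basic type, R silently converts mixed input to a common type. A minimal sketch (the variable names nums and mixed are made up for the illustration):

```r
nums <- c(1, 2, 3)
class(nums)     # "numeric"

# Mixing numbers and text turns every component into a character string
mixed <- c(1, "two", 3)
class(mixed)    # "character"
mixed           # "1" "two" "3"
```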
Vectors can be created in several ways, i.e. by specifying a range or by using the c() or seq() function:
> x <- 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y <- c(1,1,2,3,5,8,13,21,34,55)
> y
 [1]  1  1  2  3  5  8 13 21 34 55
> z <- seq(from = 2, to = 100, by = 2)
> z
 [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
[20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
[39]  78  80  82  84  86  88  90  92  94  96  98 100
Note that the bracketed number at the start of each output line is the index of the first component shown on that line.
The components of a vector can be accessed through their indices:
> z[1]
[1] 2
> z[c(1,10,20)]
[1]  2 20 40
> z[1:20]
 [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
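Indices do not have to be positive positions. As a small sketch going slightly beyond the commands above, a negative index drops components and a logical condition keeps only the components for which it is true:

```r
z <- seq(from = 2, to = 100, by = 2)

length(z)         # number of components: 50
z[length(z)]      # the last component: 100
head(z[-1], 3)    # a negative index drops the first component: 4 6 8
z[z > 94]         # a logical condition selects matching components: 96 98 100
```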
Data frames
Data frames are the most common data type in R. A data frame is basically a table that contains vectors of equal length. The key is that these vectors are what you would normally think of as the columns, not the rows, of the table. The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell. The data in the columns of a data frame can be of various types, e.g. one column might contain numbers while another might contain text. The columns and rows of a data frame are also referred to as variables and observations, respectively.
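To make the "table of column vectors" idea concrete, a small data frame can be assembled by hand with data.frame(). This sketch uses three rows taken from the genomes table shown later in this exercise; the name genomes_small is made up for the illustration.

```r
# Each argument is one column vector; all vectors have the same length
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  class       = c("Amphibia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36, 46, 42),
  size        = c(3.09, 0.40, 3.05),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat")
)

genomes_small                  # print the table
sapply(genomes_small, class)   # each column has its own type
```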
Creating a data frame
In the following example you will create a data frame from a comma-separated (.csv) file. The genomes.csv file (which is in the directory /home/training/08_Introduction_to_r) contains information about a number of animal genomes from the Animal Genome Size Database, i.e. the scientific (Latin) name of the species, the taxonomic class it belongs to, the number of chromosomes and the genome size:
,name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61
...
Note that the first line in the file contains the header names and that the lines containing the actual data all start with the common name of the species ("African clawed frog", "Atlantic cod" etc.).
Create a data frame named "genomes" from the file genomes.csv using read.csv():
> genomes <- read.csv("genomes.csv", row.names = 1)
where:
- genomes.csv: the file from which the data frame is created
- row.names = 1: the contents of the first column in the genomes.csv file ("African clawed frog", "Atlantic cod" etc.) are used as row names
The function read.csv is a more specific version of the generic read.table function, with a number of default arguments set specifically for use with comma-separated files. There is also a version for use with tab-separated files, i.e. read.delim.
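The relationship between read.csv and read.table can be sketched as follows. To keep the example self-contained it reads three lines of the file from a text string (via the text argument) rather than from genomes.csv itself; the variable names csv_text, a and b are made up for the illustration.

```r
# The header line and the first three data lines of genomes.csv
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05"

# read.csv sets header = TRUE and sep = "," for you
a <- read.csv(text = csv_text, row.names = 1)

# With read.table these have to be spelled out
b <- read.table(text = csv_text, header = TRUE, sep = ",", row.names = 1)

dim(a)
dim(b)
all.equal(a, b)   # compares the two data frames
```

For simple input like this the two calls should give the same table; the functions differ in a few other defaults (for example how quotes and comment characters are treated), which can matter for messier files.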
Display the internal structure of the genomes data frame using str():
> str(genomes)
'data.frame':   20 obs. of  4 variables:
 $ name       : Factor w/ 20 levels "Aedes aegypti",..: 20 10 19 18 14 9 5 11 6 17 ...
 $ class      : Factor w/ 6 levels "Amphibia","Aves",..: 1 5 4 4 5 4 4 2 4 4 ...
 $ chromosomes: int  36 46 42 48 48 38 60 78 78 54 ...
 $ size       : num  3.09 0.4 3.05 3.76 3.61 2.91 3.7 1.25 2.8 3.06 ...
where:
- Factor: a vector whose elements can take on one of a specific set of values ("levels")
- int: integer (whole number)
- num: numeric (decimal number)
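The idea of a factor can be tried out on a short vector. In this sketch the four class names are taken from the table above and the variable name classes is made up for the illustration; factor(), levels() and table() are base R functions.

```r
classes <- factor(c("Amphibia", "Osteichthyes", "Mammalia", "Mammalia"))

levels(classes)       # the set of allowed values: "Amphibia" "Mammalia" "Osteichthyes"
as.integer(classes)   # each element stored as the number of its level: 1 3 2 2
table(classes)        # how often each level occurs
```

This is why the str() output shows small integers after the level names: each value is stored as the number of its level.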
Retrieve the dimensions of the data frame using dim():
> dim(genomes)
[1] 20  4
This means that the data frame consists of 20 rows (observations) and 4 columns (variables). Note that the header and the column with the row names are not included in this count.
To print the whole data frame just type its name:
> genomes
                                             name        class chromosomes size
African clawed frog                Xenopus laevis     Amphibia          36 3.09
Atlantic cod                         Gadus morhua Osteichthyes          46 0.40
Brown rat                       Rattus norvegicus     Mammalia          42 3.05
Chimpanzee                        Pan troglodytes     Mammalia          48 3.76
Coelacanth                    Latimeria chalumnae Osteichthyes          48 3.61
Domestic cat                          Felis catus     Mammalia          38 2.91
Domestic cattle                        Bos taurus     Mammalia          60 3.70
Domestic chicken                Gallus domesticus         Aves          78 1.25
Domestic dog                     Canis familiaris     Mammalia          78 2.80
Duck-billed platypus     Ornithorhynchus anatinus     Mammalia          54 3.06
Fruit fly                 Drosophila melanogaster      Insecta           8 0.18
Green anole                   Anolis carolinensis     Reptilia          36 3.06
Human                                Homo sapiens     Mammalia          46 3.50
Malaria mosquito                Anopheles gambiae      Insecta           6 0.27
Mallard                        Anas platyrhynchos         Aves          80 1.44
Mouse                                Mus musculus     Mammalia          40 3.25
Three-spined stickleback   Gasterosteus aculeatus Osteichthyes          42 0.70
Turkey                        Meleagris gallopavo         Aves          80 1.40
Yellow fever mosquito               Aedes Aegypti      Insecta           6 0.81
Zebrafish                             Danio rerio Osteichthyes          48 1.80
To print only the first or last rows use head() and tail(), respectively:
> head(genomes)
                                   name        class chromosomes size
African clawed frog      Xenopus laevis     Amphibia          36 3.09
Atlantic cod               Gadus morhua Osteichthyes          46 0.40
Brown rat             Rattus norvegicus     Mammalia          42 3.05
Chimpanzee              Pan troglodytes     Mammalia          48 3.76
Coelacanth          Latimeria chalumnae Osteichthyes          48 3.61
Domestic cat                Felis catus     Mammalia          38 2.91
> tail(genomes)
                                           name        class chromosomes size
Mallard                      Anas platyrhynchos         Aves          80 1.44
Mouse                              Mus musculus     Mammalia          40 3.25
Three-spined stickleback Gasterosteus aculeatus Osteichthyes          42 0.70
Turkey                      Meleagris gallopavo         Aves          80 1.40
Yellow fever mosquito             Aedes Aegypti      Insecta           6 0.81
Zebrafish                           Danio rerio Osteichthyes          48 1.80
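By default head() and tail() show six items, which is why six rows appear above. Both accept an n argument to change this. The sketch below uses a plain vector so that it runs without genomes.csv; the same argument works for data frames.

```r
z <- seq(from = 2, to = 100, by = 2)

length(head(z))   # 6: the default number of items
head(z, n = 3)    # the first three: 2 4 6
tail(z, n = 3)    # the last three: 96 98 100
```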
Retrieving data
The column vectors can be retrieved in various ways, i.e. using double brackets [[]], the $ operator, or single brackets []:
> genomes[[3]]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48
> genomes[["chromosomes"]]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48
> genomes$chromosomes
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48
> genomes[,3]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48
> genomes[,"chromosomes"]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48
Note that when using single brackets, the column number (or name) is preceded by a comma. Leaving the row position before the comma empty acts as a wildcard, meaning that the value in the indicated column is retrieved for every row.
Column slices can be retrieved using single brackets []:
> genomes[3]
                         chromosomes
African clawed frog               36
Atlantic cod                      46
Brown rat                         42
...
> genomes["chromosomes"]
                         chromosomes
African clawed frog               36
Atlantic cod                      46
Brown rat                         42
...
> genomes[c(3,4)]
                         chromosomes size
African clawed frog               36 3.09
Atlantic cod                      46 0.40
Brown rat                         42 3.05
...
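The difference between a column vector and a column slice can be checked with class(). This sketch builds a three-row stand-in for the genomes data frame (the name genomes_small is made up) so that it runs on its own; drop = FALSE is a standard argument of single-bracket indexing that is not otherwise used in this exercise.

```r
genomes_small <- data.frame(
  chromosomes = c(36, 46, 42),
  size        = c(3.09, 0.40, 3.05),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat")
)

class(genomes_small[["chromosomes"]])   # "numeric": a plain vector
class(genomes_small[, "chromosomes"])   # "numeric": also a plain vector
class(genomes_small["chromosomes"])     # "data.frame": a one-column slice

# drop = FALSE keeps the data frame structure when using the comma form
class(genomes_small[, "chromosomes", drop = FALSE])   # "data.frame"
```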
As can row slices:
> genomes[13,]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5
> genomes["Human",]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5
And combinations of both:
> genomes[13,3]
[1] 46
> genomes["Human","chromosomes"]
[1] 46
Subsets of data can be selected using which():
> genomes[which(genomes$class == "Mammalia"),]
                                         name    class chromosomes size
Brown rat                   Rattus norvegicus Mammalia          42 3.05
Chimpanzee                    Pan troglodytes Mammalia          48 3.76
Domestic cat                      Felis catus Mammalia          38 2.91
Domestic cattle                    Bos taurus Mammalia          60 3.70
Domestic dog                 Canis familiaris Mammalia          78 2.80
Duck-billed platypus Ornithorhynchus anatinus Mammalia          54 3.06
Human                            Homo sapiens Mammalia          46 3.50
Mouse                            Mus musculus Mammalia          40 3.25
> genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46),]
                                         name    class chromosomes size
Chimpanzee                    Pan troglodytes Mammalia          48 3.76
Domestic cattle                    Bos taurus Mammalia          60 3.70
Domestic dog                 Canis familiaris Mammalia          78 2.80
Duck-billed platypus Ornithorhynchus anatinus Mammalia          54 3.06
> genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46), "name"]
[1] "Pan troglodytes"          "Bos taurus"
[3] "Canis familiaris"         "Ornithorhynchus anatinus"
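It can help to look at the pieces of such a command separately. The sketch below uses a four-row stand-in for the genomes data frame (the names genomes_small and is_big_mammal are made up) and also shows two alternatives that are not used elsewhere in this exercise: indexing with the logical vector directly, and the base R function subset().

```r
genomes_small <- data.frame(
  class       = c("Mammalia", "Mammalia", "Aves", "Mammalia"),
  chromosomes = c(42, 48, 78, 60),
  row.names   = c("Brown rat", "Chimpanzee", "Domestic chicken", "Domestic cattle")
)

# The condition gives one TRUE or FALSE per row
is_big_mammal <- genomes_small$class == "Mammalia" & genomes_small$chromosomes > 46
is_big_mammal          # FALSE TRUE FALSE TRUE

# which() turns the TRUE positions into row numbers
which(is_big_mammal)   # 2 4

# Three ways to select the same rows
by_which   <- genomes_small[which(is_big_mammal), ]
by_logical <- genomes_small[is_big_mammal, ]
by_subset  <- subset(genomes_small, class == "Mammalia" & chromosomes > 46)

rownames(by_which)     # "Chimpanzee" "Domestic cattle"
```

One practical difference: which() silently skips rows where the condition is NA (missing), whereas indexing with the logical vector directly returns an all-NA row for each of them.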
Logical operators
==   is equal to
!=   is not equal to
>    is greater than
<    is less than
>=   is greater than or equal to
<=   is less than or equal to
&    and
|    or
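These operators work on whole vectors at once and return one TRUE or FALSE per component. A short sketch using the first four chromosome counts from the table:

```r
chromosomes <- c(36, 46, 42, 48)

chromosomes >= 46                      # FALSE  TRUE FALSE  TRUE
chromosomes != 46                      #  TRUE FALSE  TRUE  TRUE
chromosomes < 40 | chromosomes > 46    #  TRUE FALSE FALSE  TRUE
chromosomes > 40 & chromosomes < 48    # FALSE  TRUE  TRUE FALSE
```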
Plotting
Plots are an important part of statistics, so it is no surprise that R has many plotting facilities.
Let's plot the genome size against the number of chromosomes for the 20 species in our genomes data frame (i.e. the genome size on the y-axis and the number of chromosomes on the x-axis) using plot():
> plot(genomes$chromosomes, genomes$size)
where:
- genomes$chromosomes: the x coordinates of points in the plot
- genomes$size: the y coordinates of points in the plot
You can make plots as elaborate as you want.
To add a title for the plot and labels for the axes, and to base the colour of the data points on the taxonomic class to which each species belongs:
> plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)
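where:
- main = "Genome Size vs. Number of Chromosomes for 20 Animal Species": an overall title for the plot
- xlab = "Number of Chromosomes": a title for the x-axis
- ylab = "Genome Size (Gb)": a title for the y-axis
- col = genomes$class: variable on which the colour of the data points is based. This works because class is a factor, as in the str() output above; from R 4.0 onwards read.csv() reads text columns as character vectors by default, in which case use col = as.factor(genomes$class) instead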
To add labels to the 20 points in the plot, use text():
> text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)
where:
- genomes$chromosomes, genomes$size: numeric vectors of coordinates where the text should be written
- row.names(genomes): character vector specifying the text to be written
- cex = 0.5: character size
- pos = 1: a position specifier for the text (1 means below)
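When points are coloured by class it is useful to show which colour belongs to which class. One possible way, sketched here with five rows from the table, is the base R function legend(), which is not otherwise covered in this exercise. The variable name cls is made up for the illustration.

```r
# Five species from the table; the class is stored as a factor
cls         <- factor(c("Amphibia", "Osteichthyes", "Mammalia", "Mammalia", "Aves"))
chromosomes <- c(36, 46, 42, 48, 78)
size        <- c(3.09, 0.40, 3.05, 3.76, 1.25)

# A factor passed to col is translated to colour numbers 1, 2, 3, ...
plot(chromosomes, size, col = cls,
     xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)")

# Use the same numbers in the legend so that the colours match
legend("topright", legend = levels(cls),
       col = seq_along(levels(cls)), pch = 1)
```

The legend colours match the points because both are derived from the level numbers of the same factor.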
To save the plot as a Portable Network Graphics (.png) file, first create a file using png():
> png('size_vs_chrom.png')
Then run the commands to generate the plot:
> plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)
> text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)
And finally close the file using dev.off():
> dev.off()
This saves the file size_vs_chrom.png in the working directory (/home/training).
Writing to files
The content of a data frame can be written to a file using write.table():
> write.table(genomes, "genomes.txt")
This saves the file genomes.txt in the working directory (/home/training).
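By default write.table() separates the values with spaces and puts quotes around text. If a tab-separated file is preferred, the sep, quote and col.names arguments can be set, as in the sketch below. It uses a two-row stand-in for the genomes data frame and writes to a temporary directory; the names genomes_small, out_file and genomes_small.tsv are made up for the illustration.

```r
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua"),
  chromosomes = c(36L, 46L),
  row.names   = c("African clawed frog", "Atlantic cod")
)

out_file <- file.path(tempdir(), "genomes_small.tsv")

# sep = "\t": tab-separated; quote = FALSE: no quotes around text;
# col.names = NA: leave an empty header cell above the row names
write.table(genomes_small, out_file, sep = "\t", quote = FALSE, col.names = NA)

readLines(out_file)   # look at the raw lines that were written

# Read the file back in, using the first column as row names
back <- read.delim(out_file, row.names = 1)
back
```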
To quit R, use quit():
> quit()
Useful resources
- The R Project for Statistical Computing: http://www.r-project.org
- Quick R Homepage: http://www.statmethods.net
- A (very) short introduction to R: http://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf
- An Introduction to R (long!): http://cran.r-project.org/doc/manuals/R-intro.html
- Bioconductor: http://www.bioconductor.org
- Stack Overflow (forum): http://stackoverflow.com
- Biostars (forum): http://www.biostars.org