Difference between revisions of "Key Aspects of R Exercise"


Revision as of 11:02, 10 May 2017

Aims

R is a powerful programming language and software environment for statistical computing and graphics. It is also open source, meaning that it is free to use and has a wide community of developers who have produced many scientific add-on packages and libraries that can be easily accessed.

We shall be using the basic R interpreter, and not RStudio, which in fact is the same interpreter but in a more visually appealing guise.

In this part you will learn to:

  • use basic R commands
  • find your way around.

We shall practice with the following toy file:

  • genomes.csv: information about a number of animal genomes

Getting in and out

To start, type R and hit Enter:

R

With this command you move out of the Linux command line and into the "R interpreter". You can tell by the prompt, which is now a > and not a $.

To quit R, use quit(), or, shorter, q():

> q()

Finding one's location

OK, now go back in and check which working directory we are in by using the getwd() function:

getwd()

See what files and directories are there:

dir()

The working directory can be changed using setwd(), which, like the Linux command line, supports tab-completion. Type

setwd("i2<TAB>

This should complete to

setwd("i2rda_data

which you finish off with a ")
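
A minimal sketch of the same round trip, assuming nothing about your own directories: it uses R's session temporary directory (tempdir()) as a stand-in for i2rda_data, and the variable name original_dir is only illustrative.

```r
# Remember where we started so that we can come back
original_dir <- getwd()

# Move to a directory that is guaranteed to exist in any R session
# (a stand-in for i2rda_data, which may not exist on your machine)
setwd(tempdir())
print(getwd())

# List the files and directories in the new working directory
print(dir())

# Return to the starting directory
setwd(original_dir)
print(identical(getwd(), original_dir))
```

Storing the result of getwd() before calling setwd() is a simple way to avoid getting lost when moving between directories.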

Calculator

R can be used as a glorified calculator:

5*5 + 10/2
[1] 30

R respects standard mathematical rules: multiplication and division are carried out before addition and subtraction. Operations can be grouped with round brackets, so they are carried out first.

Values can be assigned to a variable using the <- assignment operator:

a <- 5

Alternatively, it is also possible to write 5 -> a or a = 5, but a <- 5 is the traditional way. Preferring <- over = also helps to remind you that you are working in the R environment.

To print out a simple variable, you can just type it.

a
[1] 5

Be aware that more complicated variables may throw out all their contents on the screen, so this simple method of seeing a variable is not always the best.
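
To tie together the calculator rules and the assignment forms described in this section, here is a short sketch. The variable names (no_brackets, with_brackets, b, c_val) are made up for illustration.

```r
# Multiplication and division are carried out before addition
no_brackets <- 5*5 + 10/2        # 25 + 5

# Round brackets are evaluated first
with_brackets <- 5*(5 + 10)/2    # 5 * 15 / 2

# Three equivalent ways of assigning the value 5
a <- 5
5 -> b
c_val = 5

print(no_brackets)     # 30
print(with_brackets)   # 37.5
print(identical(a, b) && identical(b, c_val))   # TRUE
```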

R Functions

R functions are invoked by their name, followed by parentheses that contain zero or more arguments.

All existing variables can be listed using the ls() function:

ls()
[1] "a"

Variables can be removed using the rm() function:

> rm(a)

If you have lots of variables you want to get rid of (e.g. because you are starting over):

rm(list=ls())
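
The following sketch shows why rm(list=ls()) works: ls() returns the variable names as a character vector, and the list argument of rm() accepts exactly such a vector. The names tmp_a and tmp_b are throwaway examples, and only those two are removed here so that nothing else in your session is touched.

```r
# Create two throwaway variables
tmp_a <- 5
tmp_b <- 10

# ls() returns the names of existing variables as character strings
print(c("tmp_a", "tmp_b") %in% ls())

# rm() accepts bare variable names ...
rm(tmp_a)

# ... or a character vector of names through the list argument
rm(list = c("tmp_b"))

# Neither variable is listed any more
still_there <- c("tmp_a", "tmp_b") %in% ls()
print(still_there)
```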

R contains extensive documentation for all its functions, which can be accessed using the ? operator or help() function:

?ls

As you can see, the ? operator does not need parentheses. The documentation is also available online: http://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Data types

There are various data types in R. One of the most fundamental is the vector, which is an array or list of numbers or words. That is only a description, though; in R it is best to use the term vector for these things. For example, a single number is seen by R as a vector of size 1.

R's data frames are also fundamental, though they are made up of vectors. They are used to store tables of data.

As analyses get complicated, many packages assemble their own data types made up of multiple vectors and data frames, often of different sizes. R often calls these objects, but they can also be called data structures, which is more descriptive. By analogy, we could call vectors apartments, data frames apartment buildings, and data structures apartment building complexes.

Vectors

A vector is a sequence of data elements of the same basic type (e.g. numbers or character strings). The elements of a vector are called components.

Vectors can be created in several ways, e.g. by specifying a range or by using the c() or seq() function:

> x <- 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10

> y <- c(1,1,2,3,5,8,13,21,34,55)
> y
[1] 1 1 2 3 5 8 13 21 34 55

> z <- seq(from = 2, to = 100, by = 2)
> z
 [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
[20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
[39]  78  80  82  84  86  88  90  92  94  96  98 100

Note that the bracketed number at the start of each output line is the index of the first component shown on that line.

The components of a vector can be accessed through their indices:

> z[1]
[1] 2
> z[c(1,10,20)]
[1]  2 20 40
> z[1:20]
 [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
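
As a compact recap of this section, the sketch below rebuilds the three vectors and checks a few of their properties. length() and is.vector() are base R functions not introduced in the text above; they are used here only to inspect the results.

```r
# Three ways of creating a vector
x <- 1:10                                  # a range
y <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)   # listing the components
z <- seq(from = 2, to = 100, by = 2)       # a regular sequence

# Number of components in z
print(length(z))          # 50

# Access by a single index, by several indices, and by a range
print(z[1])               # 2
print(z[c(1, 10, 20)])    # 2 20 40
print(z[1:3])             # 2 4 6

# A single number is itself a vector of size 1
print(is.vector(5))       # TRUE
print(length(5))          # 1
```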

Data frames

Data frames are the most common data type in R. A data frame is basically a table that contains vectors of equal length. The key is that these vectors are what you would normally think of as the columns, not the rows, of the table. The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell. The data in the columns of a data frame can be of various types; for example, one column might contain numbers while another might contain text. The columns and rows of a data frame are also referred to as variables and observations, respectively.

Creating a data frame

In the following example you will create a data frame from a comma-separated (.csv) file. The genomes.csv file (which is in the directory /home/training/08_Introduction_to_r) contains information about a number of animal genomes from the Animal Genome Size Database, namely the scientific (Latin) name of the species, the taxonomic class it belongs to, the number of chromosomes and the genome size:

,name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61
...

Note that the first line in the file contains the header names and that the lines containing the actual data all start with the common name of the species ("African clawed frog", "Atlantic cod" etc.).

Create a data frame named "genomes" from the file genomes.csv using read.csv():

> genomes <- read.csv("genomes.csv", row.names = 1)

where:

  • genomes.csv: the file from which the data frame is created
  • row.names = 1: the contents of the first column in the genomes.csv file ("African clawed frog", "Atlantic cod" etc.) are used as row names

The function read.csv is a more specific version of the generic read.table function, with a number of default arguments set specifically for use with comma-separated files. There is also a version for use with tab-separated files, i.e. read.delim.

Display the internal structure of the genomes data frame using str():

> str(genomes)
'data.frame': 20 obs. of 4 variables:
 $ name       : Factor w/ 20 levels "Aedes aegypti",..: 20 10 19 18 14 9 5 11 6 17 ...
 $ class      : Factor w/ 6 levels "Amphibia","Aves",..: 1 5 4 4 5 4 4 2 4 4 ...
 $ chromosomes: int 36 46 42 48 48 38 60 78 78 54 ...
 $ size       : num 3.09 0.4 3.05 3.76 3.61 2.91 3.7 1.25 2.8 3.06 ...

where:

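  • Factor: a vector whose elements can take on one of a specific set of values ("levels"). Note that from R 4.0.0 onward, read.csv() keeps text columns as character (chr) unless it is called with stringsAsFactors = TRUE.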
  • int: integer (whole number)
  • num: numeric (decimal number)

Retrieve the dimensions of the data frame using dim():

> dim(genomes)
[1] 20 4

This means that the data frame consists of 20 rows (observations) and 4 columns (variables). Note that the header and the column with the row names are not included in this count.
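
If the genomes.csv file is not at hand, the steps above can be tried on the five rows shown earlier. This is only a sketch: it passes the rows as a string through the text argument of read.csv() instead of a file name, the name genomes_small is made up, and stringsAsFactors = TRUE is set explicitly so that the text columns come out as factors regardless of the R version.

```r
# The five data rows shown earlier, embedded as text so that the
# example does not need genomes.csv on disk
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61"

# row.names = 1: use the first column (the common names) as row names
# stringsAsFactors = TRUE: store text columns as factors
genomes_small <- read.csv(text = csv_text, row.names = 1,
                          stringsAsFactors = TRUE)

# Inspect the structure and the dimensions (rows, columns)
str(genomes_small)
print(dim(genomes_small))
```

With these five rows, dim() reports 5 rows and 4 columns; as in the full example, the row names are not counted as a column.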

To print the whole data frame just type its name:

> genomes

To print only the first or last rows use head() and tail(), respectively:

> head(genomes)

Retrieving data

The column vectors can be retrieved in various ways, i.e. using double brackets [[]], the $ operator, or single brackets []:

> genomes[[3]]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48

> genomes[["chromosomes"]]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48

> genomes$chromosomes
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48

> genomes[,3]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48

> genomes[,"chromosomes"]
 [1] 36 46 42 48 48 38 60 78 78 54  8 36 46  6 80 40 42 80  6 48

Note that when using single brackets, the column number (or name) is prepended with a comma character, which signals a wildcard match for the row position (meaning that we retrieve the value for the column indicated for every row).
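
The sketch below checks that these notations really return the same vector. It uses a three-row data frame typed in by hand with data.frame() (a base R function not covered above); the name genomes_small and the by_* variable names are illustrative only.

```r
# A small hand-made data frame with three of the species
genomes_small <- data.frame(
  name = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  class = c("Amphibia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36L, 46L, 42L),
  size = c(3.09, 0.4, 3.05),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat")
)

# Four ways of retrieving the chromosomes column as a vector
by_position <- genomes_small[[3]]
by_name     <- genomes_small[["chromosomes"]]
by_dollar   <- genomes_small$chromosomes
by_comma    <- genomes_small[, "chromosomes"]

print(by_position)   # 36 46 42

# All four results are identical
print(identical(by_position, by_name) &&
      identical(by_name, by_dollar) &&
      identical(by_dollar, by_comma))   # TRUE
```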

Column slices can be retrieved using single brackets []:

> genomes[3]
                    chromosomes
African clawed frog          36
Atlantic cod                 46
Brown rat                    42
...

> genomes["chromosomes"]
                    chromosomes
African clawed frog          36
Atlantic cod                 46
Brown rat                    42
...

> genomes[c(3,4)]
                    chromosomes size
African clawed frog          36 3.09
Atlantic cod                 46 0.40
Brown rat                    42 3.05
...

As can row slices:

> genomes[13,]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5
> genomes["Human",]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5

And combinations of both:

> genomes[13,3]
[1] 46
> genomes["Human","chromosomes"]
[1] 46

Subsets of data can be selected using which():

> genomes[which(genomes$class == "Mammalia"),]
                                         name    class chromosomes size
Brown rat                   Rattus norvegicus Mammalia          42 3.05
Chimpanzee                    Pan troglodytes Mammalia          48 3.76
Domestic cat                      Felis catus Mammalia          38 2.91
Domestic cattle                    Bos taurus Mammalia          60 3.70
Domestic dog                 Canis familiaris Mammalia          78 2.80
Duck-billed platypus Ornithorhynchus anatinus Mammalia          54 3.06
Human                            Homo sapiens Mammalia          46 3.50
Mouse                            Mus musculus Mammalia          40 3.25

> genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46),]
                                         name    class chromosomes size
Chimpanzee                    Pan troglodytes Mammalia          48 3.76
Domestic cattle                    Bos taurus Mammalia          60 3.70
Domestic dog                 Canis familiaris Mammalia          78 2.80
Duck-billed platypus Ornithorhynchus anatinus Mammalia          54 3.06

> genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46), "name"]
[1] "Pan troglodytes"          "Bos taurus"
[3] "Canis familiaris"         "Ornithorhynchus anatinus"

Logical operators

  • ==: is equal to
  • !=: is not equal to
  • >: is greater than
  • <: is less than
  • >=: is greater than or equal to
  • <=: is less than or equal to
  • &: and
  • |: or
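
To see what which() actually receives, the sketch below evaluates the conditions step by step on the first five species from genomes.csv, typed in by hand. The names genomes_small, is_mammal, many_chrom, hits and either are made up for this illustration.

```r
# The first five species from genomes.csv, entered by hand
genomes_small <- data.frame(
  class = c("Amphibia", "Osteichthyes", "Mammalia", "Mammalia", "Osteichthyes"),
  chromosomes = c(36, 46, 42, 48, 48),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat",
                "Chimpanzee", "Coelacanth")
)

# Each comparison yields one TRUE or FALSE per row
is_mammal  <- genomes_small$class == "Mammalia"
many_chrom <- genomes_small$chromosomes > 46
print(is_mammal)    # FALSE FALSE  TRUE  TRUE FALSE
print(many_chrom)   # FALSE FALSE FALSE  TRUE  TRUE

# & keeps rows where both conditions hold; which() turns the
# TRUE positions into row indices
hits <- which(is_mammal & many_chrom)
print(hits)                      # 4
print(genomes_small[hits, ])     # the Chimpanzee row

# | keeps rows where at least one condition holds
either <- row.names(genomes_small)[which(is_mammal | many_chrom)]
print(either)   # "Brown rat" "Chimpanzee" "Coelacanth"
```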

Plotting

Plots are an important part of statistics, so it is no surprise that R has many plotting facilities.

Let's plot genome size against the number of chromosomes for the 20 species in our genomes data frame (i.e. the genome size on the y-axis and the number of chromosomes on the x-axis) using plot():

> plot(genomes$chromosomes, genomes$size)

where:

  • genomes$chromosomes: the x coordinates of points in the plot
  • genomes$size: the y coordinates of points in the plot

You can make plots as elaborate as you want.

To add a title for the plot and labels for the axes, and to colour the data points by the taxonomic class to which each species belongs:

> plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)

where:

  • main = "Genome Size vs. Number of Chromosomes for 20 Animal Species": an overall title for the plot
  • xlab = "Number of Chromosomes": a title for the x-axis
  • ylab = "Genome Size (Gb)": a title for the y-axis
  • col = genomes$class: the variable on which the colour of the data points is based (class must be a factor for this to work)

To add labels to the 20 points in the plot, use text():

> text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)

where:

  • genomes$chromosomes, genomes$size: numeric vectors of coordinates where the text should be written
  • row.names(genomes): character vector specifying the text to be written
  • cex = 0.5: character size
  • pos = 1: a position specifier for the text (1 means below)

To save the plot as a Portable Network Graphics (.png) file, first create a file using png():

> png('size_vs_chrom.png')

Then run the commands to generate the plot:

> plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)

> text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)

And finally close the file using dev.off():

> dev.off()

This saves the file size_vs_chrom.png in the working directory (/home/training).
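
The whole png() / plot() / dev.off() sequence can be tried without the genomes.csv file, as sketched below. The three-row data frame is typed in by hand, the file is written to R's temporary directory rather than the working directory, and the names genomes_small and out_file are assumptions made for this example. It also assumes that your R installation supports the png device.

```r
# Three species entered by hand; class is stored as a factor so that
# it can be used to colour the points
genomes_small <- data.frame(
  class = factor(c("Amphibia", "Osteichthyes", "Mammalia")),
  chromosomes = c(36, 46, 42),
  size = c(3.09, 0.4, 3.05),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat")
)

# Write to the session's temporary directory to avoid cluttering
# the working directory
out_file <- file.path(tempdir(), "size_vs_chrom_demo.png")

# 1. Open the file as a graphics device
png(out_file)

# 2. Draw the plot and add the labels; the output goes to the file
plot(genomes_small$chromosomes, genomes_small$size,
     xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)",
     col = genomes_small$class)
text(genomes_small$chromosomes, genomes_small$size,
     row.names(genomes_small), cex = 0.5, pos = 1)

# 3. Close the device so that the file is completed
dev.off()

print(file.exists(out_file))
```

Nothing appears on screen between png() and dev.off() because all drawing commands are redirected to the file.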

Writing to files

The content of a data frame can be written to a file using write.table():

> write.table(genomes, "genomes.txt")

This saves the file genomes.txt in the working directory (/home/training).
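
As a quick check that the written file is usable, the sketch below writes a small hand-made data frame and reads it back with read.table(). The data frame, the names genomes_small, out_file and round_trip, and the use of the temporary directory are all assumptions made for this example.

```r
# A small hand-made data frame with three of the species
genomes_small <- data.frame(
  name = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  chromosomes = c(36L, 46L, 42L),
  size = c(3.09, 0.4, 3.05),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat")
)

# Write to the session's temporary directory
out_file <- file.path(tempdir(), "genomes_demo.txt")
write.table(genomes_small, out_file)

# By default the file is space-separated, text is quoted, and the
# header line has one field fewer than the data lines
print(readLines(out_file, n = 2))

# Because of that missing header field, read.table() uses the first
# column as row names when reading the file back
round_trip <- read.table(out_file, header = TRUE)
print(round_trip)
```

If a comma-separated file is preferred, write.csv() is the counterpart of the read.csv() function used earlier.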

Useful resources