Key Aspects of R Exercise

Latest revision as of 12:14, 11 May 2017

Aims

R is a powerful programming language and software environment for statistical computing and graphics. It is also open source, meaning that it is free to use and has a wide community of developers who have produced many scientific add-on packages and libraries that can be easily accessed.

We shall be using the basic R interpreter, and not RStudio, which in fact is the same interpreter but in a more visually appealing guise.

In this part you will learn to:

  • use basic R commands
  • find your way around.

We shall practice with the following toy file:

  • genomes.csv: information about a number of animal genomes

Getting in and out

To start, type R and hit Enter:

R

With this command you move out of the Linux command line and into the "R interpreter". You can tell by the prompt, which is now a > and not a $.

To quit R, use quit(), or, shorter, q()

> q()

Finding one's location

OK, now go back in and check which working directory we are in by using the getwd() function:

getwd()

See what files and directories are there:

dir()

The working directory can be changed using setwd(), which, like the Linux command line, supports tab-completion. Type

setwd("i2<TAB>

This should complete to

setwd("i2rda_data

which you finish off with a ")

Calculator

R can be used as a glorified calculator:

5*5 + 10/2
[1] 30

R respects standard mathematical rules: multiplication and division are carried out before addition and subtraction. Operations can be grouped with round brackets, so they are carried out first.
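
As a quick check of these precedence rules, the following sketch evaluates the same numbers with and without round brackets. The variable names are only illustrative.

```r
# Multiplication and division are evaluated before addition: 25 + 5
no_brackets <- 5*5 + 10/2

# Round brackets force the addition to be carried out first: 35 / 2
with_brackets <- (5*5 + 10)/2

no_brackets
with_brackets
```

The two results differ only because of where the brackets are placed.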

Values can be assigned to a variable using the <- assignment operator:

a <- 5

Alternatively, it is also possible to write 5 -> a or a = 5, but a <- 5 is the traditional way. It can be good to use <- over = to keep reminding yourself that you are actually in the R environment.

To print out a simple variable, you can just type it.

a
[1] 5

Be aware that more complicated variables may throw out all their contents on the screen, so this simple method of seeing a variable is not always the best.

R Functions

R functions are invoked by their name, followed by parentheses that contain zero or more arguments.

All existing variables can be listed using the ls() function:

ls()
[1] "a"

Variables can be removed using the rm() function:

> rm(a)

If you have lots of variables you want to get rid of (e.g. because you are starting over):

rm(list=ls())

R contains extensive documentation for all its functions, which can be accessed using the ? operator or help() function:

?ls

As you can see, the parentheses are not required. R's help is mostly the same as our friend the pager less and also the editor vim: i.e. q to quit, G to go to the bottom (useful, because that's where the examples are), g to go to the top, ctrl+f to go forwards, ctrl+b to go backwards, / to search for a term. Those keybindings should be getting familiar now, and hopefully you will also recognise that they are quite fast. The documentation is also available online: http://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Data types

There are various data types in R. One of the most fundamental is the vector, which is an array or a list of numbers or words. That is a description, but in R it is best to use the term vector for these things. For example, a simple number is seen by R as a vector of size 1.

R's data frames are also fundamental, though they are made up of vectors. They are used to store tables of data.

As analyses get more complicated, many packages assemble their own data types made up of multiple vectors and multiple data frames, often of different sizes. R often calls these objects, but they can also be called data structures, which is more descriptive, as they really are structures of data. Using that analogy, we could call vectors apartments, data frames apartment buildings, and data structures apartment building complexes.

Vectors

A vector is a sequence of data elements of the same basic type (e.g. numbers or character strings). The elements of a vector are called components.

Vectors can be created in several ways, e.g. by specifying a range or by using the c() or seq() functions:

x <- 1:10
y <- c(1,1,2,3,5,8,13,21,34,55)
z <- seq(from = 3, to = 30, by = 9)
z
[1] 3 12 21 30

Note that the first, square-bracketed number in the output is not part of the vector, but the index of the first element of the vector. This is useful when looking at large vectors or objects, as each new line will be preceded by the index of the first element on that line.

The components of a vector can be accessed through their indices in this manner:

z[2]
[1] 12

Multiple indices need to be wrapped in a c().

z[c(2,4)]
[1] 12 30
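
Beyond single indices and c(), a few other indexing forms are worth trying. The following is a minimal sketch using the same vector z as above; the result names are illustrative only.

```r
z <- seq(from = 3, to = 30, by = 9)   # 3 12 21 30

# A range of indices selects consecutive components
middle <- z[2:3]

# A negative index drops that component and keeps the rest
without_first <- z[-1]

# A logical condition keeps the components for which it is TRUE
above_ten <- z[z > 10]

middle
without_first
above_ten
```

Here both without_first and above_ten contain 12, 21 and 30, because 3 is both the first component and the only one not above 10.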

Data frames

Data frames are the most common data type in R. A data frame is basically a table made up of vectors of equal length, which form its columns. The rows are usually thought of as the observations, and the columns as the measurements, i.e. the values of certain attributes of each observation. The top line, called the header, contains the column names, which can be simple indices or proper names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data.

Creating a data frame

In the following example you will create a data frame from a comma-separated (.csv) file. The genomes.csv file (which is in the directory ~/i2rda_data/06_R_Usage_Refresher) contains information about a number of animal genomes from the Animal Genome Size Database, i.e. the scientific (Latin) name of the species, the taxonomic class it belongs to, the number of chromosomes and its size:

,name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61
etc

Note that the first line in the file contains the header names and that the lines containing the actual data all start with the common name of the species ("African clawed frog", "Atlantic cod" etc.)

Create a data frame named "genomes" from the file genomes.csv using read.csv():

genomes <- read.csv("genomes.csv", row.names = 1)

where:

  • genomes.csv: the file from which the data frame is created
  • row.names = 1: the contents of the first column in the genomes.csv file ("African clawed frog", "Atlantic cod" etc.) are used as row names

The function read.csv is a more specific version of the generic read.table function, with a number of default arguments set specifically for use with comma-separated files. There is also a version for use with tab-separated files, i.e. read.delim.
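
To illustrate the relationship between the two functions, here is a minimal sketch that writes the first few lines of the listing above to a temporary file (so that it runs without genomes.csv) and reads it back both ways. The temporary file and the variable names are assumptions made for the example.

```r
# A few lines copied from the genomes.csv listing above
csv_lines <- c(
  ",name,class,chromosomes,size",
  "African clawed frog,Xenopus laevis,Amphibia,36,3.09",
  "Atlantic cod,Gadus morhua,Osteichthyes,46,0.4",
  "Brown rat,Rattus norvegicus,Mammalia,42,3.05"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# read.csv() relying on its defaults
via_csv <- read.csv(tmp, row.names = 1)

# read.table() with the comma-separated settings spelled out by hand
via_table <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                        dec = ".", fill = TRUE, comment.char = "",
                        row.names = 1)

# Both calls should produce the same data frame
identical(via_csv, via_table)
```

In everyday use read.csv() is the more convenient choice; the longer read.table() call is shown only to make the defaults visible.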

Display the internal structure of the genomes data frame using str():

str(genomes)
'data.frame': 20 obs. of 4 variables:
 $ name       : Factor w/ 20 levels "Aedes aegypti",..: 20 10 19 18 14 9 5 11 6 17 ...
 $ class      : Factor w/ 6 levels "Amphibia","Aves",..: 1 5 4 4 5 4 4 2 4 4 ...
 $ chromosomes: int 36 46 42 48 48 38 60 78 78 54 ...
 $ size       : num 3.09 0.4 3.05 3.76 3.61 2.91 3.7 1.25 2.8 3.06 ...

where:

  • Factor: a vector whose elements can take on one of a specific set of values ("levels")
  • int: integer (whole number)
  • num: numeric (decimal number)
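
To make the idea of a factor more concrete, the sketch below builds one by hand from the class values of the five species listed earlier. Note that from R 4.0.0 onwards read.csv() no longer converts text columns to factors by default, so on a recent installation str() may report chr instead of Factor unless stringsAsFactors = TRUE is passed. The variable names here are illustrative.

```r
# Taxonomic classes of the first five species in the listing above
classes <- c("Amphibia", "Osteichthyes", "Mammalia", "Mammalia", "Osteichthyes")

# Convert the character vector into a factor
class_factor <- factor(classes)

# The levels are the distinct values, sorted alphabetically
levels(class_factor)

# Internally each element is stored as the integer code of its level;
# these are the numbers that str() prints after the levels
as.integer(class_factor)

# Count how many observations fall into each level
table(class_factor)
```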

You will note that after the dollar sign we get the name of the column or variable. With more complicated structures, these names are often not shown, but they can always be found with the names() function. Try

names(genomes)

and see if you get the variable names. Together, the str() and names() functions give you a way of doing what is called "introspecting" your objects or data structures.

You retrieve the dimensions of the data frame using dim():

dim(genomes)
[1] 20 4

This means that the data frame consists of 20 rows (observations) and 4 columns (variables). Note that the header and the column with the row names are not included in this count.

To print the whole data frame just type its name:

genomes

To print only the first or last rows use head() and tail(), respectively:

head(genomes)

Retrieving data

The column vectors can be retrieved in several ways, i.e. using double brackets, the $ operator, or single brackets [], like the following (the double bracket method is omitted).

genomes$chromosomes

or

genomes[,3]

or

genomes[,"chromosomes"]

Note that when using single brackets, the row specification before the comma is left empty, and the column number (or name) follows the comma. An empty row specification selects every row, so we retrieve the value of the indicated column for every row.
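
As a check that these forms really return the same column vector, here is a minimal sketch on a small hand-built data frame. The name toy and its three rows (copied from the listing earlier) are assumptions made so that the example runs without genomes.csv.

```r
# A three-row stand-in for the genomes data frame
toy <- data.frame(
  name = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  class = c("Amphibia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36L, 46L, 42L),
  size = c(3.09, 0.4, 3.05),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat")
)

by_dollar <- toy$chromosomes          # the $ operator
by_number <- toy[, 3]                 # single brackets, column number
by_name   <- toy[, "chromosomes"]     # single brackets, column name
by_double <- toy[["chromosomes"]]     # double brackets (omitted in the text)

# All four are the same integer vector
identical(by_dollar, by_number)
identical(by_dollar, by_name)
identical(by_dollar, by_double)
```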

Column slices can be retrieved using single brackets []:

genomes[3]
                    chromosomes
African clawed frog          36
Atlantic cod                 46
Brown rat                    42
...

genomes["chromosomes"]
                    chromosomes
African clawed frog          36
Atlantic cod                 46
Brown rat                    42
...

genomes[c(3,4)]
                    chromosomes size
African clawed frog          36 3.09
Atlantic cod                 46 0.40
Brown rat                    42 3.05
...
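
The difference between a column slice and a column vector can be seen in the following sketch. It again uses a small hand-built data frame called toy (an assumption for illustration, with rows copied from the listing earlier) rather than the full genomes data.

```r
# A three-row stand-in for the genomes data frame
toy <- data.frame(
  name = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  class = c("Amphibia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36L, 46L, 42L),
  size = c(3.09, 0.4, 3.05),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat")
)

as_frame  <- toy[3]      # no comma: a one-column data frame, row names kept
as_vector <- toy[, 3]    # with comma: a plain vector, row names dropped

is.data.frame(as_frame)
is.data.frame(as_vector)

# drop = FALSE keeps the data frame form even when the comma is used
kept_frame <- toy[, 3, drop = FALSE]
is.data.frame(kept_frame)
```

The slice form is handy when the row names should stay attached to the values; the vector form is what most numeric functions expect.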

As can row slices:

genomes[13,]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5

genomes["Human",]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5

And combinations of both:

genomes[13,3]
[1] 46
genomes["Human","chromosomes"]
[1] 46

Subsets

Subsets of data can be selected using which():

genomes[which(genomes$class == "Mammalia"),]

We add another condition for closer focus:

genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46),]

and

genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46), "name"]
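
The same kind of selection can be written in more than one way. The sketch below compares which() with plain logical indexing and with the subset() function, using a five-row stand-in data frame named toy (an assumption for illustration, with values copied from the listing earlier).

```r
# A five-row stand-in for the genomes data frame
toy <- data.frame(
  name = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus",
           "Pan troglodytes", "Latimeria chalumnae"),
  class = c("Amphibia", "Osteichthyes", "Mammalia", "Mammalia", "Osteichthyes"),
  chromosomes = c(36L, 46L, 42L, 48L, 48L),
  size = c(3.09, 0.4, 3.05, 3.76, 3.61),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat",
                "Chimpanzee", "Coelacanth")
)

# which() turns the TRUE/FALSE vector into row numbers
with_which <- toy[which(toy$class == "Mammalia" & toy$chromosomes > 46), ]

# The TRUE/FALSE vector can also be used directly as the row specification
with_logical <- toy[toy$class == "Mammalia" & toy$chromosomes > 46, ]

# subset() lets the column names be used without the toy$ prefix
with_subset <- subset(toy, class == "Mammalia" & chromosomes > 46)

row.names(with_which)
row.names(with_logical)
row.names(with_subset)
```

One practical difference: if the condition produces missing values (NA), which() silently leaves those rows out, whereas direct logical indexing returns a row of NAs for each of them.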

Logical operators

  • == is equal to
  • != is not equal to
  • > is greater than
  • < is less than
  • >= is greater than or equal to
  • <= is less than or equal to
  • & and
  • | or
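
These operators work element by element on whole vectors, which is what makes them useful for subsetting. A minimal sketch, using the chromosome counts of the five species listed earlier (the variable names are illustrative):

```r
chromosomes <- c(36, 46, 42, 48, 48)

# Each comparison returns one TRUE or FALSE per element
is_48      <- chromosomes == 48
not_48     <- chromosomes != 48

# & requires both conditions to hold, | requires at least one
in_range   <- chromosomes >= 42 & chromosomes <= 46
at_extreme <- chromosomes < 40 | chromosomes > 46

is_48
not_48
in_range
at_extreme
```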

Plotting

Plots are an important part of statistics, so it is no surprise that R has many plotting facilities.

Let's plot the genome size against the number of chromosomes for the 20 species in our genomes data frame (i.e. the genome size on the y-axis and the number of chromosomes on the x-axis) using plot():

plot(genomes$chromosomes, genomes$size)

where:

  • genomes$chromosomes: the x coordinates of points in the plot
  • genomes$size: the y coordinates of points in the plot

You can make plots as elaborate as you want.

To add titles for the plot and labels for the axes as well as base the colour of the data points on the taxonomic class to which the species belongs:

plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)

where:

  • main = allows you to write the overall title for the plot. Think of it as the "main title"
  • xlab = allows you to specify a title for the x-axis
  • ylab = similarly for the y-axis
  • col = genomes$class: variable on which the colour of the data points is based
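
How a factor ends up as colours may not be obvious. The sketch below shows the usual mechanism under the assumption that the class column is a factor: its integer codes are used as positions in the current colour palette. If the column was read as plain text (the default from R 4.0.0 onwards), wrapping it as factor(genomes$class) should give the same effect. The variable names are illustrative.

```r
# Taxonomic classes of the first five species in the listing earlier
classes <- factor(c("Amphibia", "Osteichthyes", "Mammalia",
                    "Mammalia", "Osteichthyes"))

# The integer code of each element's level
codes <- as.integer(classes)

# Numeric colours are looked up in the palette, so each code picks one entry
point_colours <- palette()[codes]

# One row per data point: its class, its code, and the resulting colour
data.frame(class = classes, code = codes, colour = point_colours)
```

Species in the same class share a code and therefore a colour; the actual colour names depend on the palette of the R version in use.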

To add labels to the 20 points in the plot, use text():

text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)

where:

  • genomes$chromosomes, genomes$size: numeric vectors of coordinates where the text should be written
  • row.names(genomes): character vector specifying the text to be written
  • cex = 0.5: character size
  • pos = 1: a position specifier for the text (1 means below)

To save the plot as a Portable Network Graphics (.png) file, first create a file using png():

png('size_vs_chrom.png')

Then run the commands to generate the plot:

plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)
text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)

And finally close the file using dev.off():

dev.off()

This saves the file size_vs_chrom.png in the working directory (your home directory, ~, unless you changed it with setwd()).

Writing to files

The content of a data frame can be written to a file using write.table():

write.table(genomes, "genomes.txt")

This saves the file genomes.txt in the working directory (~, unless you changed it with setwd()).
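
To see what write.table() produces and that the result can be read back in, here is a minimal round-trip sketch. It uses a small stand-in data frame called toy and a file in R's temporary directory, both of which are assumptions made so that the example does not touch your real files.

```r
# A three-row stand-in for the genomes data frame
toy <- data.frame(
  name = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  class = c("Amphibia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36L, 46L, 42L),
  size = c(3.09, 0.4, 3.05),
  row.names = c("African clawed frog", "Atlantic cod", "Brown rat")
)

# Write to a temporary location rather than the working directory
out_file <- file.path(tempdir(), "toy_genomes.txt")
write.table(toy, out_file)

# Look at the raw text: space-separated, with quoted strings and row names
readLines(out_file)

# Read it back; the header has one field fewer than the data lines,
# so read.table() uses the first column as row names
round_trip <- read.table(out_file, header = TRUE)
round_trip
```

For a comma-separated file, write.csv() can be used in the same way.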

Useful resources for further study