Key Aspects of R Exercise

Latest revision as of 12:14, 11 May 2017

Aims

R is a powerful programming language and software environment for statistical computing and graphics. It is also open source, meaning that it is free to use and has a wide community of developers who have produced many scientific add-on packages and libraries that can be easily accessed.

We shall be using the basic R interpreter, and not RStudio, which in fact is the same interpreter but in a more visually appealing guise.

In this part you will learn to:

  • use basic R commands
  • find your way around.

We shall practice with the following toy file:

  • genomes.csv: information about a number of animal genomes

Getting in and out

To start, type R and hit Enter:

R

With this command you move out of the Linux command line and into the "R interpreter". You can tell by the prompt, which is a > and not a $.

To quit R, use quit() or, shorter, q():

q()

Finding one's location

OK, now go back in and check which working directory we are in by using the getwd() function:

getwd()

See what files and directories there are there:

dir()

The working directory can be changed using setwd(), which, like the Linux command line, supports tab completion. Type

setwd("i2<TAB>

This should complete to

setwd("i2rda_data

which you finish off with a ")

Calculator

R can be used as a glorified calculator:

5*5 + 10/2
[1] 30

R respects standard mathematical rules: multiplication and division are carried out before addition and subtraction. Operations can be grouped with round brackets, so they are carried out first.
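As a small illustration of the grouping rule above, the same numbers give a different result once round brackets are added. The variable names here are only for illustration:

```r
# Without brackets: multiplication and division are carried out first
plain <- 5*5 + 10/2        # 25 + 5
plain                      # 30

# With brackets: the bracketed sum is evaluated before the division
grouped <- (5*5 + 10)/2    # 35 / 2
grouped                    # 17.5
```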

Values can be assigned to a variable using the <- assignment operator:

a <- 5

Alternatively, it is also possible to write 5 -> a or a = 5, but a <- 5 is the traditional way. It can be good to use <- over = to keep reminding yourself that you are actually in the R environment.
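A quick sketch to convince yourself that the three forms do the same thing (the names a, b and d are arbitrary):

```r
a <- 5    # traditional form
5 -> b    # same assignment, written left to right
d = 5     # also works at the prompt

identical(a, b)   # TRUE
identical(a, d)   # TRUE
```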

To print out a simple variable, you can just type it.

a
[1] 5

Be aware that more complicated variables may throw out all their contents on the screen, so this simple method of seeing a variable is not always the best.

R Functions

R functions are invoked by their name, followed by parentheses that contain zero or more arguments.

All existing variables can be listed using the ls() function:

ls()
[1] "a"

Variables can be removed using the rm() function:

rm(a)

If you have lots of variables you want to get rid of (e.g. because you are starting over):

rm(list=ls())

R contains extensive documentation for all its functions, which can be accessed using the ? operator or help() function:

?ls

As you can see, the parentheses are not required. R's help viewer behaves mostly like our friend the pager less and also the editor vim: i.e. q to quit, G to go to the bottom (useful, because that's where the examples are), g to go to the top, ctrl+f to go forwards, ctrl+b to go backwards, / to search for a term. Those keybindings should be getting familiar now, and hopefully you will also recognise that they are quite fast. The documentation is also available online: http://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Data types

There are various data types in R. One of the most fundamental is the vector, which is an array or a list of numbers or words. That is a description, but in R it is best to use the term vector for these things. For example, a single number is seen by R as a vector of length 1.
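The claim that a single value is already a vector can be checked directly at the prompt. This is only a sketch; the names n and word are arbitrary:

```r
n <- 5
is.vector(n)    # TRUE: a single number is a vector
length(n)       # 1
n[1]            # 5, the first (and only) component

word <- "frog"
length(word)    # 1: one character string, not four letters
```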

R's data frames are also fundamental, though they are made up of vectors. They are used to store tables of data.

As analyses get complicated, many packages assemble their own data types made up of multiple vectors and multiple data frames, often of different sizes. R often calls these objects, but they can also be called data structures, which is more descriptive, as they really are structures of data. Using a building analogy, we could call vectors apartments, data frames apartment buildings, and data structures apartment building complexes.

Vectors

A vector is a sequence of data elements of the same basic type (e.g. numbers or character strings). The elements of a vector are called components.

Vectors can be created in several ways, i.e. by specifying a range or by using the c() or seq() function:

x <- 1:10
y <- c(1,1,2,3,5,8,13,21,34,55)
z <- seq(from = 3, to = 30, by = 9)
z
[1] 3 12 21 30

Note that the first, square-bracketed number in the output is not part of the vector, but the index of the first element shown on that line. This is useful when looking at large vectors or objects, as each new line will be preceded by the index of the first element on that line.

The components of a vector can be accessed through their indices in this manner:

z[2]
[1] 12

Multiple indices need to be wrapped in c():

z[c(2,4)]
[1] 12 30
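A few more indexing forms are worth trying on the same vector. This is a sketch beyond what the exercise itself covers: a range of indices, a negative index (which drops a component), and a logical condition.

```r
z <- seq(from = 3, to = 30, by = 9)   # 3 12 21 30

z[2:3]       # 12 21     (a range of indices)
z[-1]        # 12 21 30  (everything except the first component)
z[z > 10]    # 12 21 30  (only the components larger than 10)
length(z)    # 4
```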

Data frames

Data frames are the most common data type in R. A data frame is basically a table that contains vectors of equal length. These vectors are the columns of the table, not the rows: each column holds the measurements, or the values of a certain attribute, for all observations, while each row is usually thought of as one observation. The top line, called the header, contains the column names, which can be simple indices or proper names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data.
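To make the "columns are vectors" idea concrete, here is a minimal sketch that builds a tiny data frame by hand from three vectors. The name mini is hypothetical, and the values are simply the first three rows of the genomes file used later in this exercise.

```r
# Three vectors of equal length become the three columns
name        <- c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus")
chromosomes <- c(36, 46, 42)
size        <- c(3.09, 0.4, 3.05)

mini <- data.frame(name, chromosomes, size,
                   row.names = c("African clawed frog", "Atlantic cod", "Brown rat"))

mini                # prints the table with header and row names
dim(mini)           # 3 3: three rows (observations), three columns (variables)
mini$chromosomes    # 36 46 42: the column comes back as a vector
```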

Creating a data frame

In the following example you will create a data frame from a comma-separated (.csv) file. The genomes.csv file (which is in the directory ~/i2rda_data/06_R_Usage_Refresher) contains information about a number of animal genomes from the Animal Genome Size Database, i.e. the scientific (Latin) name of the species, the taxonomic class it belongs to, the number of chromosomes and the genome size:

,name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61
etc

Note that the first line in the file contains the header names and that the lines containing the actual data all start with the common name of the species ("African clawed frog", "Atlantic cod" etc.)

Create a data frame named "genomes" from the file genomes.csv using read.csv():

genomes <- read.csv("genomes.csv", row.names = 1)

where:

  • genomes.csv: the file from which the data frame is created
  • row.names = 1: the contents of the first column in the genomes.csv file ("African clawed frog", "Atlantic cod" etc.) are used as row names

The function read.csv is a more specific version of the generic read.table function, with a number of default arguments set specifically for use with comma-separated files. There is also a version for use with tab-separated files, i.e. read.delim.
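The relationship between read.csv and read.table can be sketched as follows. To keep the example self-contained it writes a two-row copy of the data to a temporary file (the name f and the use of tempfile() are assumptions for illustration, not part of the exercise), then reads it both ways.

```r
# Write a small stand-in for genomes.csv to a temporary file
f <- tempfile(fileext = ".csv")
writeLines(c(",name,class,chromosomes,size",
             "African clawed frog,Xenopus laevis,Amphibia,36,3.09",
             "Atlantic cod,Gadus morhua,Osteichthyes,46,0.4"), f)

# read.csv sets the header and separator for us ...
a <- read.csv(f, row.names = 1)

# ... while read.table needs them spelled out
b <- read.table(f, header = TRUE, sep = ",", row.names = 1)

all.equal(a, b)   # compares the two data frames
```

For a tab-separated file, read.delim plays the same role with the tab character as separator.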

Display the internal structure of the genomes data frame using str():

str(genomes)
'data.frame': 20 obs. of 4 variables:
 $ name       : Factor w/ 20 levels "Aedes aegypti",..: 20 10 19 18 14 9 5 11 6 17 ...
 $ class      : Factor w/ 6 levels "Amphibia","Aves",..: 1 5 4 4 5 4 4 2 4 4 ...
 $ chromosomes: int 36 46 42 48 48 38 60 78 78 54 ...
 $ size       : num 3.09 0.4 3.05 3.76 3.61 2.91 3.7 1.25 2.8 3.06 ...

where:

  • Factor: a vector whose elements can take on one of a specific set of values ("levels")
  • int: integer (whole number)
  • num: numeric (decimal number)
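One caveat about the Factor entries: since R 4.0.0, read.csv no longer converts text columns to factors by default, so on a newer installation str() will report chr instead of Factor for name and class. The sketch below, which uses only the five rows shown earlier (passed in through the text argument rather than from the real file), shows how the stringsAsFactors argument controls this. The names csv_text, as_text and as_factor are hypothetical.

```r
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61"

# Text columns kept as plain character strings
as_text <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = FALSE)
class(as_text$class)       # "character"

# Text columns converted to factors, as in the str() output above
as_factor <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)
class(as_factor$class)     # "factor"
levels(as_factor$class)    # "Amphibia" "Mammalia" "Osteichthyes"
```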

You will note that after the dollar sign, we get the name of the column or variable. With more complicated structures, these names are often not shown, but they can always be found with the names() function. Try

names(genomes)

and see if you get the variable names. Together, the str() and names() functions give you a way of doing what is called "introspecting" your objects or data structures.

You retrieve the dimensions of the data frame using dim():

dim(genomes)
[1] 20 4

This means that the data frame consists of 20 rows (observations) and 4 columns (variables). Note that the header and the column with the row names are not included in this count.

To print the whole data frame just type its name:

genomes

To print only the first or last rows use head() and tail(), respectively:

head(genomes)
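By default head() and tail() show six rows; the n argument changes that. A minimal sketch on a five-row stand-in for the genomes data frame (built from the rows shown earlier, via a hypothetical csv_text string):

```r
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61"
genomes <- read.csv(text = csv_text, row.names = 1)

first_two <- head(genomes, n = 2)   # the first two rows
last_two  <- tail(genomes, n = 2)   # the last two rows

rownames(first_two)   # "African clawed frog" "Atlantic cod"
rownames(last_two)    # "Chimpanzee" "Coelacanth"
```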

Retrieving data

The column vectors can be retrieved in several ways, i.e. using double brackets, the $ operator, or single brackets [], like the following (the double bracket method is omitted).

genomes$chromosomes

or

genomes[,3]

or

genomes[,"chromosomes"]

Note that when using single brackets, the column number (or name) is preceded by an empty row specification and a comma. Leaving the row specification empty means that we retrieve the value of the indicated column for every row.
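For completeness, the double bracket form looks like this. The sketch again uses a five-row stand-in for the genomes data frame, built from the rows shown earlier through a hypothetical csv_text string:

```r
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61"
genomes <- read.csv(text = csv_text, row.names = 1)

genomes[[3]]                # 36 46 42 48 48 (column selected by position)
genomes[["chromosomes"]]    # the same column, selected by name

# All of these return the same plain vector
identical(genomes[[3]], genomes$chromosomes)      # TRUE
identical(genomes[[3]], genomes[, "chromosomes"]) # TRUE
```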

Column slices (smaller data frames rather than plain vectors) can be retrieved using single brackets [] without a comma:

genomes[3]
                    chromosomes
African clawed frog          36
Atlantic cod                 46
Brown rat                    42
...

genomes["chromosomes"]
                    chromosomes
African clawed frog          36
Atlantic cod                 46
Brown rat                    42
...

genomes[c(3,4)]
                    chromosomes size
African clawed frog          36 3.09
Atlantic cod                 46 0.40
Brown rat                    42 3.05
...

As can row slices:

genomes[13,]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5

genomes["Human",]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5

And combinations of both:

genomes[13,3]
[1] 46
genomes["Human","chromosomes"]
[1] 46

Subsets

Subsets of data can be selected using which():

genomes[which(genomes$class == "Mammalia"),]

We add another condition for closer focus:

genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46),]

and

genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46), "name"]
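To see what which() actually contributes, it helps to look at the intermediate steps. This sketch uses only the five rows shown earlier (so the results differ from those for the full 20-row file); csv_text and is_mammal are hypothetical names.

```r
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61"
genomes <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = FALSE)

# The comparison gives one TRUE/FALSE per row ...
is_mammal <- genomes$class == "Mammalia"   # FALSE FALSE TRUE TRUE FALSE

# ... and which() turns that into the row numbers that are TRUE
which(is_mammal)                           # 3 4

# Those row numbers are then used as the row specification
mammals <- genomes[which(is_mammal), ]
rownames(mammals)                          # "Brown rat" "Chimpanzee"

# Two conditions combined with &, keeping only the name column
big <- genomes[which(is_mammal & genomes$chromosomes > 46), "name"]
big                                        # "Pan troglodytes"
```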

Logical operators

  • == is equal to
  • != is not equal to
  • > is greater than
  • < is less than
  • >= is greater than or equal to
  • <= is less than or equal to
  • & and
  • | or
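The operators are applied component by component, which is easiest to see on a short vector. In this sketch, chrom is a hypothetical name holding the chromosome counts of the five species listed earlier:

```r
chrom <- c(36, 46, 42, 48, 48)

chrom >= 46               # FALSE  TRUE FALSE  TRUE  TRUE
chrom != 48               #  TRUE  TRUE  TRUE FALSE FALSE
chrom < 40 | chrom > 46   #  TRUE FALSE FALSE  TRUE  TRUE
chrom > 40 & chrom < 48   # FALSE  TRUE  TRUE FALSE FALSE
```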

Plotting

Plots are an important part of statistics, so it is no surprise that R has many plotting facilities.

Let's plot for the 20 species in our genomes data frame the genome size against the number of chromosomes (i.e. the genome size on the y-axis and the number of chromosomes on the x-axis) using plot():

plot(genomes$chromosomes, genomes$size)

where:

  • genomes$chromosomes: the x coordinates of points in the plot
  • genomes$size: the y coordinates of points in the plot

You can make plots as elaborate as you want.

To add titles for the plot and labels for the axes as well as base the colour of the data points on the taxonomic class to which the species belongs:

plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)

where:

  • main = allows you to write the overall title for the plot. Think of it as "main title"
  • xlab = allows you to specify a title for the x-axis
  • ylab = similarly for the y-axis
  • col = genomes$class: variable on which the colour of the data points is based (this works because class is a factor, so each level is mapped to a colour number)
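A coloured plot is easier to read with a legend. The following is only a sketch, using the five rows shown earlier rather than the full file: it reads the class column as a factor explicitly (stringsAsFactors = TRUE), since col = genomes$class relies on that, and adds a legend() call, which the exercise itself does not cover. The pch = 19 setting (filled circles) and the legend position are arbitrary choices.

```r
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61"
genomes <- read.csv(text = csv_text, row.names = 1, stringsAsFactors = TRUE)

# Each factor level is drawn in its own colour (level 1 = colour 1, etc.)
plot(genomes$chromosomes, genomes$size,
     xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)",
     col = genomes$class, pch = 19)

# The legend lists the levels with the matching colour numbers
class_levels <- levels(genomes$class)
legend("bottomleft", legend = class_levels,
       col = seq_along(class_levels), pch = 19)
```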

To add labels to the 20 points in the plot, use text():

text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)

where:

  • genomes$chromosomes, genomes$size: numeric vectors of coordinates where the text should be written
  • row.names(genomes): character vector specifying the text to be written
  • cex = 0.5: character size
  • pos = 1: a position specifier for the text (1 means below)

To save the plot as a Portable Network Graphics (.png) file, first create a file using png():

png('size_vs_chrom.png')

Then run the commands to generate the plot:

plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)
text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)

And finally close the file using dev.off():

dev.off()

This saves the file size_vs_chrom.png in the working directory (your home directory, ~).

Writing to files

The content of a data frame can be written to a file using write.table():

write.table(genomes, "genomes.txt")

This saves the file genomes.txt in the working directory (~).
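To check what write.table() produced, the file can be read straight back with read.table(). The sketch below works on the five rows shown earlier and writes to a temporary directory instead of the working directory; csv_text, f and back are hypothetical names.

```r
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61"
genomes <- read.csv(text = csv_text, row.names = 1)

f <- file.path(tempdir(), "genomes.txt")
write.table(genomes, f)    # space-separated, text values in quotes

readLines(f, n = 2)        # peek at the header and the first data line

# The header has one field fewer than the data lines, so read.table
# treats the first line as header and the first column as row names
back <- read.table(f)
dim(back)                  # 5 4
identical(rownames(back), rownames(genomes))   # TRUE
```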

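Useful resources for further study

  • The R Project for Statistical Computing: http://www.r-project.org
  • Quick R Homepage: http://www.statmethods.net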