Key Aspects of R Exercise
Revision as of 00:05, 11 May 2017
Aims
R is a powerful programming language and software environment for statistical computing and graphics. It is also open source, meaning that it is free to use and has a wide community of developers who have produced many scientific add-on packages and libraries that can be easily accessed.
We shall be using the basic R interpreter, and not RStudio, which in fact is the same interpreter but in a more visually appealing guise.
In this part you will learn to:
- use basic R commands
- find your way around.
We shall practice with the following toy file:
- genomes.csv: information about a number of animal genomes
Getting in and out
To start, type R and hit Enter:
R
With this command you move out of the Linux command line and into the R interpreter. You can tell by the prompt, which is now > and not $.
To quit R, use quit() or, shorter, q():
> q()
Finding one's location
OK, now go back in and check which working directory you are in, using the getwd() function:
getwd()
See what files and directories there are there:
dir()
The working directory can be changed using setwd(), which, like the Linux command line, supports tab completion. Type
setwd("i2<TAB>
This should complete to
setwd("i2rda_data
which you finish off with a ")
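The round trip between directories can be sketched as follows. This is only an illustration: it uses R's temporary directory, tempdir(), as a stand-in target (an assumption, since your own i2rda_data path may differ), and the variable names start_dir and previous_dir are made up for the example. It relies on setwd() invisibly returning the directory you just left.

```r
# Remember where we started
start_dir <- getwd()

# Move to R's per-session temporary directory (a stand-in target directory)
previous_dir <- setwd(tempdir())

# setwd() hands back the directory we just left
print(previous_dir)

# List the files and directories in the new working directory
print(dir())

# Go back to where we started and confirm
setwd(start_dir)
print(getwd())
```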
Calculator
R can be used as a glorified calculator:
5*5 + 10/2
[1] 30
R respects standard mathematical rules: multiplication and division are carried out before addition and subtraction. Operations grouped in round brackets are carried out first.
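As a small illustration of how round brackets change the order of evaluation, compare the calculation above with a bracketed variant. The variable names are chosen just for this example.

```r
# Multiplication and division are done before addition: 25 + 5
without_brackets <- 5 * 5 + 10 / 2
print(without_brackets)

# Round brackets force the addition to happen first: 5 * 15 / 2
with_brackets <- 5 * (5 + 10) / 2
print(with_brackets)
```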
Values can be assigned to a variable using the <- assignment operator:
a <- 5
Alternatively, it is also possible to write 5 -> a or a = 5, but a <- 5 is the traditional way. It can be good to use <- rather than = to keep reminding yourself that you are actually in the R environment.
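The three forms of assignment can be tried side by side. In this sketch the names b and c_val are hypothetical, introduced only so that the three results can be compared.

```r
a <- 5       # traditional left assignment
5 -> b       # right assignment
c_val = 5    # equals-sign assignment

# All three variables now hold the same value
print(c(a, b, c_val))
print(identical(a, b) && identical(b, c_val))
```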
To print out a simple variable, you can just type it.
a
[1] 5
Be aware that more complicated variables may throw out all their contents on the screen, so this simple method of seeing a variable is not always the best.
R Functions
R functions are invoked by their name, followed by parentheses that contain zero or more arguments.
All existing variables can be listed using the ls() function:
ls()
[1] "a"
Variables can be removed using the rm() function:
> rm(a)
If you have lots of variables you want to get rid of (e.g. because you are starting over):
rm(list=ls())
R contains extensive documentation for all its functions, which can be accessed using the ? operator or help() function:
?ls
As you can see, the function name is given without parentheses when asking for help. The documentation is also available online: http://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.
Data types
There are various data types in R. One of the most fundamental is the vector, which is an array or a list of numbers or words. That is a description, but in R it is best to use the term vector for these things. For example, a single number is seen by R as a vector of size 1.
R's data frames are also fundamental, though they are made up of vectors. They are used to store tables of data.
As analyses get more complicated, many packages assemble their own data types made up of multiple vectors and multiple data frames, often of different sizes. R often calls these objects, but they can also be called data structures, which is more descriptive, as they really are structures of data. By analogy, we could call vectors apartments, data frames apartment buildings, and data structures apartment building complexes.
Vectors
A vector is a sequence of data elements of the same basic type (e.g. numbers or character strings). The elements of a vector are called components.
Vectors can be created in several ways, e.g. by specifying a range or by using the c() or seq() function:
x <- 1:10
y <- c(1,1,2,3,5,8,13,21,34,55)
z <- seq(from = 3, to = 30, by = 9)
z
[1] 3 12 21 30
Note that the first, square-bracketed number in the output is not part of the vector, but the index of the first element shown on that line. This is useful when looking at large vectors or objects, as each new line will be preceded by the index of the first element on that line.
The components of a vector can be accessed through their indices in this manner:
z[2]
[1] 12
Multiple indices need to be wrapped in a c().
z[c(2,4)]
[1] 12 30
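The points above can be checked in one short session. This is a sketch only; the name single is made up here to show that a bare number is itself a vector of length 1.

```r
x <- 1:10
z <- seq(from = 3, to = 30, by = 9)

# A bare number is a vector of length 1
single <- 5
print(length(single))
print(is.vector(single))

# The range 1:10 has ten components
print(length(x))

# One index, then several indices wrapped in c()
second <- z[2]
second_and_fourth <- z[c(2, 4)]
print(second)
print(second_and_fourth)
```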
Data frames
Data frames are the most common data type in R. A data frame is basically a table made up of column vectors of equal length. The rows are usually thought of as the observations, and the columns as the measurements, i.e. the values of certain attributes of each observation. The top line, called the header, contains the column names, which can be simple indices or proper names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data.
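To make the "table of equal-length vectors" idea concrete, a tiny data frame can be built by hand with data.frame(). This is a sketch: the name genomes_small is hypothetical, and the values are the first three rows of the genomes.csv file used later in this exercise.

```r
# Each argument is one column vector; all have the same length (3)
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  class       = c("Amphibia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36L, 46L, 42L),
  size        = c(3.09, 0.4, 3.05),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat")
)

# Header, row names and data, printed as a table
print(genomes_small)

# Every column holds one value per row
column_lengths <- sapply(genomes_small, length)
print(column_lengths)
```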
Creating a data frame
In the following example you will create a data frame from a comma-separated (.csv) file. The genomes.csv file (which is in the directory ~/i2rda_data/06_R_Usage_Refresher) contains information about a number of animal genomes from the Animal Genome Size Database, i.e. the scientific (Latin) name of the species, the taxonomic class it belongs to, the number of chromosomes and its size:
,name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05
Chimpanzee,Pan troglodytes,Mammalia,48,3.76
Coelacanth,Latimeria chalumnae,Osteichthyes,48,3.61
etc
Note that the first line in the file contains the header names and that the lines containing the actual data all start with the common name of the species ("African clawed frog", "Atlantic cod" etc.)
Create a data frame named "genomes" from the file genomes.csv using read.csv():
genomes <- read.csv("genomes.csv", row.names = 1)
where:
- genomes.csv: the file from which the data frame is created
- row.names = 1: the contents of the first column in the genomes.csv file ("African clawed frog", "Atlantic cod" etc.) are used as row names
The function read.csv is a more specific version of the generic read.table function, with a number of default arguments set specifically for use with comma-separated files. There is also a version for use with tab-separated files, i.e. read.delim.
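The relationship between read.csv and read.table can be sketched without touching any file by passing the data as text. Here csv_text holds the first three data rows shown above, and the names via_csv and via_table are made up for the comparison. One assumption worth flagging: stringsAsFactors = TRUE is set explicitly, because from R 4.0.0 onwards text columns are read as character vectors by default rather than as factors.

```r
# The header line plus three data rows, as in genomes.csv
csv_text <- ",name,class,chromosomes,size
African clawed frog,Xenopus laevis,Amphibia,36,3.09
Atlantic cod,Gadus morhua,Osteichthyes,46,0.4
Brown rat,Rattus norvegicus,Mammalia,42,3.05"

# read.csv: header and comma separator are already the defaults
via_csv <- read.csv(text = csv_text, row.names = 1,
                    stringsAsFactors = TRUE)

# read.table: the same settings have to be spelled out
via_table <- read.table(text = csv_text, header = TRUE, sep = ",",
                        row.names = 1, stringsAsFactors = TRUE)

print(via_csv)
print(all.equal(via_csv, via_table))
```

If you are on a recent version of R and want the factor columns described in this exercise, adding stringsAsFactors = TRUE to the read.csv() call on genomes.csv should give the same kind of result.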
Display the internal structure of the genomes data frame using str():
str(genomes)
'data.frame': 20 obs. of 4 variables:
 $ name       : Factor w/ 20 levels "Aedes aegypti",..: 20 10 19 18 14 9 5 11 6 17 ...
 $ class      : Factor w/ 6 levels "Amphibia","Aves",..: 1 5 4 4 5 4 4 2 4 4 ...
 $ chromosomes: int 36 46 42 48 48 38 60 78 78 54 ...
 $ size       : num 3.09 0.4 3.05 3.76 3.61 2.91 3.7 1.25 2.8 3.06 ...
where:
- Factor: a vector whose elements can take on one of a specific set of values ("levels")
- int: integer (whole number)
- num: numeric (decimal number)
You will note that after the dollar sign, we get the name of the column or variable. With more complicated structures, these names are often not shown, but they can always be found with the names() function. Try
names(genomes)
and see if you get the variable names. Together, the str() and names() functions give you a way of "introspecting" your objects or data structures.
You retrieve the dimensions of the data frame using dim():
dim(genomes)
[1] 20 4
This means that the data frame consists of 20 rows (observations) and 4 columns (variables). Note that the header and the column with the row names are not included in this count.
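The introspection functions can be tried on a small hand-built data frame. This is a sketch using the hypothetical name genomes_small and three rows of the exercise data, so the dimensions here are 3 by 4 rather than 20 by 4.

```r
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  class       = c("Amphibia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36L, 46L, 42L),
  size        = c(3.09, 0.4, 3.05),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat")
)

# Rows first, then columns; row names and header are not counted
dims <- dim(genomes_small)
print(dims)

# The same two numbers, one at a time
print(nrow(genomes_small))
print(ncol(genomes_small))

# Column (variable) names and row names
print(names(genomes_small))
print(row.names(genomes_small))

# Compact summary of the structure
str(genomes_small)
```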
To print the whole data frame just type its name:
genomes
To print only the first or last rows use head() and tail(), respectively:
head(genomes)
Retrieving data
The column vectors can be retrieved in several ways, i.e. using double brackets [[]], the $ operator, or single brackets []:
genomes[[3]]
or
genomes[["chromosomes"]]
or
genomes$chromosomes
or
genomes[,3]
or
genomes[,"chromosomes"]
Note that when using single brackets, the column number (or name) is prepended with a comma character, which signals a wildcard match for the row position (meaning that we retrieve the value for the column indicated for every row).
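That these five forms return the same column vector can be checked directly. The sketch below uses a three-row stand-in called genomes_small (a hypothetical name) instead of the full genomes data frame.

```r
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  chromosomes_dummy = NA,  # placeholder so that chromosomes is column 3
  chromosomes = c(36L, 46L, 42L),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat")
)

by_pos_double  <- genomes_small[[3]]
by_name_double <- genomes_small[["chromosomes"]]
by_dollar      <- genomes_small$chromosomes
by_pos_single  <- genomes_small[, 3]
by_name_single <- genomes_small[, "chromosomes"]

# Each form yields the plain vector 36 46 42
print(by_dollar)
print(identical(by_pos_double, by_dollar))
```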
Column slices can be retrieved using single brackets []:
genomes[3]
            African clawed frog Atlantic cod Brown rat ...
chromosomes                  36           46        42
genomes["chromosomes"]
            African clawed frog Atlantic cod Brown rat ...
chromosomes                  36           46        42
genomes[c(3,4)]
            African clawed frog Atlantic cod Brown rat ...
chromosomes                  36           46        42
size                       3.09         0.40      3.05
As can row slices:
genomes[13,]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5
genomes["Human",]
              name    class chromosomes size
Human Homo sapiens Mammalia          46  3.5
And combinations of both:
genomes[13,3]
[1] 46
genomes["Human","chromosomes"]
[1] 46
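The difference between a slice and a plain vector is easiest to see by asking for the class of the result. In this sketch genomes_small is a hypothetical four-row stand-in for the full data frame, so Human sits in row 4 here rather than row 13.

```r
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua",
                  "Rattus norvegicus", "Homo sapiens"),
  class       = c("Amphibia", "Osteichthyes", "Mammalia", "Mammalia"),
  chromosomes = c(36L, 46L, 42L, 46L),
  size        = c(3.09, 0.4, 3.05, 3.5),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat", "Human")
)

# Single brackets without a comma: still a data frame, with one column
col_slice <- genomes_small["chromosomes"]
print(class(col_slice))

# Double brackets: the bare column vector
col_vector <- genomes_small[["chromosomes"]]
print(class(col_vector))

# A row slice is a one-row data frame
row_slice <- genomes_small["Human", ]
print(row_slice)

# Row and column together give a single value
cell <- genomes_small["Human", "chromosomes"]
print(cell)
```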
Subsets
Subsets of data can be selected using which():
genomes[which(genomes$class == "Mammalia"),]
We add another condition for closer focus:
genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46),]
and retrieve only the name column of the matching rows:
genomes[which(genomes$class == "Mammalia" & genomes$chromosomes > 46), "name"]
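The same selection can be written in a few equivalent ways. The sketch below uses a six-row stand-in (the hypothetical genomes_small, with values taken from the rows shown earlier) and compares which() with plain logical indexing and with the base function subset(). The name is_big_mammal is made up for the example.

```r
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus",
                  "Pan troglodytes", "Latimeria chalumnae", "Homo sapiens"),
  class       = c("Amphibia", "Osteichthyes", "Mammalia",
                  "Mammalia", "Osteichthyes", "Mammalia"),
  chromosomes = c(36L, 46L, 42L, 48L, 48L, 46L),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat",
                  "Chimpanzee", "Coelacanth", "Human"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row: mammals with more than 46 chromosomes
is_big_mammal <- genomes_small$class == "Mammalia" &
  genomes_small$chromosomes > 46

# which() turns the TRUE positions into row numbers
with_which <- genomes_small[which(is_big_mammal), "name"]

# The logical vector can also be used directly as the row index
with_logical <- genomes_small[is_big_mammal, "name"]

# subset() lets you refer to the columns by name
with_subset <- subset(genomes_small,
                      class == "Mammalia" & chromosomes > 46,
                      select = name)

print(with_which)
print(with_subset)
```

One practical difference: which() silently drops rows where the condition is NA, whereas plain logical indexing returns an NA row for them.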
Logical operators
- == : is equal to
- != : is not equal to
- > : is greater than
- < : is less than
- >= : is greater than or equal to
- <= : is less than or equal to
- & : and
- | : or
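These operators work element by element on vectors, which is what makes them useful for selecting rows. A short sketch, using the chromosome counts of the first five species in the file and result names invented for the example:

```r
chromosomes <- c(36, 46, 42, 48, 48)

# One TRUE/FALSE answer per element
equal_48    <- chromosomes == 48
not_48      <- chromosomes != 48

# & requires both conditions, | requires at least one
from_46_to_47  <- chromosomes >= 46 & chromosomes < 48
small_or_large <- chromosomes < 40 | chromosomes > 46

print(equal_48)
print(not_48)
print(from_46_to_47)
print(small_or_large)
```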
Plotting
Plots are an important part of statistics, so it is no surprise that R has many plotting facilities.
Let's plot for the 20 species in our genomes data frame the genome size against the number of chromosomes (i.e. the genome size on the y-axis and the number of chromosomes on the x-axis) using plot():
plot(genomes$chromosomes, genomes$size)
where:
- genomes$chromosomes: the x coordinates of points in the plot
- genomes$size: the y coordinates of points in the plot
You can make plots as elaborate as you want.
To add titles for the plot and labels for the axes as well as base the colour of the data points on the taxonomic class to which the species belongs:
plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)
where:
- main = allows you to write the overall title for the plot. Think of it as "main title"
- xlab = allows you to specify a title for the x-axis
- ylab = similarly for the y-axis
- col = genomes$class: variable on which the colour of the data points is based
To add labels to the 20 points in the plot, use text():
text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)
where:
- genomes$chromosomes, genomes$size: numeric vectors of coordinates where the text should be written
- row.names(genomes): character vector specifying the text to be written
- cex = 0.5: character size
- pos = 1: a position specifier for the text (1 means below)
To save the plot as a Portable Network Graphics (.png) file, first open a file for plotting using png():
png('size_vs_chrom.png')
Then run the commands to generate the plot:
plot(genomes$chromosomes, genomes$size, main = "Genome Size vs. Number of Chromosomes for 20 Animal Species", xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)", col = genomes$class)
text(genomes$chromosomes, genomes$size, row.names(genomes), cex = 0.5, pos = 1)
And finally close the file using dev.off():
dev.off()
This saves the file size_vs_chrom.png in the working directory (your home directory, ~).
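The whole open, plot, close sequence can be tried on a small stand-in data set. This is a sketch under a few assumptions: genomes_small and out_file are hypothetical names, the file is written to R's temporary directory rather than the working directory, and the class column is made a factor explicitly so that it can be used for col (a plain character column would not work as a colour specification). It also assumes your R installation has a working png() device.

```r
genomes_small <- data.frame(
  class       = factor(c("Amphibia", "Osteichthyes", "Mammalia", "Mammalia")),
  chromosomes = c(36, 46, 42, 48),
  size        = c(3.09, 0.4, 3.05, 3.76),
  row.names   = c("African clawed frog", "Atlantic cod",
                  "Brown rat", "Chimpanzee")
)

# Write into the session's temporary directory
out_file <- file.path(tempdir(), "size_vs_chrom.png")

# 1. Open the file as the current graphics device
png(out_file)

# 2. Draw: everything plotted now goes into the file, not the screen
plot(genomes_small$chromosomes, genomes_small$size,
     main = "Genome Size vs. Number of Chromosomes",
     xlab = "Number of Chromosomes", ylab = "Genome Size (Gb)",
     col = genomes_small$class)
text(genomes_small$chromosomes, genomes_small$size,
     row.names(genomes_small), cex = 0.5, pos = 1)

# 3. Close the device; only now is the file complete
dev.off()

print(file.exists(out_file))
```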
Writing to files
The content of a data frame can be written to a file using write.table()
:
write.table(genomes, "genomes.txt")
This saves the file genomes.txt in the working directory (~).
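A quick way to check what write.table() produced is to read the file straight back in. The sketch below does this with a three-row stand-in; genomes_small, out_file and read_back are hypothetical names, and the file goes to R's temporary directory instead of the working directory.

```r
genomes_small <- data.frame(
  name        = c("Xenopus laevis", "Gadus morhua", "Rattus norvegicus"),
  chromosomes = c(36L, 46L, 42L),
  size        = c(3.09, 0.4, 3.05),
  row.names   = c("African clawed frog", "Atlantic cod", "Brown rat")
)

out_file <- file.path(tempdir(), "genomes.txt")

# Defaults: space-separated, strings quoted, row and column names included
write.table(genomes_small, out_file)

# Look at the raw lines that were written
print(readLines(out_file))

# Read it back; the header has one field fewer than the data rows,
# so the first column is taken as the row names
read_back <- read.table(out_file, header = TRUE)
print(read_back)
```

For a comma-separated file that matches the format of genomes.csv, the companion function write.csv() can be used in the same way.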
Useful resources for further study
- The R Project for Statistical Computing: http://www.r-project.org
- Quick R Homepage: http://www.statmethods.net
- A (very) short introduction to R: http://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf
- An Introduction to R (long!): http://cran.r-project.org/doc/manuals/R-intro.html
- Bioconductor: http://www.bioconductor.org
- Stack Overflow (forum): http://stackoverflow.com
- Biostars (forum): http://www.biostars.org
- When instead of delighting, R disappoints: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf