The aim of this tutorial is to introduce you to the basics of coding within R. To do this, we will introduce you to the basic syntax, via the use of some simple commands such as how to read in data and plot.
In this session, you will:
The best thing to do is to read each section and type (or copy and paste) the R commands (in grey boxes) into your own R session to check that you understand the code. All the data to download is in the github repository (links to download any data files are given below when you are supposed to load them).
We recommend to use RStudio to manage your R-code (in a file called ‘file.R’, where you can replace ‘file’ with any name you like) separate to your executed code in the Console window. It also holds plots, the environment and history window. In this ‘file.R’ you can store the useful and checked code for future use. For example, you could start a new R file and store in it all the commands below.
Within R code, anything behind a #
is “commented” out
and won’t be executed. It is good coding practice to comment your code
as you go through.
# This is commented code.
You can use this in your R file to comment what the code is doing in your own words.
The first step when creating a new R file is to specify a working directory. This is the folder on your computer in which you are working. It is likely to be the folder in which the R code is saved (but it doesn’t have to be). Each computer will have a different specified address for the working directory. Upon opening R, you can find out “where you are” using
getwd() # Where am I now?
You need to tell the R session where you want it to work. To do this
you use the command to “set the working directory” or
setwd()
.
For example, this could be:
setwd("~/Documents/Modelling_course/")
You can also label this working directory as ‘home’, and have other locations stored for plotting or for getting data. For example
<- "~/Documents/Modelling_course/" ## on Mac OS or Linux
home setwd(home)
or
<- "C:\\Documents\\Modelling_course\\" ## on Windows
home setwd(home)
The following symbols represent
*
+
-
<-
or
=
To check if something is equal use ==
, or not equal
using !=
.
Below are examples of the main mathematical structures that we will use in R. Copy the below text into your R session and change the numbers so that you can follow what each command is doing.
The basic element is the vector:
<- c(1,2,3,4) # vector
v
v2*v # doubles every element in vector
You can also construct and manipulate matrices:
<- matrix(0,4,2) # matrix of zeros of size 4 x 2
m # let's look at m
m 1,] # first row
m[1] # first column
m[,1]<-c(2,3,9) # Error - what does it say?
m[,1]<-c(2,3) # What happens?
m[,# let's look again at m
m 1]<-c(2,3,9,12) # Assign column 1
m[,2]<-v # Assign column 2 to be the vector we had earlier m[,
Can you generate a matrix of size 10 x 5, filled with 2s except for the first column that has the sequence 1 to 10?
Simple commands such as ‘sum’ are already encoded in R.
<- sum(m)
total <- m / total
proportions_m sum(proportions_m) # Check = should be 1
Other types of allocations include:
<- 1990:1993
year <- c("influenza", "nCov19", "VZV", "HIV") names
To find help in R you can use several different commands. For example, if you wanted to know more about the “sum” command you would type:
help(sum)
?sum ??sum
The first two are equivalent, whilst the last will do a full text search of all R’s help files. This is most useful when you aren’t sure of what the exact function is called in R. In RStudio, there is also the option to search the help tab in the bottom right window. Google is also very useful for finding commands.
Can you use ??
to find out how you would generate
binomial random numbers in R?
As R is open source, it is being developed and added to all the time. It is likely that you may need commands that are not in the standard R programme that you have installed.
To get access to these other functions, you need to install ‘packages’. These contain ‘libraries’ of functions, already coded into R, that can perform new or different functions.
There are two ways to do this: 1) In RStudio, go to the “Packages” tab and search for the required package and load it. 2) Type
install.packages(package_name)
replacing “package_name” with the name of the package you would like
to install. For example, can you install the packages
deSolve
and curl
? We will use these packages
to download data and solve ordinary differential equations later on in
the tutorials.
Once installed you then have to load the package at the start of the R file. This is done using
library(curl) # load the curl library of function (to import data)
library(deSolve) # load the deSolve library of functions (to solve ODE's)
In order to read in data, the first thing to do is to check that you
are pointing R at the right folder by setting the appropriate working
directory. To do this make sure you
setwd(where_the_data_is)
, to point R to wherever you have
decided to store your files.
For this tutorial, you will first need to download the required data from (in CSV format) from here. Save it in the correct folder and make this your current working directory.
You can do it automatically via R using the ‘curl’ package. The following command will download the data and store it in your current directory under ‘data.csv’
curl_download("https://github.com/lwillem/modelling-intro/raw/master/data/data.csv",destfile = "data.csv")
There are many ways to read in data. Here, as we have a ‘.csv’ file we could use either of:
<- read.table("data.csv", header=TRUE, sep=",") # Comma separated...
mydata <- read.csv("data.csv") # R doesn't like XLS files... likes csv files... (flu in hammersmith hospital) flu_hh
Here both mydata
and flu_hh
are the same
data. Can you check this in R?
The data is read in in the form of a data.frame
. This is
a type of data format in R that allows the inclusion of both numbers and
strings (letters). There is more information on reading in data here.
There are several useful tools to look at the data we have read in. Copy the below code into your R session and check that you understand all the commands.
# All the data - careful if big file!
flu_hh dim(flu_hh) # how big is it
head(flu_hh) # Top of the data
tail(flu_hh) # Bottom of the data
1] # First column
flu_hh[,$num_inf # or by name (as data.frame). The $ allows you to refer to a specific column.
flu_hh1,] # First row
flu_hh[colnames(flu_hh) # Column names
<- flu_hh[,1] # Grab the times in the data
times # All the times
times 2] # The second time times[
How would you go about finding the maximum time in this data? At what
time is the number of cases at its maximum? See if you can write code to
do this yourself. To do this you may need to use the function
which
that searches for the location of matching values.
The solution to this task is below.
# Is there a command called maximum?
?maximum # what does this give you? ??maximum
<-max(times) # Maximum time
max_time# At what time is the number maximum?
<-max(flu_hh[,"num_inf"])
max_num<-which(flu_hh[,"num_inf"]==max_num) # very useful function which! searches through for matching entries and gives index back
w<- flu_hh[w,"time"]
time_max_prop # OR
<- flu_hh[which(flu_hh[,"num_inf"]==max_num),"time"] time_max_prop
To add another column to a data.frame
you simply refer
to the name of this new column and assign it to have certain values. For
example, to add in a column of percentages:
<- sum(flu_hh[,"num_inf"])
sum $perc <- 100 * flu_hh$num_inf / sum
flu_hhsum(flu_hh$perc) # check = 100
R has inbuilt useful functions for creating plots. Again, copy and
paste the below and check the differences in the output. The command
plot
creates a new plot window which can then be added to
using lines
or points
. By default
plot
will create a plot with points, to alter this use the
type="l"
command as shown below.
plot(flu_hh[,1],flu_hh[,2]) # Points
plot(flu_hh[,"time"],flu_hh[,"num_inf"]) # Points
lines(flu_hh[,"time"],flu_hh[,"num_inf"]) # add lines to the same plot (and vice versa for points)
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l") # Lines
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red") # Red line
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number") # Label axis
You can also alter the size of the axis to focus on certain areas.
For example, use the max_prop
from above to focus on the
increase in the outbreak only.
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number",xlim=c(0,time_max_prop))
In order to save plots, again you need to check you are in the correct working directory. Then open a file and insert the plot like this:
jpeg("flu_hh.jpg") # or pdf etc.
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number") # what do you want to go into the file
dev.off() # magic plot command - if things break do this until error
You can also plot multiple plots together. For example, using the
par
command, you can divide the window and then place the
plots within it.
pdf("all_plot.pdf",height=8, width=8)
par(mfrow=c(2,2)) # if want to plot several plots on to one window in rows x column formation e.g. here 2 x 2
plot(flu_hh[,1],flu_hh[,2]) # Points
plot(flu_hh[,"time"],flu_hh[,"num_inf"]) # Points
lines(flu_hh[,"time"],flu_hh[,"num_inf"]) # add lines to the same plot (and vice versa for points)
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red") # Red line
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number") # Label axis
dev.off() # stops the par command - one window per plot
In order to output data, there is a write equivalent to read (again make sure you are in the appropriate working directory):
write.table(flu_hh, "flu_hh.txt", sep="\t") # Save as tab delimited
write.csv(flu_hh,"flu_hh.csv") # or as csv
A quick note here on “not a numbers”. In R these are NA
or NaN
. This is usually an error or the result of a silly
division.
0/0
1/0 # Infinity = Inf
Conditionals are built in functions that allow you to compare and
contrast elements in R. For example, the which
function
gives you the index of values that match some condition:
which(flu_hh$time > 60) # index of values that match some condition
which(flu_hh > 60) # careful if matrix
19,] # Problem
flu_hh[which(flu_hh == 50, arr.ind=TRUE) # if want row and column
which(flu_hh > 60, arr.ind=TRUE) # if want row and column
if
statements allow you to compare elements
if( 2 > 3 ){print("World gone mad")} # Use for error checking. Print does exactly that - gives output into console
or check that data had been read in OK:
if( any(flu_hh$num_inf) < 0){print("Error: data has negative numbers")}
or check that what you are calling a proportion sums to 1:
$prop<-flu_hh$num_inf/sum(flu_hh$num_inf)
flu_hhif( sum(flu_hh$prop) != 1){print("Error: proportion sums to greater than 1")} # != means "does not equal"
if
statements can be combined with else statements.
These can be simple:
<- -1
x sqrt(ifelse(x >= 0, x, NA)) # Only take the square root if x is positive
<- ifelse( mean(flu_hh$prop) > 0.5, max(flu_hh$prop), 0) # only assign a maximum to x if it is bigger than 0.5 x
or more complex:
if(max(flu_hh$time) < 30){
<- mean(flu_hh$num_inf)
mean_flu_hh print("< 30")
else{w<-which(flu_hh$time < 30) # Which are those < 30 days, most recent data perhaps
}<- mean(flu_hh[w,"num_inf"]) # mean of only those values
mean_flu_hh print("> 30")
}
Some useful commands in R to help with checking code such as the if/else statement above are:
INDENT: “Command + I” on OS X, “Ctrl + I” on Windows. Highlight all code and hit enter. Helps find mistakes and lays out code nicely
COMMENT: “Command + C” on OS X, “Ctrl + Shift + C”. As above but makes all code selected commented. Useful if you want to comment out a load of code and run without it temporarily.
For those of you who have coded before, you will know the importance
of for
loops. For those who are new to coding, these
structures are very useful in allowing you to repeat a set of commands
multiple times to different elements.
For example, you can use for
loops to generate the two
times table
for (i in 1:10){
print(2*i)
}
Or to fill a vector with the first 50 Fibonacci numbers:
<- 50 # a number that we can then change easily that is used several times later.
nf # Tip: easier to put in one place than change multiple
<- matrix(0,1,nf) # Empty vector to fill
fibo 1] <- 0
fibo[2] <- 1
fibo[for(i in 3:nf){
<-fibo[(i-2)] + fibo[i-1]
fibo[i]<- i / 2 # what will happen to mm over time? be careful not to overwrite things
mm
}plot(seq(1:nf),fibo,type="l")
In the above example, we built an empty vector to fill within the
for
loop. However, we may not know the end size of the
resulting vector and so in this case can instead concatenate:
<- c(0,1)
fibo for(i in 3:nf){
<- c(fibo, fibo[(i-2)] + fibo[i-1])
fibo print(length(fibo))
}
Functions are useful ways of packaging up pieces of code. If you are going to do a set of commands multiple times then it can be simpler and cleaner to place them in a function that has been thoroughly checked and that you have not altered subsequently.
To build a function you use the syntax
<- function(inputs){ code } name_of_function
Within the ‘code’ section you need to include the main coding information, which uses the inputs and also ‘returns’ an output.
For example, if I want to build my own function for calculating the mean to a power X, I might do:
<- function(vector, times){
calculate_mean_power <- sum(vector)
sum_vector <- length(vector)
length_vector <- sum_vector / length_vector
my_mean <- my_mean ^ times
my_mean_power return(list(mean = my_mean, mean_power = my_mean_power))
}
To run the function you need the function name and the correct inputs:
calculate_mean_power( c(1,2,3,43) , 3) # prints output
<- calculate_mean_power(c(1,2,3,43), 3)
my_mean my_mean
You can store all the functions that you have in a separate R-file
and load them into your current R file using the command
source
.
Can you understand what the following function does for a matrix containing the ward name, the reporting time (in days) and the number of infected patients:
<- function(matrix){
find_day_many <- unique(matrix$ward) # What wards are there?
u <- c() # store in here
days for(i in 1:length(u)){
<- which(matrix$ward == u[i]) # find which rows of the data are for this ward
w <- which(matrix[w,"number"] > 1) # find rows for this ward with more than 1 infected patient
r <- c(days,matrix[w[r],"time"]) # add days when there are 2 or more infected
days
}return(days)
}
To test the function, a file ward_data.csv
can be found
here
or loaded using the following code:
#download the ward data
curl_download("https://github.com/lwillem/modelling-intro/raw/master/data/ward_data.csv", destfile = "ward_data.csv")
Test the function above with the ward data. Does it give you the output you expected? You will need to use some of the commands for looking at data that were introduced above.
<- read.csv("ward_data.csv")
ward_data <- find_day_many(ward_data)
days
# get overview in table format
table(days)
ward_data
and plot the result.