Home page

Objectives

The aim of this tutorial is to introduce you to the basics of coding within R. To do this, we will introduce you to the basic syntax, via the use of some simple commands such as how to read in data and plot.

In this session, you will:

  1. familiarise yourself with the basic syntax of R
  2. learn how to create basic plots in R
  3. learn how you can use conditionals and for-loops
  4. learn how to write your own function

The best thing to do is to read each section and type (or copy and paste) the R commands (in grey boxes) into your own R session to check that you understand the code. All the data to download is in the github repository (links to download any data files are given below when you are supposed to load them).

We recommend to use RStudio to manage your R-code (in a file called ‘file.R’, where you can replace ‘file’ with any name you like) separate to your executed code in the Console window. It also holds plots, the environment and history window. In this ‘file.R’ you can store the useful and checked code for future use. For example, you could start a new R file and store in it all the commands below.

Basic syntax

Commenting code

Within R code, anything behind a # is “commented” out and won’t be executed. It is good coding practice to comment your code as you go through.

# This is commented code. 

You can use this in your R file to comment what the code is doing in your own words.

Working directory

The first step when creating a new R file is to specify a working directory. This is the folder on your computer in which you are working. It is likely to be the folder in which the R code is saved (but it doesn’t have to be). Each computer will have a different specified address for the working directory. Upon opening R, you can find out “where you are” using

getwd() # Where am I now? 

You need to tell the R session where you want it to work. To do this you use the command to “set the working directory” or setwd().

For example, this could be:

setwd("~/Documents/Modelling_course/")

You can also label this working directory as ‘home’, and have other locations stored for plotting or for getting data. For example

home <- "~/Documents/Modelling_course/" ## on Mac OS or Linux
setwd(home)

or

home <- "C:\\Documents\\Modelling_course\\" ## on Windows
setwd(home)

Mathematical elements in R

The following symbols represent

  • multiplication: *
  • addition: +
  • subtraction: -
  • assignment of a name to an object: <- or =

To check if something is equal use ==, or not equal using !=.

Below are examples of the main mathematical structures that we will use in R. Copy the below text into your R session and change the numbers so that you can follow what each command is doing.

The basic element is the vector:

v <- c(1,2,3,4) # vector
v
2*v # doubles every element in vector

You can also construct and manipulate matrices:

m <- matrix(0,4,2) # matrix of zeros of size 4 x 2
m # let's look at m
m[1,] # first row
m[,1] # first column
m[,1]<-c(2,3,9) # Error - what does it say?
m[,1]<-c(2,3) # What happens? 
m # let's look again at m
m[,1]<-c(2,3,9,12) # Assign column 1
m[,2]<-v # Assign column 2 to be the vector we had earlier

Can you generate a matrix of size 10 x 5, filled with 2s except for the first column that has the sequence 1 to 10?

Simple commands such as ‘sum’ are already encoded in R.

total <- sum(m)
proportions_m <- m / total
sum(proportions_m) # Check = should be 1

Other types of allocations include:

year <- 1990:1993
names <- c("influenza", "nCov19", "VZV", "HIV")

Help files

To find help in R you can use several different commands. For example, if you wanted to know more about the “sum” command you would type:

help(sum)
?sum
??sum

The first two are equivalent, whilst the last will do a full text search of all R’s help files. This is most useful when you aren’t sure of what the exact function is called in R. In RStudio, there is also the option to search the help tab in the bottom right window. Google is also very useful for finding commands.

Can you use ?? to find out how you would generate binomial random numbers in R?

Packages and libraries

As R is open source, it is being developed and added to all the time. It is likely that you may need commands that are not in the standard R programme that you have installed.

To get access to these other functions, you need to install ‘packages’. These contain ‘libraries’ of functions, already coded into R, that can perform new or different functions.

There are two ways to do this: 1) In RStudio, go to the “Packages” tab and search for the required package and load it. 2) Type

install.packages(package_name)

replacing “package_name” with the name of the package you would like to install. For example, can you install the packages deSolve and curl? We will use these packages to download data and solve ordinary differential equations later on in the tutorials.

Once installed you then have to load the package at the start of the R file. This is done using

library(curl) # load the curl library of function (to import data)
library(deSolve) # load the deSolve library of functions (to solve ODE's)

Reading in data

In order to read in data, the first thing to do is to check that you are pointing R at the right folder by setting the appropriate working directory. To do this make sure you setwd(where_the_data_is), to point R to wherever you have decided to store your files.

For this tutorial, you will first need to download the required data from (in CSV format) from here. Save it in the correct folder and make this your current working directory.

You can do it automatically via R using the ‘curl’ package. The following command will download the data and store it in your current directory under ‘data.csv’

curl_download("https://github.com/lwillem/modelling-intro/raw/master/data/data.csv",destfile = "data.csv")

There are many ways to read in data. Here, as we have a ‘.csv’ file we could use either of:

mydata <- read.table("data.csv", header=TRUE, sep=",") # Comma separated... 
flu_hh <- read.csv("data.csv") # R doesn't like XLS files... likes csv files... (flu in hammersmith hospital)

Here both mydata and flu_hh are the same data. Can you check this in R?

The data is read in in the form of a data.frame. This is a type of data format in R that allows the inclusion of both numbers and strings (letters). There is more information on reading in data here.

There are several useful tools to look at the data we have read in. Copy the below code into your R session and check that you understand all the commands.

flu_hh        # All the data - careful if big file!
dim(flu_hh)   # how big is it
head(flu_hh)  # Top of the data
tail(flu_hh)  # Bottom of the data
flu_hh[,1]    # First column
flu_hh$num_inf # or by name (as data.frame). The $ allows you to refer to a specific column. 
flu_hh[1,]    # First row
colnames(flu_hh)   # Column names
times <- flu_hh[,1] # Grab the times in the data
times         # All the times
times[2]      # The second time

How would you go about finding the maximum time in this data? At what time is the number of cases at its maximum? See if you can write code to do this yourself. To do this you may need to use the function which that searches for the location of matching values. The solution to this task is below.

?maximum # Is there a command called maximum?
??maximum # what does this give you?
max_time<-max(times) # Maximum time
# At what time is the number maximum?
max_num<-max(flu_hh[,"num_inf"])
w<-which(flu_hh[,"num_inf"]==max_num) # very useful function which! searches through for matching entries and gives index back
time_max_prop <- flu_hh[w,"time"]
# OR
time_max_prop <- flu_hh[which(flu_hh[,"num_inf"]==max_num),"time"]

To add another column to a data.frame you simply refer to the name of this new column and assign it to have certain values. For example, to add in a column of percentages:

sum <- sum(flu_hh[,"num_inf"])
flu_hh$perc <- 100 * flu_hh$num_inf / sum
sum(flu_hh$perc) # check = 100

Basic plotting

R has inbuilt useful functions for creating plots. Again, copy and paste the below and check the differences in the output. The command plot creates a new plot window which can then be added to using lines or points. By default plot will create a plot with points, to alter this use the type="l" command as shown below.

plot(flu_hh[,1],flu_hh[,2]) # Points

plot(flu_hh[,"time"],flu_hh[,"num_inf"]) # Points
lines(flu_hh[,"time"],flu_hh[,"num_inf"]) # add lines to the same plot (and vice versa for points)

plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l") # Lines

plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red") # Red line

plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number") # Label axis

You can also alter the size of the axis to focus on certain areas. For example, use the max_prop from above to focus on the increase in the outbreak only.

plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number",xlim=c(0,time_max_prop)) 

In order to save plots, again you need to check you are in the correct working directory. Then open a file and insert the plot like this:

jpeg("flu_hh.jpg") # or pdf etc.
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number") # what do you want to go into the file
dev.off() # magic plot command - if things break do this until error

You can also plot multiple plots together. For example, using the par command, you can divide the window and then place the plots within it.

pdf("all_plot.pdf",height=8, width=8)
par(mfrow=c(2,2)) # if want to plot several plots on to one window in rows x column formation e.g. here 2 x 2
plot(flu_hh[,1],flu_hh[,2]) # Points
plot(flu_hh[,"time"],flu_hh[,"num_inf"]) # Points
lines(flu_hh[,"time"],flu_hh[,"num_inf"]) # add lines to the same plot (and vice versa for points)
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red") # Red line
plot(flu_hh[,"time"],flu_hh[,"num_inf"],type="l",col="red",xlab="Time",ylab="Number") # Label axis
dev.off() # stops the par command - one window per plot

In order to output data, there is a write equivalent to read (again make sure you are in the appropriate working directory):

write.table(flu_hh, "flu_hh.txt", sep="\t")  # Save as tab delimited 
write.csv(flu_hh,"flu_hh.csv") # or as csv

A quick note here on “not a numbers”. In R these are NA or NaN. This is usually an error or the result of a silly division.

0/0
1/0 # Infinity = Inf

Conditionals

Conditionals are built in functions that allow you to compare and contrast elements in R. For example, the which function gives you the index of values that match some condition:

which(flu_hh$time > 60) # index of values that match some condition
which(flu_hh > 60) # careful if matrix
flu_hh[19,] # Problem
which(flu_hh == 50, arr.ind=TRUE) # if want row and column
which(flu_hh > 60, arr.ind=TRUE) # if want row and column

if statements allow you to compare elements

if( 2 > 3 ){print("World gone mad")} # Use for error checking. Print does exactly that - gives output into console

or check that data had been read in OK:

if( any(flu_hh$num_inf) < 0){print("Error: data has negative numbers")}

or check that what you are calling a proportion sums to 1:

flu_hh$prop<-flu_hh$num_inf/sum(flu_hh$num_inf)
if( sum(flu_hh$prop) != 1){print("Error: proportion sums to greater than 1")} # != means "does not equal"

if statements can be combined with else statements. These can be simple:

x <- -1
sqrt(ifelse(x >= 0, x, NA))  # Only take the square root if x is positive
x <- ifelse( mean(flu_hh$prop) > 0.5, max(flu_hh$prop), 0) # only assign a maximum to x if it is bigger than 0.5

or more complex:

if(max(flu_hh$time) < 30){ 
  mean_flu_hh <- mean(flu_hh$num_inf)
  print("< 30")
}else{w<-which(flu_hh$time < 30) # Which are those < 30 days, most recent data perhaps
      mean_flu_hh <- mean(flu_hh[w,"num_inf"]) # mean of only those values
      print("> 30")
}

Some useful commands in R to help with checking code such as the if/else statement above are:

INDENT: “Command + I” on OS X, “Ctrl + I” on Windows. Highlight all code and hit enter. Helps find mistakes and lays out code nicely

COMMENT: “Command + C” on OS X, “Ctrl + Shift + C”. As above but makes all code selected commented. Useful if you want to comment out a load of code and run without it temporarily.

For loops

For those of you who have coded before, you will know the importance of for loops. For those who are new to coding, these structures are very useful in allowing you to repeat a set of commands multiple times to different elements.

For example, you can use for loops to generate the two times table

for (i in 1:10){ 
  print(2*i)
}

Or to fill a vector with the first 50 Fibonacci numbers:

nf <- 50 # a number that we can then change easily that is used several times later. 
# Tip: easier to put in one place than change multiple
fibo <- matrix(0,1,nf) # Empty vector to fill
fibo[1] <- 0
fibo[2] <- 1
for(i in 3:nf){
  fibo[i]<-fibo[(i-2)] + fibo[i-1]
  mm <- i / 2 # what will happen to mm over time? be careful not to overwrite things
}
plot(seq(1:nf),fibo,type="l")

In the above example, we built an empty vector to fill within the for loop. However, we may not know the end size of the resulting vector and so in this case can instead concatenate:

fibo <- c(0,1)
for(i in 3:nf){
  fibo <- c(fibo, fibo[(i-2)] + fibo[i-1])
  print(length(fibo))
}

Writing your own functions

Functions are useful ways of packaging up pieces of code. If you are going to do a set of commands multiple times then it can be simpler and cleaner to place them in a function that has been thoroughly checked and that you have not altered subsequently.

To build a function you use the syntax

name_of_function <- function(inputs){ code }

Within the ‘code’ section you need to include the main coding information, which uses the inputs and also ‘returns’ an output.

For example, if I want to build my own function for calculating the mean to a power X, I might do:

calculate_mean_power <- function(vector, times){
  sum_vector       <- sum(vector)
  length_vector    <- length(vector)
  my_mean          <- sum_vector / length_vector 
  my_mean_power    <- my_mean ^ times
  return(list(mean = my_mean, mean_power = my_mean_power))
}

To run the function you need the function name and the correct inputs:

calculate_mean_power( c(1,2,3,43) , 3) # prints output
my_mean <- calculate_mean_power(c(1,2,3,43), 3)
my_mean

You can store all the functions that you have in a separate R-file and load them into your current R file using the command source.

Can you understand what the following function does for a matrix containing the ward name, the reporting time (in days) and the number of infected patients:

find_day_many <- function(matrix){
  u <- unique(matrix$ward) # What wards are there?
  days <- c() # store in here
  for(i in 1:length(u)){
    w <- which(matrix$ward == u[i]) # find which rows of the data are for this ward
    r <- which(matrix[w,"number"] > 1) # find rows for this ward with more than 1 infected patient
    days <- c(days,matrix[w[r],"time"]) # add days when there are 2 or more infected 
  }
  return(days)
}

To test the function, a file ward_data.csv can be found here or loaded using the following code:

#download the ward data
curl_download("https://github.com/lwillem/modelling-intro/raw/master/data/ward_data.csv", destfile = "ward_data.csv")

Test the function above with the ward data. Does it give you the output you expected? You will need to use some of the commands for looking at data that were introduced above.

ward_data <- read.csv("ward_data.csv")
days <- find_day_many(ward_data)

# get overview in table format
table(days)

Assignment