Week5 Data Analysis

In this session we will discuss how to create data frames (objects similar to matrices which are typically used to store data) manually (explaining how this differs from a matrix), methods of extracting and manipulating data effectively and, finally, use of the apply() function(s).

5.1 Creating a data frame

We have already been introduced to the idea of a data frame in the previous sessions through the `mtcars’ data set:

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

However, in general, a data frame is simply an array of data, where each row denotes a separate data point or observation and each column a different variable. Why are these different from matrices in R?

The main difference between a ‘matrix’ and a ‘dataframe’ in R is what can be stored within them. We have already stated that a matrix (like a vector) can consist of numerical values OR Boolean values OR character strings. However, what I have not yet mentioned is they cannot be a mixture of these things:

V1 <- c(1, 2, 3)
str(V1)
##  num [1:3] 1 2 3
V2 <- c("Hello", "Red", "UK")
str(V2)
##  chr [1:3] "Hello" "Red" "UK"
V3 <- c(T, T, F)
str(V3)
##  logi [1:3] TRUE TRUE FALSE
(V <- c(V1, V2, V3))
## [1] "1"     "2"     "3"     "Hello" "Red"   "UK"    "TRUE"  "TRUE"  "FALSE"
str(V)
##  chr [1:9] "1" "2" "3" "Hello" "Red" "UK" "TRUE" "TRUE" "FALSE"
(M <- matrix(c(V1, V2, V3), nrow = 3, ncol = 3, byrow = FALSE))
##      [,1] [,2]    [,3]   
## [1,] "1"  "Hello" "TRUE" 
## [2,] "2"  "Red"   "TRUE" 
## [3,] "3"  "UK"    "FALSE"
str(M)
##  chr [1:3, 1:3] "1" "2" "3" "Hello" "Red" "UK" "TRUE" "TRUE" "FALSE"

As you can see in the above, since a matrix can only consists of objects of the same type, R has automatically re-assigned the objects to match with one of the object types. Again, this is a perfect example of why you need to be so careful when programming as your code may appear to work but the reality is quite different.

Data frames on the other hand, allow us to create a matrix like structure but each column may take on a different variable format. This is ideal when working and storing data. Let us look at how to create a data frame using the above vectors V1, V2 and V3:

data <- data.frame(Numeric = V1, Characters = V2, Boolean = V3)
knitr::kable(data, align = "ccc")
Numeric Characters Boolean
1 Hello TRUE
2 Red TRUE
3 UK FALSE
str(data)
## 'data.frame':    3 obs. of  3 variables:
##  $ Numeric   : num  1 2 3
##  $ Characters: chr  "Hello" "Red" "UK"
##  $ Boolean   : logi  TRUE TRUE FALSE

What do you notice about the character strings in this data frame? It is important to decide if any ‘words’ or ‘character strings’ in your data set are simply words or if they resemble different factors of a given group. In the latter case, you need to tell R this! This is very important when applying a variety of statistical functions to your data.

data1 <- data.frame(Numeric = V1, Characters = V2, Boolean = V3, stringsAsFactors = F)
str(data1)
## 'data.frame':    3 obs. of  3 variables:
##  $ Numeric   : num  1 2 3
##  $ Characters: chr  "Hello" "Red" "UK"
##  $ Boolean   : logi  TRUE TRUE FALSE
data2 <- data.frame(Numeric = V1, Characters = V2, Boolean = V3, stringsAsFactors = T)
str(data2)
## 'data.frame':    3 obs. of  3 variables:
##  $ Numeric   : num  1 2 3
##  $ Characters: Factor w/ 3 levels "Hello","Red",..: 1 2 3
##  $ Boolean   : logi  TRUE TRUE FALSE
rownames(data1) <- c("Observation 1", "Observation 2", "Observation 3")
knitr::kable(data1, align = "ccc")
Numeric Characters Boolean
Observation 1 1 Hello TRUE
Observation 2 2 Red TRUE
Observation 3 3 UK FALSE

Now that we have created our data frame, we can begin to analyse the data in any way we choose! We will discuss a little later on more complex ways in which we may want to do this. For now, let us look at one more example to make sure we understand how data frames are created:

Height <- rgamma(20, shape = 70, rate = 0.4) # Simulates/generates Gamma Distributed variables
head(Height)
## [1] 170.9156 181.1166 168.5910 167.8381 182.4901 167.1970
Weight <- rnorm(20, mean = 75, sd = 10)
head(Weight)
## [1] 81.09672 79.93742 80.71258 93.52026 89.88747 75.15317
Age <- rpois(20, lambda = 20) # Simulates/generates Poisson Distributed variables
Age
##  [1] 26 25 24 21 13 18 16 18 17 16 16 25 23 25 11 18  9 25 16 32
Sex<- ifelse(rbinom(20,1,prob=0.5) == 1, "Male", "Female") # How has this worked?

Data <- data.frame(Height_cm = Height, Weight_Kg = Weight, Age_Years = Age, Sex = Sex, stringsAsFactors = T)
knitr::kable(Data, align = "cccc")
Height_cm Weight_Kg Age_Years Sex
170.9156 81.09672 26 Male
181.1166 79.93742 25 Female
168.5910 80.71258 24 Male
167.8381 93.52026 21 Male
182.4901 89.88747 13 Female
167.1970 75.15317 18 Female
179.8358 66.55078 16 Female
223.4046 66.04828 18 Female
145.5981 64.27430 17 Male
161.1306 57.32481 16 Female
168.2054 69.40757 16 Male
202.3597 70.67010 25 Female
179.2911 65.57485 23 Female
185.9143 77.64834 25 Female
157.8038 71.06964 11 Male
184.9215 67.06691 18 Female
209.7603 79.70949 9 Female
198.0268 76.11750 25 Male
158.8328 91.74831 16 Female
172.3133 68.90627 32 Male
str(Data)
## 'data.frame':    20 obs. of  4 variables:
##  $ Height_cm: num  171 181 169 168 182 ...
##  $ Weight_Kg: num  81.1 79.9 80.7 93.5 89.9 ...
##  $ Age_Years: int  26 25 24 21 13 18 16 18 17 16 ...
##  $ Sex      : Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1 1 1 2 1 ...
mean(Data$Height_cm)
## [1] 178.2773
mean(Data$Height_cm > 170)
## [1] 0.6

Another helpful tool worth mentioning here when analysing data in a data frame is the table() function, which counts the frequency of different observations and displays them in a table format:

table(Data$Age_Years)
## 
##  9 11 13 16 17 18 21 23 24 25 26 32 
##  1  1  1  4  1  3  1  1  1  4  1  1
table(Data$Age_Years,Data$Sex)
##     
##      Female Male
##   9       1    0
##   11      0    1
##   13      1    0
##   16      3    1
##   17      0    1
##   18      3    0
##   21      0    1
##   23      1    0
##   24      0    1
##   25      3    1
##   26      0    1
##   32      0    1

5.2 Importing data - Excel

In most cases when working with data, the observations will have been collected and stored in another programme which is better built for data collection, e.g. Excel. Fortunately, R can import such data easily and will save the imported values into a data frame type object (known as a tibble) automatically. It is also possible to save it directly as a data frame explicitly if preferred. At this stage, the differences between tibble and a data frame are not important.

To do this, use the following steps:

  1. Click the Import Dataset option within the Environment window (Top right)
  2. Click ‘From Excel…’ (You may be asked to install some packages here, if so press ‘Yes’ or ‘Okay’)
  3. Click `Browse’ to enter your files
  4. Choose the Excel file containing the Data
  5. Edit the Dataset name (if necessary)
  6. Choose which sheet you want to import from the Excel file (if necessary)
  7. Decide if you need to skip any rows due to format
  8. Tick first row as names if appropriate
  9. Press import
  10. Re-save as a data frame (if necessary)

In this example I will import four different data sets (Male and Female Deaths and Population in UK) from the ‘UK(Pop&Death).xls’ data set and name them 1) Male_UK_Death, 2) Female_UK_Death, 3) Male_UK_Pop and 4) Female_UK_Pop, respectively.

Clearly I cannot show you the above steps explicitly in these notes, but they are the steps that have been used to load the data here. Below is an example of what the corresponding code that is executed looks like when following these steps (you may find that your directory path where you have saved the data is different but that should be it):

library(readxl)
Male_UK_Death <- read_excel("UK(Pop&Death).xls", sheet = "UK male deaths")
Female_UK_Death <- read_excel("UK(Pop&Death).xls", sheet = "UK female deaths")
Male_UK_Pop <- read_excel("UK(Pop&Death).xls", sheet = "UK male pop")
Female_UK_Pop <- read_excel("UK(Pop&Death).xls", sheet = "UK female pop")

Note: It is also possible to import data from a number of other sources (SPSS, Stata, SAS, .csv files etc.)

5.3 Manipulating and analysing data

In many cases, the data set that we have imported may not be exactly how we want it for our analysis. There are a huge number of things that you may want/need to do to your data to tidy it up or ‘clean’ the data as it is commonly known. For example, deal with missing data, extract unimportant rows/columns, delete outliers, combine data sets etc.

As an example, let us have a little look at the data sets we created:

We would like to remove the last two rows of this data set as they don’t contain ‘raw’ data.

Male_UK_Death_New <- Male_UK_Death[-c(107, 108),]
datatable(Male_UK_Death_New, options=list(scrollX = TRUE))

Then, in a similar way, we can do the same with the rest of the data sets as they have same problem

Female_UK_Death_New <- Female_UK_Death[-c(107, 108),]
Male_UK_Pop_New <- Male_UK_Death[-c(107, 108),]
Female_UK_Pop_New <- Female_UK_Pop[-c(107, 108),]

We have now tidied up our data to be in a format more beneficial to us and we could start to analyse these individually, i.e. Create plots, calculate statistics, fit statistical models etc. For example:

Aux <- Male_UK_Death_New[-106,]
summary(lm(Aux$Age ~ Aux$`1961`))
## 
## Call:
## lm(formula = Aux$Age ~ Aux$`1961`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -81.594 -17.102  -2.754   6.391  62.101 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.189e+01  3.686e+00  11.367  < 2e-16 ***
## Aux$`1961`  3.296e-03  7.947e-04   4.147 6.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.33 on 103 degrees of freedom
## Multiple R-squared:  0.1431, Adjusted R-squared:  0.1348 
## F-statistic:  17.2 on 1 and 103 DF,  p-value: 6.937e-05

However, what if we were not interested in the individual data sets separated by sex and only cared about the overall deaths and population. That is, we want to combine the data sets. In this case, we need to create a new data frame with the relevant data.

Overall_Deaths <- data.frame(Male_UK_Death_New$Age, Male_UK_Death_New[,-1] + Female_UK_Death_New[,-1])
colnames(Overall_Deaths) <- colnames(Male_UK_Death_New)
datatable(Overall_Deaths, options=list(scrollX=TRUE))
Overall_Pop <- data.frame(Male_UK_Pop_New$Age, Male_UK_Pop_New[,-1]+Female_UK_Pop_New[,-1])
colnames(Overall_Pop) <- colnames(Male_UK_Pop_New)
datatable(Overall_Pop, options=list(scrollX=TRUE))

Exercise 5.1 Use these two data sets to create a data frame consisting of the mortality rates for the UK population over each year. That is, the probability of dying in a given year based on your age?

Before we carry on, let us actually have a look at this data visually now that we have created it:

plot(0:105, Overall_Pop$`1961`, type = "l", main = "UK Population by Age 1961", xlab = "Age", ylab = "Population")

Exercise 5.2 As actuaries, you need to be able to critically analyse data, not just use it. What do you think is the reason for these spikes and falls in the data?

plot(0:105, Overall_Pop$`1961`, type = "l", main = "UK Population by Age 1961", xlab = "Age", ylab = "Population")
abline(v = 14, col = "red", lty = 2)
abline(v = 40, col = "red", lty = 2)
abline(v = 43, col = "blue", lty = 2)

Exercise 5.3 Why would these spikes be of such importance to us as actuaries?

Exercise 5.4 Can you plot the mortality rates for every single year on one plot? Hint: Note that the line colours can also be represented by numbers, e.g., col = 1.

5.4 Conditional extraction

There may be circumstances where you want to consider some analysis of a data set, but you may only be interested in particular observations that satisfy certain conditions. For example, assume we wanted to price a pension annuity (regular payments until death), then we would only be interested in the Mortality rates of pensioners (65+). We can extract this information using the conditional arguments we have seen in previous sessions. However, before we can do this, we have to amend the Age values in the Mortality data set to be numerical/integer values rather than characters:

##   Age    1961    1962    1963    1964    1965    1966    1967    1968    1969
## 1   0 0.04584 0.04606 0.04472 0.04271 0.03995 0.04000 0.03808 0.03815 0.03696
## 2   1 0.00316 0.00299 0.00308 0.00267 0.00263 0.00279 0.00259 0.00255 0.00260
## 3   2 0.00200 0.00173 0.00175 0.00161 0.00162 0.00170 0.00162 0.00164 0.00164
## 4   3 0.00140 0.00137 0.00137 0.00121 0.00132 0.00130 0.00117 0.00137 0.00112
## 5   4 0.00114 0.00110 0.00126 0.00109 0.00116 0.00118 0.00101 0.00109 0.00108
## 6   5 0.00099 0.00106 0.00099 0.00101 0.00102 0.00096 0.00088 0.00092 0.00092
##      1970    1971    1972    1973    1974    1975    1976    1977    1978
## 1 0.03805 0.03619 0.03468 0.03424 0.03391 0.03196 0.02916 0.02905 0.02810
## 2 0.00237 0.00235 0.00236 0.00229 0.00229 0.00210 0.00200 0.00176 0.00199
## 3 0.00151 0.00143 0.00158 0.00144 0.00135 0.00120 0.00118 0.00121 0.00110
## 4 0.00114 0.00112 0.00128 0.00124 0.00116 0.00106 0.00092 0.00097 0.00096
## 5 0.00098 0.00093 0.00097 0.00093 0.00084 0.00085 0.00079 0.00082 0.00077
## 6 0.00085 0.00090 0.00083 0.00090 0.00078 0.00074 0.00076 0.00067 0.00068
##      1979    1980    1981    1982    1983    1984    1985    1986    1987
## 1 0.02702 0.02512 0.02261 0.02241 0.02087 0.01989 0.01922 0.01947 0.01893
## 2 0.00165 0.00170 0.00159 0.00162 0.00143 0.00156 0.00159 0.00137 0.00138
## 3 0.00103 0.00105 0.00101 0.00096 0.00089 0.00084 0.00091 0.00085 0.00085
## 4 0.00081 0.00075 0.00076 0.00067 0.00066 0.00070 0.00072 0.00067 0.00062
## 5 0.00074 0.00071 0.00069 0.00056 0.00063 0.00050 0.00052 0.00054 0.00052
## 6 0.00065 0.00065 0.00055 0.00050 0.00055 0.00049 0.00045 0.00044 0.00045
##      1988    1989    1990    1991    1992    1993    1994    1995    1996
## 1 0.01838 0.01722 0.01642 0.01492 0.01333 0.01303 0.01249 0.01256 0.01276
## 2 0.00133 0.00130 0.00133 0.00129 0.00107 0.00103 0.00104 0.00094 0.00096
## 3 0.00085 0.00085 0.00072 0.00076 0.00067 0.00068 0.00061 0.00054 0.00061
## 4 0.00070 0.00068 0.00061 0.00055 0.00051 0.00052 0.00041 0.00045 0.00039
## 5 0.00050 0.00054 0.00045 0.00044 0.00041 0.00041 0.00037 0.00030 0.00036
## 6 0.00042 0.00048 0.00039 0.00041 0.00036 0.00035 0.00028 0.00034 0.00032
##      1997    1998    1999    2000    2001    2002    2003    2004    2005
## 1 0.01189 0.01160 0.01175 0.01139 0.01131 0.01088 0.01108 0.01061 0.01044
## 2 0.00094 0.00095 0.00094 0.00078 0.00077 0.00086 0.00080 0.00082 0.00079
## 3 0.00057 0.00060 0.00063 0.00046 0.00048 0.00048 0.00050 0.00044 0.00043
## 4 0.00039 0.00041 0.00042 0.00038 0.00036 0.00029 0.00040 0.00037 0.00034
## 5 0.00034 0.00030 0.00030 0.00033 0.00032 0.00035 0.00031 0.00027 0.00021
## 6 0.00031 0.00029 0.00027 0.00026 0.00026 0.00025 0.00027 0.00020 0.00020
##      2006    2007    2008    2009    2010    2011    2012    2013    2014
## 1 0.01036 0.01008 0.00968 0.00957 0.00902 0.00891 0.00837 0.00791 0.00791
## 2 0.00082 0.00075 0.00071 0.00058 0.00064 0.00068 0.00060 0.00063 0.00057
## 3 0.00045 0.00047 0.00039 0.00040 0.00034 0.00037 0.00030 0.00033 0.00031
## 4 0.00031 0.00036 0.00035 0.00027 0.00030 0.00023 0.00023 0.00023 0.00025
## 5 0.00026 0.00026 0.00026 0.00023 0.00021 0.00021 0.00021 0.00021 0.00015
## 6 0.00023 0.00021 0.00024 0.00023 0.00019 0.00020 0.00023 0.00017 0.00017
##      2015    2016    2017
## 1 0.00795 0.00785 0.00796
## 2 0.00057 0.00056 0.00042
## 3 0.00029 0.00032 0.00030
## 4 0.00023 0.00023 0.00021
## 5 0.00020 0.00017 0.00020
## 6 0.00014 0.00018 0.00018
class(Mortality$Age)
## [1] "character"
Mortality$Age <- 0:105
class(Mortality$Age)
## [1] "integer"
Mort_pension <- Mortality[Mortality$Age >= 65, ]
datatable(Mort_pension, options=list(scrollX=TRUE))

How does this work? This is a combination of conditional statements and extraction. In short, the conditional statement creates a vector of TRUE/FALSE values which are then used in the matrix type extraction technique we discussed in previous weeks. The result is that R will only extract the row(column) numbers corresponding to TRUE values. This is known as ‘conditional extraction’.

Moreover, due to changes in technology and the NHS, we may decide that mortality rates pre-2000 are not valid enough to be used in our calculations, so we might only want to consider post-2000 values. Can we do this in a similar way? In general you will not be able to extract certain columns using conditional arguments since they represent different variables. However, you can do this by inspection, using matrix extraction and the which() function:

which(colnames(Mortality) == "2000")
## [1] 41
Mort_pension_2000 <- Mort_pension[,c(1,41:ncol(Mort_pension))]
datatable(Mort_pension_2000, options=list(scrollX=TRUE))

In general, it is possible to use the conditional argument format to extract rows from data sets, as long as you are conditioning on elements/variables within the data rather than column names. To see this, let us briefly revisit the mtcars data set from the previous sessions

datatable(mtcars, options=list(scrollX=TRUE))

Assume that you only want to deal with cars that have 6 or 8 cylinders:

data <- mtcars[mtcars$cyl >= 6, ]
datatable(data, options=list(scrollX=TRUE))
data1 <- mtcars[mtcars$cyl >= 6 & mtcars$hp > 100, ]
datatable(data1, options=list(scrollX=TRUE))

5.5 Adding data

As well as extracting data from data frame, it is also possible to add new data to an existing data frame. That is, add a new observation (row) or even add a new variable (column) which is possibly even calculated using the rest of the data. There are a number of ways to add a new row to a dataframe but I find the easiest is to use something similar to row extraction but for an undefined row (non-existent row). Let us first create a data frame:

new_data <- data.frame(A = c(1,2,3), B = c("Pass", "Fail", "Pass"), C = c("M", "M", "F"))
knitr::kable(new_data, align="ccc")
A B C
1 Pass M
2 Fail M
3 Pass F

Now, we can add a new row/observation to this existing data frame:

new_data[nrow(new_data)+1,] <- c(4, "Fail", "M")
knitr::kable(new_data, align="ccc")
A B C
1 Pass M
2 Fail M
3 Pass F
4 Fail M

However, this only really works when adding a single row. If you wanted to add another data frame, you can use the rowbind function rbind():

add_data <- data.frame(A = c(5,6,7), B = c("Pass", "Pass", "Pass"), C = c("F", "M", "F"))
knitr::kable(add_data, align = "ccc")
A B C
5 Pass F
6 Pass M
7 Pass F
new_data <- rbind(new_data, add_data)
knitr::kable(new_data, align = "ccc")
A B C
1 Pass M
2 Fail M
3 Pass F
4 Fail M
5 Pass F
6 Pass M
7 Pass F

With this in mind, let us revisit the Mortality data set for pensioners post 2000:

datatable(Mort_pension_2000, options=list(scrollX=TRUE))

If we want to add a new column/variable, you can do this in a very similar way as for rows. As an example, let us assume we want to add the mean of the mortality rates across years for the different ages.

Exercise 5.5 Can you create a vector of the mean mortality rates for each age using loops?

Now that we have calculated the value, let us add it to the data. This can be done using any of the following methods:

##  [1] 0.02238 0.02459 0.02689 0.02942 0.03230 0.03530 0.03894 0.04307 0.04712
## [10] 0.05205 0.05707 0.06277 0.06853 0.07508 0.08228 0.09053 0.09886 0.10854
## [19] 0.11907 0.13048 0.14266 0.15547 0.16949 0.18491 0.20157 0.21859 0.23699
## [28] 0.25800 0.27941 0.30027 0.32597 0.35101 0.37298 0.39781 0.42344 0.45314
## [37] 0.48835 0.51654 0.55021 0.57178 0.61910

Method 1

Mort_pension_2000[,ncol(Mort_pension_2000)+1] <- mean_mort
colnames(Mort_pension_2000)[ncol(Mort_pension_2000)] <- c("Mean")
datatable(Mort_pension_2000, options=list(scrollX=TRUE))

Method 2

Mort_pension_2000 <- Mort_pension_2000[,-ncol(Mort_pension_2000)] # This just removes the last column we added

Mort1 <- cbind(Mort_pension_2000, mean_mort)
colnames(Mort1)[ncol(Mort1)] <- c("Mean")
datatable(Mort1, options=list(scrollX=TRUE))

Method 3

Mort2 <- data.frame(Mort_pension_2000, mean_mort)
colnames(Mort2) <- c(colnames(Mort_pension_2000), "Mean")
datatable(Mort2, options=list(scrollX=TRUE))

5.6 The apply family

The final tool I want to talk about in this R module is the family of functions known as the apply() functions. Put simply, the apply() function, along with the its counterparts lapply(), sapply() and vapply() allow us to ‘apply’ a particular function on each row and/or column of a data frame (and other objects) without using loops.

The apply() function is basically a quicker and more convenient version of a for() loop and should always be considered first before loops as they are easier to write, read and are a lot quicker to execute which makes a huge difference when working with larger data sets.

The apply() function takes 3 main inputs (it can take more and we will discuss this soon). The first is the object you want to ‘apply’ the function to, in our case today a data frame. The second is either of the following: 1, 2, or c(1,2) where 1 indicates you want the function to be applied to the rows of the object,2 for the columns of the object and c(1,2) to individual elements. Finally, the third input is the name of the function we want to apply.

As an example, let us look at finding the means for each row of the Mort_pension_2000 data frame as we did before using for() loops.

apply(Mort_pension_2000[,-1], 1, mean)
##         66         67         68         69         70         71         72 
## 0.02238278 0.02459333 0.02689389 0.02942444 0.03229556 0.03530389 0.03894056 
##         73         74         75         76         77         78         79 
## 0.04306944 0.04711944 0.05205222 0.05707333 0.06277167 0.06853056 0.07508111 
##         80         81         82         83         84         85         86 
## 0.08228333 0.09053167 0.09885667 0.10854167 0.11906778 0.13048222 0.14266278 
##         87         88         89         90         91         92         93 
## 0.15546722 0.16949222 0.18490833 0.20157111 0.21858556 0.23699167 0.25800444 
##         94         95         96         97         98         99        100 
## 0.27940611 0.30027000 0.32596778 0.35100611 0.37298056 0.39781444 0.42343833 
##        101        102        103        104        105        106 
## 0.45313778 0.48834833 0.51654222 0.55021222 0.57178500 0.61909778

Does the output seem a little strange? You should notice that the row names are not the actual ages but the row number the ages were on! Again, be careful. Although, this can be easily rectified:

rownames(Mort_pension_2000) <- 65:105
apply(Mort_pension_2000[,-1], 1, mean)
##         65         66         67         68         69         70         71 
## 0.02238278 0.02459333 0.02689389 0.02942444 0.03229556 0.03530389 0.03894056 
##         72         73         74         75         76         77         78 
## 0.04306944 0.04711944 0.05205222 0.05707333 0.06277167 0.06853056 0.07508111 
##         79         80         81         82         83         84         85 
## 0.08228333 0.09053167 0.09885667 0.10854167 0.11906778 0.13048222 0.14266278 
##         86         87         88         89         90         91         92 
## 0.15546722 0.16949222 0.18490833 0.20157111 0.21858556 0.23699167 0.25800444 
##         93         94         95         96         97         98         99 
## 0.27940611 0.30027000 0.32596778 0.35100611 0.37298056 0.39781444 0.42343833 
##        100        101        102        103        104        105 
## 0.45313778 0.48834833 0.51654222 0.55021222 0.57178500 0.61909778

Using this simple execution, we could add the mean column to our original data set using the following lines of code:

Mort3 <- cbind(Mort_pension_2000, round(apply(Mort_pension_2000[,-1], 1, mean),digits = 5))
colnames(Mort3)[ncol(Mort3)] <- "Mean"
datatable(Mort3, options=list(scrollX=TRUE))

Exercise 5.6 With this in mind, can you now add the standard deviation for each age to the end of Mort3, using the apply() function?

Just so you can see how this works let’s look at the same function applied to columns:

apply(Mort_pension_2000[,-1], 2, mean)
##      2000      2001      2002      2003      2004      2005      2006      2007 
## 0.2137400 0.2107917 0.2145129 0.2197032 0.2065780 0.2114468 0.2028490 0.2057312 
##      2008      2009      2010      2011      2012      2013      2014      2015 
## 0.2130668 0.1988127 0.1993746 0.1943022 0.2038222 0.2058556 0.1969544 0.2119524 
##      2016      2017 
## 0.2002280 0.2057351

Finally, let us try on the individual elements (obviously we will have to use something other than mean here):

head(apply(Mort_pension_2000, c(1,2), class))
##    Age       2000      2001      2002      2003      2004      2005     
## 65 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 66 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 67 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 68 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 69 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 70 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
##    2006      2007      2008      2009      2010      2011      2012     
## 65 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 66 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 67 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 68 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 69 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## 70 "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
##    2013      2014      2015      2016      2017     
## 65 "numeric" "numeric" "numeric" "numeric" "numeric"
## 66 "numeric" "numeric" "numeric" "numeric" "numeric"
## 67 "numeric" "numeric" "numeric" "numeric" "numeric"
## 68 "numeric" "numeric" "numeric" "numeric" "numeric"
## 69 "numeric" "numeric" "numeric" "numeric" "numeric"
## 70 "numeric" "numeric" "numeric" "numeric" "numeric"

You can use ANY function in the apply() command, including those which have multiple inputs - in this case, the additional inputs are just input as additional inputs into the apply() function itself - and custom created functions like those we created in the previous week. As a final example, let us create a function and use it within apply() on the Mortality data:

min_max <- function(x, type){
  ordered <- sort(x)
  if(type == "min"){
    return (ordered[1])
  } else if (type == "max"){
    return(ordered[length(ordered)])
  }
}

apply(Mort_pension_2000[,-1], 1, min_max, type = "min")
##      65      66      67      68      69      70      71      72      73      74 
## 0.01883 0.02048 0.02189 0.02463 0.02695 0.02853 0.03255 0.03683 0.04035 0.04411 
##      75      76      77      78      79      80      81      82      83      84 
## 0.04971 0.05459 0.05976 0.06545 0.07277 0.08134 0.08918 0.09897 0.11005 0.12088 
##      85      86      87      88      89      90      91      92      93      94 
## 0.13334 0.14586 0.15333 0.16419 0.17963 0.19210 0.20916 0.22064 0.25244 0.27843 
##      95      96      97      98      99     100     101     102     103     104 
## 0.29249 0.32928 0.34647 0.37149 0.39778 0.42035 0.46917 0.48428 0.49442 0.50242 
##     105 
## 0.55556
apply(Mort_pension_2000[,-1], 1, min_max, type = "max")
##      65      66      67      68      69      70      71      72      73      74 
## 0.02785 0.03085 0.03358 0.03696 0.04059 0.04390 0.04872 0.05385 0.05899 0.06575 
##      75      76      77      78      79      80      81      82      83      84 
## 0.07002 0.07534 0.08223 0.08847 0.09495 0.10217 0.11028 0.12229 0.13242 0.14378 
##      85      86      87      88      89      90      91      92      93      94 
## 0.15974 0.17032 0.18645 0.19645 0.21615 0.23281 0.24830 0.27119 0.30073 0.32529 
##      95      96      97      98      99     100     101     102     103     104 
## 0.34906 0.37129 0.39790 0.43153 0.44902 0.49334 0.53085 0.55721 0.60396 0.62236 
##     105 
## 0.67361

There are some very subtle differences between lapply(), sapply() and vapply() that I will not go into here but please make sure to work through the DataCamp courses to understand these. For the sake of this course, the apply() function allows us to do what we want sufficiently! Feel free to now have a go at the final set of exercises below:

I hope you enjoyed this small workshop on R Programming and feel more confident with the basics of what you can do in R. I strongly suggest you continuously test yourself in R and even make up your own problems/challenges, as physically programming is really the only way to remember and improve your programming skills. Also, please do not be afraid to search the web for tips and advice, I personally find this the simplest and quickest way to learn.

I am always more than happy to help with any questions you may have, so do please not hesitate to contact me!

5.7 Exercises

  1. A traffic engineer is investigating the consistency and use of public buses in York. Based on previous data, she finds that the number of buses arriving each day fluctuates due to poor timetabling and assumes that the number of buses (arriving each day) is Poisson distributed with parameter \(\lambda = 10\). Create a vector containing 30 simulated values from this distribution to represent a simulation of a months worth of observations.
  1. Based on the same data, it is assumed that the average number of people on each bus follows a binomial distribution with parameters \(n = 60\) and \(p = 0.4\). Create a second vector containing 30 simulated values which represents the average number of people on each bus per day.
  1. Finally, the average age of the passengers riding the buses on any given day is thought to be rounded gamma distributed with parameters \(\alpha = 250\) and \(\beta = 5\). Create a vector containing 30 simulated values to represent the average age of passengers on a given day.
  1. Using the vectors created above, create a data frame with 4 columns representing the days of the week (starting from Monday), the number of buses arriving per day, the average number of passengers per bus and the average age of passengers on that given day.
  1. Add a new column to the data which shows the average number of passengers each day.
  1. Using conditional row/column extraction and the apply() function, determine the mean and standard deviation for each of the variables within the week and at the weekend separately. Compare the results of weekday vs. weekend using a bar chart. [Hint: We have not spoken explicitly about how to use bar charts, so you will have to read up on these yourselves.]
  1. Import the ‘Stock Price’ data from the Excel file on the VLE and save it as a data frame. By creating a function and using apply(), add new columns to this data frame containing the relative returns of each stock and label the new columns appropriately.
  1. Using the relative returns calculated in Exercise 7., create a plot which compares the mean relative returns per year for each of the stocks.

[Hint: You can extract the year from the date column of the data frame using the following code - format(date, format = "\%Y").]

  1. Using the apply() function, calculate the standard deviations of the returns for the different stocks within each year.

5.8 DataCamp course(s)