Chapter 7 Data in R
Up to this point, we have mostly worked with small objects that we created ourselves inside R. In practice, this is rarely how data analysis happens. As actuaries and financial analysts, you will almost always work with data that comes from external sources:
- Excel spreadsheets
- CSV files
- databases
- statistical software
- administrative records
- messy, incomplete real-world datasets
Unfortunately, real datasets are rarely clean. Common problems that arise include:
- missing values
- inconsistent formats
- incorrect data types
- duplicated records
- impossible values (negative ages, probabilities > 1, etc.)
- mixed numeric and text data
R will often try to “guess” what your data means when importing it. Sometimes it guesses correctly and sometimes it does not! A major skill in actuarial work is learning to inspect and question your data before using it.
This chapter focuses on the full workflow of working with real data, i.e. \[ \textbf{import} \rightarrow \textbf{check} \rightarrow \textbf{clean}\rightarrow \textbf{validate} \rightarrow \textbf{summarise} \] The goal is not just to teach syntax, but to develop good professional habits when handling data.
7.1 Importing data
In most cases, the observations will have been collected and stored in another program that is better suited to data collection. The two most common formats you will encounter are:
- CSV (comma-separated values)
- Excel spreadsheets
CSV files are generally safer and cleaner than Excel, but both are widely used.
Fortunately, R can import such data easily and will save the imported values into a data frame type object (known as a tibble) automatically. It is also possible to save it directly as a data frame, if preferred. At this stage, the differences between a tibble and a data frame are not important.
7.1.1 Importing data from a CSV file
CSV files are simple text files in which the information (observations, variables etc.) is separated by commas (hence the name). In their raw form, these files are not convenient to work with directly. However, their simple structure makes them much easier to handle when importing into various statistical software.
To import a CSV into R, we can simply use the read.csv() function, with the file path used as an input. If your data file exists within your working directory (the folder in which R is working and searching for files), you need only include the dataset's file name:
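For example (a sketch using a hypothetical file name):
claims <- read.csv("policy_data.csv")                      # hypothetical file sitting in the working directory
claims <- read.csv("C:/Users/me/Data/policy_data.csv")     # or supply the full path if the file is stored elsewhere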
7.1.2 Importing data from Excel
To import data into R from Excel, use the following steps:
- Click the 'Import Dataset' option within the Environment window (top right)
- Click 'From Excel…' (you may be asked to install some packages here; if so, press 'Yes' or 'Okay')
- Click 'Browse' to locate your file
- Choose the Excel file containing the data
- Edit the dataset name (if necessary)
- Choose which sheet you want to import from the Excel file (if necessary)
- Decide if you need to skip any rows due to format
- Tick first row as names (if appropriate)
- Press “Import”
- Re-save as a data frame (if necessary)
In this example I will import four different data sets (male and female deaths and population in the UK) from the ‘UK(Pop&Death).xls’ file and name them 1) Male_UK_Death, 2) Female_UK_Death, 3) Male_UK_Pop and 4) Female_UK_Pop, respectively.
Obviously I cannot show you the above steps explicitly in these notes, but they are the steps that have been used to load the data here. Below is an example of the corresponding code that appears in the console when following these steps (you may find that the directory path where you have saved the data is different, but that should be the only difference):
library(readxl)
Male_UK_Death <- read_excel("UK(Pop&Death).xls", sheet = "UK male deaths")
Female_UK_Death <- read_excel("UK(Pop&Death).xls", sheet = "UK female deaths")
Male_UK_Pop <- read_excel("UK(Pop&Death).xls", sheet = "UK male pop")
Female_UK_Pop <- read_excel("UK(Pop&Death).xls", sheet = "UK female pop")
7.2 Inspecting the data
Immediately after importing data, you should always inspect it. There are many ways to do this, but at the very least I would suggest using the following three essential commands: str(), head() and summary().
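Applied to the male deaths data imported above, these calls (or something very close to them) produce the output shown below:
str(Male_UK_Death)       # structure: dimensions and the type of each column
head(Male_UK_Death)      # the first few rows
summary(Male_UK_Death)   # summary statistics for every column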
## tibble [108 × 58] (S3: tbl_df/tbl/data.frame)
## $ Age : chr [1:108] "0" "1" "2" "3" ...
## $ 1961: num [1:108] 12047 785 484 305 261 ...
## $ 1962: num [1:108] 12707 727 423 300 258 ...
## $ 1963: num [1:108] 12408 788 437 324 297 ...
## $ 1964: num [1:108] 11989 684 406 289 262 ...
## $ 1965: num [1:108] 11325 671 407 354 280 ...
## $ 1966: num [1:108] 11105 726 471 350 305 ...
## $ 1967: num [1:108] 10254 649 435 340 266 ...
## $ 1968: num [1:108] 10319 644 439 381 289 ...
## $ 1969: num [1:108] 9894 644 453 311 305 ...
## $ 1970: num [1:108] 9713 590 394 314 258 ...
## $ 1971: num [1:108] 9366 531 374 279 255 ...
## $ 1972: num [1:108] 8393 556 357 310 237 ...
## $ 1973: num [1:108] 7781 523 365 297 255 ...
## $ 1974: num [1:108] 7180 495 311 299 202 214 189 179 140 136 ...
## $ 1975: num [1:108] 6392 402 273 255 209 ...
## $ 1976: num [1:108] 5706 388 243 217 188 ...
## $ 1977: num [1:108] 5350 314 235 181 188 167 173 131 143 130 ...
## $ 1978: num [1:108] 5220 338 194 182 152 156 152 141 125 152 ...
## $ 1979: num [1:108] 5447 265 188 149 146 ...
## $ 1980: num [1:108] 5174 319 194 131 148 ...
## $ 1981: num [1:108] 4759 310 194 137 130 ...
## $ 1982: num [1:108] 4555 335 177 143 105 ...
## $ 1983: num [1:108] 4230 268 171 137 119 100 97 94 85 93 ...
## $ 1984: num [1:108] 3995 312 165 155 93 ...
## $ 1985: num [1:108] 4003 281 186 147 114 ...
## $ 1986: num [1:108] 4219 255 158 139 101 ...
## $ 1987: num [1:108] 4105 264 171 116 106 ...
## $ 1988: num [1:108] 4110 281 168 135 96 88 84 91 96 74 ...
## $ 1989: num [1:108] 3799 275 182 140 102 ...
## $ 1990: num [1:108] 3614 273 175 126 100 ...
## $ 1991: num [1:108] 3377 267 152 132 85 ...
## $ 1992: num [1:108] 2954 218 135 105 101 ...
## $ 1993: num [1:108] 2757 223 158 108 91 ...
## $ 1994: num [1:108] 2644 207 125 88 85 ...
## $ 1995: num [1:108] 2575 177 115 100 62 ...
## $ 1996: num [1:108] 2575 192 120 85 87 ...
## $ 1997: num [1:108] 2414 191 115 81 78 ...
## $ 1998: num [1:108] 2315 190 120 84 71 ...
## $ 1999: num [1:108] 2323 190 118 84 67 ...
## $ 2000: num [1:108] 2136 156 102 69 63 ...
## $ 2001: num [1:108] 2052 143 91 62 62 ...
## $ 2002: num [1:108] 2050 160 97 58 67 48 43 43 37 53 ...
## $ 2003: num [1:108] 2047 133 87 77 59 ...
## $ 2004: num [1:108] 2033 150 79 70 46 ...
## $ 2005: num [1:108] 2117 143 90 66 40 ...
## $ 2006: num [1:108] 2078 146 85 53 44 ...
## $ 2007: num [1:108] 2113 154 104 70 51 ...
## $ 2008: num [1:108] 2123 144 70 60 45 ...
## $ 2009: num [1:108] 2067 127 73 42 34 ...
## $ 2010: num [1:108] 1915 129 77 61 39 ...
## $ 2011: num [1:108] 2049 143 81 46 40 ...
## $ 2012: num [1:108] 1912 134 74 42 46 ...
## $ 2013: num [1:108] 1741 135 74 45 44 ...
## $ 2014: num [1:108] 1646 142 72 56 30 ...
## $ 2015: num [1:108] 1752 119 67 53 49 ...
## $ 2016: num [1:108] 1654 111 64 51 34 ...
## $ 2017: num [1:108] 1664 98 58 47 47 ...
## # A tibble: 6 × 58
## Age `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 12047 12707 12408 11989 11325 11105 10254 10319 9894 9713
## 2 1 785 727 788 684 671 726 649 644 644 590
## 3 2 484 423 437 406 407 471 435 439 453 394
## 4 3 305 300 324 289 354 350 340 381 311 314
## 5 4 261 258 297 262 280 305 266 289 305 258
## 6 5 218 217 235 247 255 258 251 259 265 218
## # ℹ 47 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
## # `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## # `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## # `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## # `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
## # `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>,
## # `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, …
## Age 1961 1962 1963
## Length:108 Min. : 0 Min. : 1.0 Min. : 1.0
## Class :character 1st Qu.: 341 1st Qu.: 325.5 1st Qu.: 345.5
## Mode :character Median : 990 Median : 983.0 Median : 1031.0
## Mean : 6019 Mean : 6089.2 Mean : 6258.7
## 3rd Qu.: 6458 3rd Qu.: 6589.0 3rd Qu.: 6696.5
## Max. :322028 Max. :325770.0 Max. :334841.0
## NA's :1 NA's :1 NA's :1
## 1964 1965 1966 1967
## Min. : 1 Min. : 0 Min. : 1.0 Min. : 0.0
## 1st Qu.: 344 1st Qu.: 347 1st Qu.: 346.5 1st Qu.: 342.5
## Median : 1009 Median : 1052 Median : 983.0 Median : 922.0
## Mean : 5880 Mean : 6039 Mean : 6163.7 Mean : 5896.5
## 3rd Qu.: 6227 3rd Qu.: 6321 3rd Qu.: 6352.0 3rd Qu.: 6100.0
## Max. :314573 Max. :323083 Max. :329757.0 Max. :315461.0
## NA's :1 NA's :1 NA's :1 NA's :1
## 1968 1969 1970 1971
## Min. : 2.0 Min. : 1 Min. : 0.0 Min. : 1.0
## 1st Qu.: 337.5 1st Qu.: 345 1st Qu.: 343.5 1st Qu.: 342.5
## Median : 1025.0 Median : 985 Median : 1035.0 Median : 990.0
## Mean : 6235.4 Mean : 6314 Mean : 6249.6 Mean : 6140.9
## 3rd Qu.: 6465.5 3rd Qu.: 6301 3rd Qu.: 6205.0 3rd Qu.: 6008.5
## Max. :333594.0 Max. :337800 Max. :334355.0 Max. :328537.0
## NA's :1 NA's :1 NA's :1 NA's :1
## 1972 1973 1974 1975
## Min. : 1.0 Min. : 2.0 Min. : 3 Min. : 3.0
## 1st Qu.: 344.5 1st Qu.: 378.5 1st Qu.: 363 1st Qu.: 359.5
## Median : 963.0 Median : 996.0 Median : 951 Median : 887.0
## Mean : 6403.8 Mean : 6332.5 Mean : 6304 Mean : 6261.8
## 3rd Qu.: 6230.5 3rd Qu.: 6207.5 3rd Qu.: 6130 3rd Qu.: 5858.0
## Max. :342605.0 Max. :338788.0 Max. :337263 Max. :335006.0
## NA's :1 NA's :1 NA's :1 NA's :1
## 1976 1977 1978 1979
## Min. : 4 Min. : 5 Min. : 4 Min. : 5
## 1st Qu.: 364 1st Qu.: 352 1st Qu.: 365 1st Qu.: 349
## Median : 859 Median : 881 Median : 836 Median : 874
## Mean : 6391 Mean : 6167 Mean : 6288 Mean : 6347
## 3rd Qu.: 5753 3rd Qu.: 5340 3rd Qu.: 5436 3rd Qu.: 5588
## Max. :341910 Max. :329924 Max. :336395 Max. :339568
## NA's :1 NA's :1 NA's :1 NA's :1
## 1980 1981 1982 1983
## Min. : 3.0 Min. : 7 Min. : 8.0 Min. : 3.0
## 1st Qu.: 340.5 1st Qu.: 329 1st Qu.: 331.5 1st Qu.: 324.5
## Median : 898.0 Median : 869 Median : 793.0 Median : 846.0
## Mean : 6212.5 Mean : 6152 Mean : 6167.7 Mean : 6146.2
## 3rd Qu.: 5364.5 3rd Qu.: 5518 3rd Qu.: 5514.5 3rd Qu.: 5400.0
## Max. :332370.0 Max. :329145 Max. :329971.0 Max. :328824.0
## NA's :1 NA's :1 NA's :1 NA's :1
## 1984 1985 1986 1987
## Min. : 3 Min. : 5.0 Min. : 6.0 Min. : 8.0
## 1st Qu.: 332 1st Qu.: 331.5 1st Qu.: 342.5 1st Qu.: 354.5
## Median : 768 Median : 808.0 Median : 868.0 Median : 873.0
## Mean : 6002 Mean : 6197.4 Mean : 6115.1 Mean : 5949.2
## 3rd Qu.: 5184 3rd Qu.: 5447.5 3rd Qu.: 5522.5 3rd Qu.: 5251.0
## Max. :321095 Max. :331562.0 Max. :327159.0 Max. :318282.0
## NA's :1 NA's :1 NA's :1 NA's :1
## 1988 1989 1990 1991
## Min. : 14 Min. : 8 Min. : 10 Min. : 9
## 1st Qu.: 374 1st Qu.: 381 1st Qu.: 396 1st Qu.: 387
## Median : 887 Median : 927 Median : 911 Median : 1002
## Mean : 5965 Mean : 5985 Mean : 5880 Mean : 5877
## 3rd Qu.: 5299 3rd Qu.: 5220 3rd Qu.: 4948 3rd Qu.: 5036
## Max. :319119 Max. :320193 Max. :314601 Max. :314427
## NA's :1 NA's :1 NA's :1 NA's :1
## 1992 1993 1994 1995
## Min. : 11 Min. : 12 Min. : 12.0 Min. : 18
## 1st Qu.: 335 1st Qu.: 334 1st Qu.: 307.5 1st Qu.: 316
## Median : 885 Median : 901 Median : 896.0 Median : 885
## Mean : 5767 Mean : 5935 Mean : 5656.2 Mean : 5775
## 3rd Qu.: 5053 3rd Qu.: 5042 3rd Qu.: 4822.0 3rd Qu.: 4916
## Max. :308535 Max. :317537 Max. :302607.0 Max. :308982
## NA's :1 NA's :1 NA's :1 NA's :1
## 1996 1997 1998 1999
## Min. : 14 Min. : 8 Min. : 9.0 Min. : 10
## 1st Qu.: 298 1st Qu.: 313 1st Qu.: 321.5 1st Qu.: 299
## Median : 872 Median : 908 Median : 935.0 Median : 920
## Mean : 5728 Mean : 5640 Mean : 5601.0 Mean : 5593
## 3rd Qu.: 4935 3rd Qu.: 4663 3rd Qu.: 4770.0 3rd Qu.: 4722
## Max. :306452 Max. :301713 Max. :299655.0 Max. :299235
## NA's :1 NA's :1 NA's :1 NA's :1
## 2000 2001 2002 2003
## Min. : 15.0 Min. : 16.0 Min. : 12.0 Min. : 15.0
## 1st Qu.: 307.5 1st Qu.: 302.5 1st Qu.: 307.5 1st Qu.: 300.5
## Median : 910.0 Median : 978.0 Median : 1025.0 Median : 949.0
## Mean : 5445.6 Mean : 5382.0 Mean : 5403.4 Mean : 5405.3
## 3rd Qu.: 4617.5 3rd Qu.: 4475.0 3rd Qu.: 4553.5 3rd Qu.: 4603.5
## Max. :291337.0 Max. :287939.0 Max. :289081.0 Max. :289185.0
## NA's :1 NA's :1 NA's :1 NA's :1
## 2004 2005 2006 2007
## Min. : 13.0 Min. : 19.0 Min. : 21 Min. : 22
## 1st Qu.: 298.5 1st Qu.: 282.5 1st Qu.: 288 1st Qu.: 296
## Median : 978.0 Median : 1067.0 Median : 1043 Median : 1064
## Mean : 5213.4 Mean : 5184.1 Mean : 5125 Mean : 5138
## 3rd Qu.: 4543.5 3rd Qu.: 4389.0 3rd Qu.: 4225 3rd Qu.: 4172
## Max. :278918.0 Max. :277349.0 Max. :274201 Max. :274890
## NA's :1 NA's :1 NA's :1 NA's :1
## 2008 2009 2010 2011
## Min. : 26.0 Min. : 26 Min. : 24.0 Min. : 30
## 1st Qu.: 300.5 1st Qu.: 273 1st Qu.: 264.5 1st Qu.: 236
## Median : 1137.0 Median : 1113 Median : 1137.0 Median : 1097
## Mean : 5172.8 Mean : 5062 Mean : 5064.4 Mean : 5000
## 3rd Qu.: 4328.0 3rd Qu.: 4131 3rd Qu.: 4367.5 3rd Qu.: 4557
## Max. :276745.0 Max. :270804 Max. :270945.0 Max. :267491
## NA's :1 NA's :1 NA's :1 NA's :1
## 2012 2013 2014 2015
## Min. : 27.0 Min. : 23.0 Min. : 26 Min. : 29.0
## 1st Qu.: 227.5 1st Qu.: 233.5 1st Qu.: 226 1st Qu.: 239.5
## Median : 1061.0 Median : 1106.0 Median : 1100 Median : 1069.0
## Mean : 5109.3 Mean : 5218.1 Mean : 5205 Mean : 5471.2
## 3rd Qu.: 4792.5 3rd Qu.: 4941.0 3rd Qu.: 5018 3rd Qu.: 5326.0
## Max. :273347.0 Max. :279171.0 Max. :278455 Max. :292707.0
## NA's :1 NA's :1 NA's :1 NA's :1
## 2016 2017
## Min. : 24.0 Min. : 24.0
## 1st Qu.: 252.5 1st Qu.: 216.5
## Median : 1073.0 Median : 1076.0
## Mean : 5476.6 Mean : 5585.9
## 3rd Qu.: 5051.0 3rd Qu.: 5054.0
## Max. :293001.0 Max. :298843.0
## NA's :1 NA's :1
These commands answer three critical questions:
- What structure does the dataset have?
- What do the first few rows look like?
- Are the values sensible?
This habit is more important than memorising any function. In addition, keep in mind that R has not been told what type each variable actually is, so it guesses, and these guesses are not always correct. It is important to inspect each variable and, where necessary, re-assign its type, e.g. numeric, factor, date/time etc. This matters because many statistical functions behave differently depending on the variable type, and a column treated as text instead of numeric may silently break your analysis.
Finally, when inspecting the data, make sure to look out for missing data. This is one of the most important concepts in data analysis. In R, missing values are stored as NA and you must explicitly handle them. To detect NA values we can use is.na() which will return a TRUE/FALSE logical value:
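To count the total number of missing values in the male deaths data, we can wrap this in sum(), since each TRUE counts as 1; a call along these lines produced the figure below:
sum(is.na(Male_UK_Death))   # total number of NA entries across the whole tibble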
## [1] 58
If your dataset has missing values, these can cause issues in numerous calculations, so the NA values need to be handled beforehand.
One approach is simply to impute replacement values that are in line with the rest of the data (for example, replacing a missing entry with the mean of its column):
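A sketch of how this might be done for the 1961 column (the exact calls are assumptions, but they match the two values printed below):
sum(is.na(Male_UK_Death$`1961`))            # how many values are missing in this column
mean(Male_UK_Death$`1961`, na.rm = TRUE)    # the mean of the non-missing values
Male_UK_Death$`1961`[is.na(Male_UK_Death$`1961`)] <- mean(Male_UK_Death$`1961`, na.rm = TRUE)   # impute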
## [1] 1
## [1] 6019.215
Another approach is to remove these values from the dataset:
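A sketch, assuming we drop every row containing an NA (the object name Male_UK_Death_Complete is hypothetical):
Male_UK_Death_Complete <- na.omit(Male_UK_Death)   # keep only complete rows
sum(is.na(Male_UK_Death_Complete))                 # should now be 0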
## [1] 0
NOTE: If you do not understand how missing values are handled, you do not understand your result.
7.3 Cleaning and preparing data
In many cases, the data set we have imported may not be exactly how we want it for our analysis. There are a huge number of things you may want or need to do to tidy up, or ‘clean’, the data, as it is commonly known. For example: deal with missing data (as above), remove unneeded rows/columns, rename columns, delete outliers, create new variables etc.
As an example, let us have a little look at the data sets we created:
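The interactive tables shown in these notes come from the DT package; a sketch of the call used to display the male deaths data:
library(DT)
datatable(Male_UK_Death, options = list(scrollX = TRUE))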
We would like to remove the last two rows of this data set as they don’t contain ‘raw’ data.
Male_UK_Death_New <- Male_UK_Death[-c(107, 108),]
datatable(Male_UK_Death_New, options=list(scrollX = TRUE))
We can do the same with the rest of the data sets as they have the same issue:
Female_UK_Death_New <- Female_UK_Death[-c(107, 108),]
Male_UK_Pop_New <- Male_UK_Pop[-c(107, 108),]
Female_UK_Pop_New <- Female_UK_Pop[-c(107, 108),]
We have now tidied up our data to be in a format more beneficial to us and we could start to analyse these individually, i.e. create plots, calculate statistics, fit statistical models etc. For example:
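The regression output below presumably comes from calls along the following lines (the auxiliary data frame Aux is an assumption; Age must first be converted from text to numbers):
Aux <- Male_UK_Death_New
Aux$Age <- as.numeric(Aux$Age)       # any non-numeric entries become NA and are dropped by lm()
summary(lm(Aux$Age ~ Aux$`1961`))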
##
## Call:
## lm(formula = Aux$Age ~ Aux$`1961`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.594 -17.102 -2.754 6.391 62.101
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.189e+01 3.686e+00 11.367 < 2e-16 ***
## Aux$`1961` 3.296e-03 7.947e-04 4.147 6.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.33 on 103 degrees of freedom
## Multiple R-squared: 0.1431, Adjusted R-squared: 0.1348
## F-statistic: 17.2 on 1 and 103 DF, p-value: 6.937e-05
What if we were not interested in the individual data sets separated by sex and only cared about the overall deaths and population? That is, we want to combine the data sets. In this case, we need to create a new data frame with the relevant data:
Overall_Deaths <- data.frame(Male_UK_Death_New$Age, Male_UK_Death_New[,-1] + Female_UK_Death_New[,-1])
colnames(Overall_Deaths) <- colnames(Male_UK_Death_New)
datatable(Overall_Deaths, options=list(scrollX=TRUE))
Overall_Pop <- data.frame(Male_UK_Pop_New$Age, Male_UK_Pop_New[,-1]+Female_UK_Pop_New[,-1])
colnames(Overall_Pop) <- colnames(Male_UK_Pop_New)
datatable(Overall_Pop, options=list(scrollX=TRUE))
Mortality <- data.frame(Male_UK_Death_New$Age, Mortality = round(Overall_Deaths[,-1]/Overall_Pop[,-1], digits = 5))
datatable(Mortality, options=list(scrollX=TRUE))
Note - We could have calculated the mortality rates on the individual sex data sets to obtain a more accurate estimate of mortality based on sex.
Before we carry on, let us have a look at this data visually now that we have created it:
plot(0:105, Overall_Pop$`1961`, type = "l", main = "UK Population by Age 1961", xlab = "Age", ylab = "Population")
Exercise 7.1 As actuaries, you need to be able to critically analyse data, not just use it. What do you think is the reason for these spikes and falls in the data and why would these spikes be of such importance to us as actuaries?
plot(0:105, Overall_Pop$`1961`, type = "l", main = "UK Population by Age 1961", xlab = "Age", ylab = "Population")
abline(v = 14, col = "red", lty = 2)
abline(v = 40, col = "red", lty = 2)
abline(v = 43, col = "blue", lty = 2)
Let us also look at the overall mortality rates by age:
plot(0:105, Mortality[,2], type = "l", xlab = "Age", ylab = "Mortality Rates", main = "UK Mortality by Age")
for(i in 3:(ncol(Mortality)-2)){
lines(0:105, Mortality[,i], type = "l", col = i)
}
7.3.1 Validating your data
Cleaning your data is not quite enough. You must also check whether the data you actually have is valid and makes sense. For example, you may want to ask validation questions, such as:
- Are there negative values where none should exist?
- Are ages realistic?
- Are probabilities between 0 and 1?
- Are there duplicates?
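A few such checks on the Mortality data (a sketch; the exact calls used to produce the output below are assumptions):
sum(Mortality[,-1] < 0, na.rm = TRUE)   # count of negative mortality rates
range(Mortality[,1])                    # smallest and largest Age; Age is still text, so the ordering is alphabetical
duplicated(Mortality)                   # is any row an exact duplicate of an earlier one?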
## [1] 0
## [1] "0" "99"
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
These checks are simple but powerful. Keep in mind, many modelling errors come from unnoticed data problems. Actuaries must always validate before trusting results.
7.4 The apply family
The final tool I want to talk about in this R module is the family of functions known as the apply() functions. Put simply, the apply() function, along with its counterparts lapply(), sapply() and vapply(), allows us to ‘apply’ a particular function to each row and/or column of a data frame (and other objects) without using loops.
The apply() function is essentially a quicker and more convenient version of a for() loop and should always be considered before writing a loop: it is easier to write and read, and it executes much faster, which makes a huge difference when working with larger data sets.
The apply() function takes 3 main inputs (it can take more, as we will see shortly). The first is the object you want to ‘apply’ the function to, e.g. a data frame. The second is one of 1, 2 or c(1,2), where 1 indicates the function should be applied to the rows of the object, 2 to the columns, and c(1,2) to the individual elements. Finally, the third input is the function you want to apply.
As an example, let us look at finding the means for each row of the Mortality data frame created earlier:
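With 1 as the second argument, the function is applied across each row (the Age column is dropped first):
apply(Mortality[,-1], 1, mean)   # mean mortality rate at each age, averaged over all years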
## [1] 0.0210698246 0.0014724561 0.0008921053 0.0006880702 0.0005698246
## [6] 0.0005021053 0.0004552632 0.0004131579 0.0003922807 0.0003787719
## [11] 0.0003708772 0.0003808772 0.0004050877 0.0004466667 0.0005147368
## [16] 0.0006096491 0.0008249123 0.0011131579 0.0012385965 0.0012468421
## [21] 0.0012501754 0.0012307018 0.0012184211 0.0012080702 0.0011992982
## [26] 0.0012268421 0.0012624561 0.0012863158 0.0013300000 0.0013929825
## [31] 0.0014724561 0.0015310526 0.0016566667 0.0017384211 0.0018745614
## [36] 0.0020342105 0.0021854386 0.0023766667 0.0025982456 0.0028605263
## [41] 0.0031626316 0.0034724561 0.0038600000 0.0042473684 0.0046714035
## [46] 0.0052145614 0.0057436842 0.0063966667 0.0069812281 0.0077947368
## [51] 0.0086407018 0.0094077193 0.0104771930 0.0115415789 0.0126954386
## [56] 0.0138889474 0.0153401754 0.0169214035 0.0185714035 0.0204166667
## [61] 0.0224610526 0.0244631579 0.0269040351 0.0292515789 0.0320694737
## [66] 0.0347478947 0.0374401754 0.0411029825 0.0443675439 0.0482424561
## [71] 0.0522892982 0.0562059649 0.0618628070 0.0669026316 0.0726143860
## [76] 0.0780640351 0.0847368421 0.0907712281 0.0983756140 0.1064989474
## [81] 0.1156568421 0.1241424561 0.1359135088 0.1472559649 0.1599692982
## [86] 0.1729105263 0.1879768421 0.2014992982 0.2162587719 0.2339608772
## [91] 0.2491621053 0.2658356140 0.2889261404 0.3105215789 0.3314791228
## [96] 0.3535989474 0.3806291228 0.3973307018 0.4223912281 0.4378859649
## [101] 0.4747017544 0.5070512281 0.5217501754 0.5588196491 0.5916705263
## [106] 0.6339664912
Using this simple execution, we could add the mean column to our original data set using the following lines of code:
Mortality <- cbind(Mortality, round(apply(Mortality[,-1], 1, mean), digits = 5))
colnames(Mortality)[ncol(Mortality)] <- "Mean"
datatable(Mortality, options=list(scrollX=TRUE))
Exercise 7.2 With this in mind, can you now add the standard deviation for each age to the end of Mortality, using the apply() function?
Just so you can see how this works, let us look at the same function applied to the columns:
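The corresponding call uses 2 as the second argument; note that the Mean column added above is included in the calculation:
apply(Mortality[,-1], 2, mean)   # mean mortality rate in each year (plus the Mean column)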
## Mortality.1961 Mortality.1962 Mortality.1963 Mortality.1964 Mortality.1965
## 0.11996943 0.11418189 0.11808283 0.10151491 0.10242660
## Mortality.1966 Mortality.1967 Mortality.1968 Mortality.1969 Mortality.1970
## 0.10994462 0.09871443 0.11012717 0.10943538 0.10920491
## Mortality.1971 Mortality.1972 Mortality.1973 Mortality.1974 Mortality.1975
## 0.10152500 0.10327151 0.10089660 0.10494651 0.09937877
## Mortality.1976 Mortality.1977 Mortality.1978 Mortality.1979 Mortality.1980
## 0.10209642 0.09486764 0.09701868 0.09346330 0.09503462
## Mortality.1981 Mortality.1982 Mortality.1983 Mortality.1984 Mortality.1985
## 0.09516453 0.09313057 0.09219396 0.09105132 0.09378877
## Mortality.1986 Mortality.1987 Mortality.1988 Mortality.1989 Mortality.1990
## 0.08755594 0.08727443 0.08945642 0.09088208 0.08819132
## Mortality.1991 Mortality.1992 Mortality.1993 Mortality.1994 Mortality.1995
## 0.08853321 0.08496670 0.08968311 0.08458425 0.08674368
## Mortality.1996 Mortality.1997 Mortality.1998 Mortality.1999 Mortality.2000
## 0.08781708 0.08768066 0.08755953 0.08945094 0.08555462
## Mortality.2001 Mortality.2002 Mortality.2003 Mortality.2004 Mortality.2005
## 0.08435991 0.08575500 0.08772736 0.08253538 0.08436764
## Mortality.2006 Mortality.2007 Mortality.2008 Mortality.2009 Mortality.2010
## 0.08100934 0.08207085 0.08487500 0.07927896 0.07943726
## Mortality.2011 Mortality.2012 Mortality.2013 Mortality.2014 Mortality.2015
## 0.07738623 0.08101057 0.08180283 0.07833453 0.08415330
## Mortality.2016 Mortality.2017 Mean
## 0.07963415 0.08171896 0.09233019
Finally, let us try it on the individual elements (obviously we will have to use something other than mean here):
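The original call is not shown, so as an assumed sketch suppose we test, element by element, whether any rate in the first six rows exceeds 1:
apply(head(Mortality[,-1]), c(1,2), function(x) x > 1)   # assumed example: an element-wise logical check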
## Mortality.1961 Mortality.1962 Mortality.1963 Mortality.1964 Mortality.1965
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.1966 Mortality.1967 Mortality.1968 Mortality.1969 Mortality.1970
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.1971 Mortality.1972 Mortality.1973 Mortality.1974 Mortality.1975
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.1976 Mortality.1977 Mortality.1978 Mortality.1979 Mortality.1980
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.1981 Mortality.1982 Mortality.1983 Mortality.1984 Mortality.1985
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.1986 Mortality.1987 Mortality.1988 Mortality.1989 Mortality.1990
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.1991 Mortality.1992 Mortality.1993 Mortality.1994 Mortality.1995
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.1996 Mortality.1997 Mortality.1998 Mortality.1999 Mortality.2000
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.2001 Mortality.2002 Mortality.2003 Mortality.2004 Mortality.2005
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.2006 Mortality.2007 Mortality.2008 Mortality.2009 Mortality.2010
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.2011 Mortality.2012 Mortality.2013 Mortality.2014 Mortality.2015
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE
## Mortality.2016 Mortality.2017 Mean
## [1,] FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE
You can use ANY function in the apply() command, including functions with multiple inputs - in this case, the additional inputs of the function are included as additional inputs to the apply() call itself - and custom-written functions. As a final example, let us create a function and use it within apply() on the Mortality data:
min_max <- function(x, type){
ordered <- sort(x)
if(type == "min"){
return (ordered[1])
} else if (type == "max"){
return(ordered[length(ordered)])
}
}
apply(Mortality[,-1], 1, min_max, type = "min")
## [1] 0.00785 0.00042 0.00029 0.00021 0.00015 0.00014 0.00015 0.00013 0.00010
## [10] 0.00013 0.00010 0.00015 0.00016 0.00017 0.00021 0.00026 0.00033 0.00045
## [19] 0.00060 0.00063 0.00066 0.00066 0.00065 0.00074 0.00070 0.00074 0.00081
## [28] 0.00086 0.00093 0.00095 0.00104 0.00108 0.00116 0.00130 0.00141 0.00151
## [37] 0.00153 0.00176 0.00189 0.00211 0.00231 0.00238 0.00261 0.00291 0.00321
## [46] 0.00352 0.00367 0.00385 0.00427 0.00468 0.00515 0.00565 0.00606 0.00649
## [55] 0.00725 0.00770 0.00852 0.00940 0.01030 0.01130 0.01249 0.01360 0.01495
## [64] 0.01644 0.01786 0.01883 0.02048 0.02189 0.02463 0.02695 0.02853 0.03255
## [73] 0.03683 0.04035 0.04411 0.04971 0.05459 0.05976 0.06545 0.07277 0.08134
## [82] 0.08918 0.09897 0.11005 0.12088 0.13334 0.14586 0.15333 0.16419 0.17963
## [91] 0.19210 0.20916 0.22064 0.25244 0.27843 0.29249 0.32928 0.34647 0.37149
## [100] 0.37686 0.42035 0.44813 0.40351 0.44118 0.46429 0.30769
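The largest rate at each age is presumably obtained with the same function and type = "max":
apply(Mortality[,-1], 1, min_max, type = "max")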
## [1] 0.04606 0.00316 0.00200 0.00140 0.00126 0.00106 0.00090 0.00084 0.00079
## [10] 0.00074 0.00074 0.00071 0.00070 0.00077 0.00085 0.00101 0.00143 0.00164
## [19] 0.00173 0.00176 0.00171 0.00164 0.00171 0.00161 0.00162 0.00167 0.00171
## [28] 0.00167 0.00170 0.00189 0.00203 0.00201 0.00229 0.00230 0.00251 0.00278
## [37] 0.00294 0.00325 0.00374 0.00407 0.00437 0.00497 0.00561 0.00591 0.00660
## [46] 0.00743 0.00817 0.00927 0.00993 0.01121 0.01223 0.01332 0.01461 0.01628
## [55] 0.01767 0.01912 0.02179 0.02400 0.02639 0.02973 0.03260 0.03507 0.03961
## [64] 0.04175 0.04412 0.04736 0.05014 0.05585 0.06048 0.06440 0.06996 0.07345
## [73] 0.08260 0.09095 0.09703 0.10561 0.11713 0.12549 0.13782 0.15004 0.16189
## [82] 0.17491 0.19869 0.21315 0.23074 0.24395 0.26576 0.28506 0.30330 0.32258
## [91] 0.33400 0.35220 0.38738 0.40812 0.43428 0.44918 0.48300 0.49248 0.53333
## [100] 0.51724 0.60373 0.61491 0.66279 0.89286 0.90476 0.85714
7.4.1 lapply(), sapply() and vapply()
There are some very subtle differences between the different versions of the apply family, e.g. lapply(), sapply() and vapply(), and these differences relate to how the final output is stored.
- lapply() applies a function and always returns a list
- sapply() simplifies the structure of the output where possible
- vapply() is stricter and safer than sapply() and requires you to specify the structure of the output
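As a brief illustration (a sketch using the Mortality data: all three compute the same column means, and only the form of the result differs):
lapply(Mortality[,-1], mean)                            # a list with one element per column
sapply(Mortality[,-1], mean)                            # simplified to a named numeric vector
vapply(Mortality[,-1], mean, FUN.VALUE = numeric(1))    # like sapply(), but the output type must be declared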
Do not worry too much about these differences at this point, you will encounter them the more you use R. For now, just make sure you understand how the apply() function works.
7.5 Exercises
Exercise 7.3 You will find a CSV file named PolicyHolders.csv on the VLE. This file contains information on a small portfolio of insurance policyholders.
- Import the data into R and save it as a data frame called PolicyHolders.
- Use str(), head() and summary() to inspect the dataset.
- Identify:
  - which variables are numeric,
  - which variables are categorical,
  - whether any variables appear to have been incorrectly classified.
- How many missing values are there in the dataset in total?
- How many missing values are there in each column?
Exercise 7.4 Using the PolicyHolders dataset from the previous exercise:
- Remove any policyholders with missing ages.
- Check whether any ages are below 0 or above 110. How many such observations are there?
- Check whether any premium values are negative. If so, remove those observations.
- Create a new data frame called PolicyHolders_Clean containing only valid observations.
- Calculate the mean and standard deviation of premiums for the cleaned dataset using apply().
Exercise 7.5 You will find an Excel file named StockPrices.xlsx on the VLE which contains the daily closing prices of five common stocks from 1994 - 2015.
- Import the ‘Stock Price’ data from the Excel file on the VLE and save it as a data frame.
- Create a function and use this within the apply() function to add new columns to this data frame containing the relative returns of each stock. Label the new columns appropriately.
- Create a plot which compares the mean relative returns per year for each of the stocks. [Hint: You can extract the year from the date column of the data frame using the following code - format(date, format = "%Y").]
- Using the apply() function, calculate the standard deviations of the returns for the different stocks within each year.
7.6 Applied exercises
Exercise 7.6 You are given a CSV file named MotorInsurancePortfolio.csv on the VLE. The dataset contains information on a portfolio of motor insurance policies.
The variables within the dataset are:
- PolicyID – unique policy identifier
- Age – age of the policyholder
- VehicleValue – value of the insured vehicle (£)
- Exposure – proportion of the year insured
- Claims – number of claims reported during the year
- ClaimCost – total claim cost during the year (£)
- Import the dataset into R and inspect its structure using str(), head() and summary().
- Check for missing values in each variable. Remove any observations with missing exposure or claim cost.
- Validate the data by checking:
  - Age is between 17 and 100,
  - Exposure lies between 0 and 1,
  - ClaimCost is non-negative.
- Remove any observations that fail these checks.
- Create a new variable called ClaimFrequency, defined as:
\[
\text{ClaimFrequency} = \frac{\text{Claims}}{\text{Exposure}}
\]
- Using the apply() function, calculate:
- the mean claim cost,
- the mean claim frequency,
- the standard deviation of claim costs
for each age group, where age groups are:
- Under 25
- 25–44
- 45–64
- 65+
- Produce a plot comparing mean claim frequency across age groups.