Chapter 7 Data in R

Up to this point, we have mostly worked with small objects that we created ourselves inside R. In practice, this is rarely how data analysis happens. As actuaries and financial analysts, you will almost always work with data that comes from external sources:

  • Excel spreadsheets
  • CSV files
  • databases
  • statistical software
  • administrative records
  • messy, incomplete real-world datasets

Unfortunately, real datasets are rarely clean. Common problems that arise include:

  • missing values
  • inconsistent formats
  • incorrect data types
  • duplicated records
  • impossible values (negative ages, probabilities > 1, etc.)
  • mixed numeric and text data

R will often try to “guess” what your data means when importing it. Sometimes it guesses correctly and sometimes it does not! A major skill in actuarial work is learning to inspect and question your data before using it.

This chapter focuses on the full workflow of working with real data, i.e. \[ \textbf{import} \rightarrow \textbf{check} \rightarrow \textbf{clean} \rightarrow \textbf{validate} \rightarrow \textbf{summarise} \] The goal is not just to teach syntax, but to develop good professional habits when handling data.

7.1 Importing data

In most cases when working with data, the observations will have been collected and stored in another program that is better suited to data collection. The two most common formats you will encounter are:

  • CSV (comma-separated values)
  • Excel spreadsheets

CSV files are generally safer and cleaner than Excel, but both are widely used.

Fortunately, R can import such data easily. Depending on the function used, the imported values are stored automatically either as a plain data frame (e.g. with read.csv()) or as a closely related object known as a tibble (e.g. with the readxl functions used below). It is also possible to convert a tibble to a data frame, if preferred. At this stage, the differences between a tibble and a data frame are not important.

7.1.1 Importing data from CSV files

CSV files are simple text files in which values (observations, variables, etc.) are separated by commas (hence the name). In their raw form, these files are not convenient to work with directly. However, their simple structure makes them easy to import into most statistical software.

To import a CSV file into R, we can simply use the read.csv() function, with the file path as input. If your data file is in your working directory (the folder in which R is working directly and searching for files), you need only include the dataset’s file name:

data <- read.csv("filename.csv")
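The read.csv() function also accepts arguments that control how the file is interpreted. As a minimal sketch (where "claims.csv" is a hypothetical file name):

claims <- read.csv("claims.csv",
                   header = TRUE,             # the first row contains the column names
                   na.strings = c("", "NA"),  # treat empty cells as missing (NA)
                   stringsAsFactors = FALSE)  # keep text columns as character, not factor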

7.1.2 Importing data from Excel

To import data into R from Excel, use the following steps:

  1. Click the Import Dataset option within the Environment window (Top right)
  2. Click ‘From Excel…’ (You may be asked to install some packages here, if so press ‘Yes’ or ‘Okay’)
  3. Click ‘Browse’ to locate your file
  4. Choose the Excel file containing the data
  5. Edit the dataset name (if necessary)
  6. Choose which sheet you want to import from the Excel file (if necessary)
  7. Decide if you need to skip any rows due to format
  8. Tick first row as names (if appropriate)
  9. Press “Import”
  10. Re-save as a data frame (if necessary)

In this example I will import four different data sets (male and female deaths and populations in the UK) from the ‘UK(Pop&Death).xls’ file and name them 1) Male_UK_Death, 2) Female_UK_Death, 3) Male_UK_Pop and 4) Female_UK_Pop, respectively.

Obviously I cannot show you the above steps explicitly in these notes, but they are the steps that were used to load the data here. Below is an example of the corresponding code that appears in the console when following these steps (the directory path where you have saved the data may differ, but that should be the only difference):

library(readxl)
Male_UK_Death <- read_excel("UK(Pop&Death).xls", sheet = "UK male deaths")
Female_UK_Death <- read_excel("UK(Pop&Death).xls", sheet = "UK female deaths")
Male_UK_Pop <- read_excel("UK(Pop&Death).xls", sheet = "UK male pop")
Female_UK_Pop <- read_excel("UK(Pop&Death).xls", sheet = "UK female pop")
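If you prefer to work with a plain data frame rather than a tibble (step 10 above), the conversion is a single line. A minimal sketch (not run here, so the objects below remain tibbles):

Male_UK_Death <- as.data.frame(Male_UK_Death)  # convert the imported tibble to a plain data frame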

7.2 Inspecting the data

Immediately after importing data, you should always inspect it. There are many ways to do this, but at the very least I would suggest using the following three essential commands:

str(Male_UK_Death)
## tibble [108 × 58] (S3: tbl_df/tbl/data.frame)
##  $ Age : chr [1:108] "0" "1" "2" "3" ...
##  $ 1961: num [1:108] 12047 785 484 305 261 ...
##  $ 1962: num [1:108] 12707 727 423 300 258 ...
##  $ 1963: num [1:108] 12408 788 437 324 297 ...
##  $ 1964: num [1:108] 11989 684 406 289 262 ...
##  $ 1965: num [1:108] 11325 671 407 354 280 ...
##  $ 1966: num [1:108] 11105 726 471 350 305 ...
##  $ 1967: num [1:108] 10254 649 435 340 266 ...
##  $ 1968: num [1:108] 10319 644 439 381 289 ...
##  $ 1969: num [1:108] 9894 644 453 311 305 ...
##  $ 1970: num [1:108] 9713 590 394 314 258 ...
##  $ 1971: num [1:108] 9366 531 374 279 255 ...
##  $ 1972: num [1:108] 8393 556 357 310 237 ...
##  $ 1973: num [1:108] 7781 523 365 297 255 ...
##  $ 1974: num [1:108] 7180 495 311 299 202 214 189 179 140 136 ...
##  $ 1975: num [1:108] 6392 402 273 255 209 ...
##  $ 1976: num [1:108] 5706 388 243 217 188 ...
##  $ 1977: num [1:108] 5350 314 235 181 188 167 173 131 143 130 ...
##  $ 1978: num [1:108] 5220 338 194 182 152 156 152 141 125 152 ...
##  $ 1979: num [1:108] 5447 265 188 149 146 ...
##  $ 1980: num [1:108] 5174 319 194 131 148 ...
##  $ 1981: num [1:108] 4759 310 194 137 130 ...
##  $ 1982: num [1:108] 4555 335 177 143 105 ...
##  $ 1983: num [1:108] 4230 268 171 137 119 100 97 94 85 93 ...
##  $ 1984: num [1:108] 3995 312 165 155 93 ...
##  $ 1985: num [1:108] 4003 281 186 147 114 ...
##  $ 1986: num [1:108] 4219 255 158 139 101 ...
##  $ 1987: num [1:108] 4105 264 171 116 106 ...
##  $ 1988: num [1:108] 4110 281 168 135 96 88 84 91 96 74 ...
##  $ 1989: num [1:108] 3799 275 182 140 102 ...
##  $ 1990: num [1:108] 3614 273 175 126 100 ...
##  $ 1991: num [1:108] 3377 267 152 132 85 ...
##  $ 1992: num [1:108] 2954 218 135 105 101 ...
##  $ 1993: num [1:108] 2757 223 158 108 91 ...
##  $ 1994: num [1:108] 2644 207 125 88 85 ...
##  $ 1995: num [1:108] 2575 177 115 100 62 ...
##  $ 1996: num [1:108] 2575 192 120 85 87 ...
##  $ 1997: num [1:108] 2414 191 115 81 78 ...
##  $ 1998: num [1:108] 2315 190 120 84 71 ...
##  $ 1999: num [1:108] 2323 190 118 84 67 ...
##  $ 2000: num [1:108] 2136 156 102 69 63 ...
##  $ 2001: num [1:108] 2052 143 91 62 62 ...
##  $ 2002: num [1:108] 2050 160 97 58 67 48 43 43 37 53 ...
##  $ 2003: num [1:108] 2047 133 87 77 59 ...
##  $ 2004: num [1:108] 2033 150 79 70 46 ...
##  $ 2005: num [1:108] 2117 143 90 66 40 ...
##  $ 2006: num [1:108] 2078 146 85 53 44 ...
##  $ 2007: num [1:108] 2113 154 104 70 51 ...
##  $ 2008: num [1:108] 2123 144 70 60 45 ...
##  $ 2009: num [1:108] 2067 127 73 42 34 ...
##  $ 2010: num [1:108] 1915 129 77 61 39 ...
##  $ 2011: num [1:108] 2049 143 81 46 40 ...
##  $ 2012: num [1:108] 1912 134 74 42 46 ...
##  $ 2013: num [1:108] 1741 135 74 45 44 ...
##  $ 2014: num [1:108] 1646 142 72 56 30 ...
##  $ 2015: num [1:108] 1752 119 67 53 49 ...
##  $ 2016: num [1:108] 1654 111 64 51 34 ...
##  $ 2017: num [1:108] 1664 98 58 47 47 ...
head(Male_UK_Death)
## # A tibble: 6 × 58
##   Age   `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969` `1970`
##   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 0      12047  12707  12408  11989  11325  11105  10254  10319   9894   9713
## 2 1        785    727    788    684    671    726    649    644    644    590
## 3 2        484    423    437    406    407    471    435    439    453    394
## 4 3        305    300    324    289    354    350    340    381    311    314
## 5 4        261    258    297    262    280    305    266    289    305    258
## 6 5        218    217    235    247    255    258    251    259    265    218
## # ℹ 47 more variables: `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
## #   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
## #   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
## #   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
## #   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>,
## #   `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>,
## #   `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, …
summary(Male_UK_Death)
##      Age                 1961             1962               1963         
##  Length:108         Min.   :     0   Min.   :     1.0   Min.   :     1.0  
##  Class :character   1st Qu.:   341   1st Qu.:   325.5   1st Qu.:   345.5  
##  Mode  :character   Median :   990   Median :   983.0   Median :  1031.0  
##                     Mean   :  6019   Mean   :  6089.2   Mean   :  6258.7  
##                     3rd Qu.:  6458   3rd Qu.:  6589.0   3rd Qu.:  6696.5  
##                     Max.   :322028   Max.   :325770.0   Max.   :334841.0  
##                     NA's   :1        NA's   :1          NA's   :1         
##       1964             1965             1966               1967         
##  Min.   :     1   Min.   :     0   Min.   :     1.0   Min.   :     0.0  
##  1st Qu.:   344   1st Qu.:   347   1st Qu.:   346.5   1st Qu.:   342.5  
##  Median :  1009   Median :  1052   Median :   983.0   Median :   922.0  
##  Mean   :  5880   Mean   :  6039   Mean   :  6163.7   Mean   :  5896.5  
##  3rd Qu.:  6227   3rd Qu.:  6321   3rd Qu.:  6352.0   3rd Qu.:  6100.0  
##  Max.   :314573   Max.   :323083   Max.   :329757.0   Max.   :315461.0  
##  NA's   :1        NA's   :1        NA's   :1          NA's   :1         
##       1968               1969             1970               1971         
##  Min.   :     2.0   Min.   :     1   Min.   :     0.0   Min.   :     1.0  
##  1st Qu.:   337.5   1st Qu.:   345   1st Qu.:   343.5   1st Qu.:   342.5  
##  Median :  1025.0   Median :   985   Median :  1035.0   Median :   990.0  
##  Mean   :  6235.4   Mean   :  6314   Mean   :  6249.6   Mean   :  6140.9  
##  3rd Qu.:  6465.5   3rd Qu.:  6301   3rd Qu.:  6205.0   3rd Qu.:  6008.5  
##  Max.   :333594.0   Max.   :337800   Max.   :334355.0   Max.   :328537.0  
##  NA's   :1          NA's   :1        NA's   :1          NA's   :1         
##       1972               1973               1974             1975         
##  Min.   :     1.0   Min.   :     2.0   Min.   :     3   Min.   :     3.0  
##  1st Qu.:   344.5   1st Qu.:   378.5   1st Qu.:   363   1st Qu.:   359.5  
##  Median :   963.0   Median :   996.0   Median :   951   Median :   887.0  
##  Mean   :  6403.8   Mean   :  6332.5   Mean   :  6304   Mean   :  6261.8  
##  3rd Qu.:  6230.5   3rd Qu.:  6207.5   3rd Qu.:  6130   3rd Qu.:  5858.0  
##  Max.   :342605.0   Max.   :338788.0   Max.   :337263   Max.   :335006.0  
##  NA's   :1          NA's   :1          NA's   :1        NA's   :1         
##       1976             1977             1978             1979       
##  Min.   :     4   Min.   :     5   Min.   :     4   Min.   :     5  
##  1st Qu.:   364   1st Qu.:   352   1st Qu.:   365   1st Qu.:   349  
##  Median :   859   Median :   881   Median :   836   Median :   874  
##  Mean   :  6391   Mean   :  6167   Mean   :  6288   Mean   :  6347  
##  3rd Qu.:  5753   3rd Qu.:  5340   3rd Qu.:  5436   3rd Qu.:  5588  
##  Max.   :341910   Max.   :329924   Max.   :336395   Max.   :339568  
##  NA's   :1        NA's   :1        NA's   :1        NA's   :1       
##       1980               1981             1982               1983         
##  Min.   :     3.0   Min.   :     7   Min.   :     8.0   Min.   :     3.0  
##  1st Qu.:   340.5   1st Qu.:   329   1st Qu.:   331.5   1st Qu.:   324.5  
##  Median :   898.0   Median :   869   Median :   793.0   Median :   846.0  
##  Mean   :  6212.5   Mean   :  6152   Mean   :  6167.7   Mean   :  6146.2  
##  3rd Qu.:  5364.5   3rd Qu.:  5518   3rd Qu.:  5514.5   3rd Qu.:  5400.0  
##  Max.   :332370.0   Max.   :329145   Max.   :329971.0   Max.   :328824.0  
##  NA's   :1          NA's   :1        NA's   :1          NA's   :1         
##       1984             1985               1986               1987         
##  Min.   :     3   Min.   :     5.0   Min.   :     6.0   Min.   :     8.0  
##  1st Qu.:   332   1st Qu.:   331.5   1st Qu.:   342.5   1st Qu.:   354.5  
##  Median :   768   Median :   808.0   Median :   868.0   Median :   873.0  
##  Mean   :  6002   Mean   :  6197.4   Mean   :  6115.1   Mean   :  5949.2  
##  3rd Qu.:  5184   3rd Qu.:  5447.5   3rd Qu.:  5522.5   3rd Qu.:  5251.0  
##  Max.   :321095   Max.   :331562.0   Max.   :327159.0   Max.   :318282.0  
##  NA's   :1        NA's   :1          NA's   :1          NA's   :1         
##       1988             1989             1990             1991       
##  Min.   :    14   Min.   :     8   Min.   :    10   Min.   :     9  
##  1st Qu.:   374   1st Qu.:   381   1st Qu.:   396   1st Qu.:   387  
##  Median :   887   Median :   927   Median :   911   Median :  1002  
##  Mean   :  5965   Mean   :  5985   Mean   :  5880   Mean   :  5877  
##  3rd Qu.:  5299   3rd Qu.:  5220   3rd Qu.:  4948   3rd Qu.:  5036  
##  Max.   :319119   Max.   :320193   Max.   :314601   Max.   :314427  
##  NA's   :1        NA's   :1        NA's   :1        NA's   :1       
##       1992             1993             1994               1995       
##  Min.   :    11   Min.   :    12   Min.   :    12.0   Min.   :    18  
##  1st Qu.:   335   1st Qu.:   334   1st Qu.:   307.5   1st Qu.:   316  
##  Median :   885   Median :   901   Median :   896.0   Median :   885  
##  Mean   :  5767   Mean   :  5935   Mean   :  5656.2   Mean   :  5775  
##  3rd Qu.:  5053   3rd Qu.:  5042   3rd Qu.:  4822.0   3rd Qu.:  4916  
##  Max.   :308535   Max.   :317537   Max.   :302607.0   Max.   :308982  
##  NA's   :1        NA's   :1        NA's   :1          NA's   :1       
##       1996             1997             1998               1999       
##  Min.   :    14   Min.   :     8   Min.   :     9.0   Min.   :    10  
##  1st Qu.:   298   1st Qu.:   313   1st Qu.:   321.5   1st Qu.:   299  
##  Median :   872   Median :   908   Median :   935.0   Median :   920  
##  Mean   :  5728   Mean   :  5640   Mean   :  5601.0   Mean   :  5593  
##  3rd Qu.:  4935   3rd Qu.:  4663   3rd Qu.:  4770.0   3rd Qu.:  4722  
##  Max.   :306452   Max.   :301713   Max.   :299655.0   Max.   :299235  
##  NA's   :1        NA's   :1        NA's   :1          NA's   :1       
##       2000               2001               2002               2003         
##  Min.   :    15.0   Min.   :    16.0   Min.   :    12.0   Min.   :    15.0  
##  1st Qu.:   307.5   1st Qu.:   302.5   1st Qu.:   307.5   1st Qu.:   300.5  
##  Median :   910.0   Median :   978.0   Median :  1025.0   Median :   949.0  
##  Mean   :  5445.6   Mean   :  5382.0   Mean   :  5403.4   Mean   :  5405.3  
##  3rd Qu.:  4617.5   3rd Qu.:  4475.0   3rd Qu.:  4553.5   3rd Qu.:  4603.5  
##  Max.   :291337.0   Max.   :287939.0   Max.   :289081.0   Max.   :289185.0  
##  NA's   :1          NA's   :1          NA's   :1          NA's   :1         
##       2004               2005               2006             2007       
##  Min.   :    13.0   Min.   :    19.0   Min.   :    21   Min.   :    22  
##  1st Qu.:   298.5   1st Qu.:   282.5   1st Qu.:   288   1st Qu.:   296  
##  Median :   978.0   Median :  1067.0   Median :  1043   Median :  1064  
##  Mean   :  5213.4   Mean   :  5184.1   Mean   :  5125   Mean   :  5138  
##  3rd Qu.:  4543.5   3rd Qu.:  4389.0   3rd Qu.:  4225   3rd Qu.:  4172  
##  Max.   :278918.0   Max.   :277349.0   Max.   :274201   Max.   :274890  
##  NA's   :1          NA's   :1          NA's   :1        NA's   :1       
##       2008               2009             2010               2011       
##  Min.   :    26.0   Min.   :    26   Min.   :    24.0   Min.   :    30  
##  1st Qu.:   300.5   1st Qu.:   273   1st Qu.:   264.5   1st Qu.:   236  
##  Median :  1137.0   Median :  1113   Median :  1137.0   Median :  1097  
##  Mean   :  5172.8   Mean   :  5062   Mean   :  5064.4   Mean   :  5000  
##  3rd Qu.:  4328.0   3rd Qu.:  4131   3rd Qu.:  4367.5   3rd Qu.:  4557  
##  Max.   :276745.0   Max.   :270804   Max.   :270945.0   Max.   :267491  
##  NA's   :1          NA's   :1        NA's   :1          NA's   :1       
##       2012               2013               2014             2015         
##  Min.   :    27.0   Min.   :    23.0   Min.   :    26   Min.   :    29.0  
##  1st Qu.:   227.5   1st Qu.:   233.5   1st Qu.:   226   1st Qu.:   239.5  
##  Median :  1061.0   Median :  1106.0   Median :  1100   Median :  1069.0  
##  Mean   :  5109.3   Mean   :  5218.1   Mean   :  5205   Mean   :  5471.2  
##  3rd Qu.:  4792.5   3rd Qu.:  4941.0   3rd Qu.:  5018   3rd Qu.:  5326.0  
##  Max.   :273347.0   Max.   :279171.0   Max.   :278455   Max.   :292707.0  
##  NA's   :1          NA's   :1          NA's   :1        NA's   :1         
##       2016               2017         
##  Min.   :    24.0   Min.   :    24.0  
##  1st Qu.:   252.5   1st Qu.:   216.5  
##  Median :  1073.0   Median :  1076.0  
##  Mean   :  5476.6   Mean   :  5585.9  
##  3rd Qu.:  5051.0   3rd Qu.:  5054.0  
##  Max.   :293001.0   Max.   :298843.0  
##  NA's   :1          NA's   :1

These commands answer three critical questions:

  • What structure does the dataset have?
  • What do the first few rows look like?
  • Are the values sensible?

This habit is more important than memorising any function. In addition, keep in mind that R has not been told what the different variable types actually are, so it guesses, and its guesses are not always correct. Make sure you analyse each variable and, where necessary, re-assign its type, e.g. numeric, factor, date/time. This matters because many statistical functions behave differently depending on the variable type: a column treated as text instead of numeric may silently break your analysis.
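For example, the Age column of Male_UK_Death was imported as character (see the str() output above). A minimal sketch of converting it to numeric, using a copy so as not to overwrite the original column:

Age_num <- as.numeric(Male_UK_Death$Age)  # any non-numeric entries become NA, with a warning
str(Age_num)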

Finally, when inspecting the data, make sure to look out for missing data. This is one of the most important concepts in data analysis. In R, missing values are stored as NA and you must explicitly handle them. To detect NA values we can use is.na() which will return a TRUE/FALSE logical value:

sum(is.na(Male_UK_Death)) # Remember that TRUE/FALSE values are stored as 1 and 0. 
## [1] 58

If your dataset contains missing values, these can cause issues in numerous calculations. In such cases, we need to handle the NA values beforehand.

One approach is to keep the missing values in the data but tell individual functions to ignore them, via the na.rm argument:

sum(is.na(Male_UK_Death$"1961"))
## [1] 1
mean(Male_UK_Death$"1961",  na.rm = TRUE)
## [1] 6019.215

Another approach is to remove these values from the dataset:

Male_UK.1961 <- Male_UK_Death$"1961"[!is.na(Male_UK_Death$"1961")] 
sum(is.na(Male_UK.1961))
## [1] 0
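A third approach, not pursued further here, is imputation: replacing each NA with a value that is in line with the rest of the data, such as the column mean. A minimal sketch:

x <- Male_UK_Death$"1961"
x[is.na(x)] <- mean(x, na.rm = TRUE)  # replace each NA with the mean of the observed values
sum(is.na(x))                         # no missing values remain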

NOTE: If you do not understand how missing values are handled, you do not understand your result.

7.3 Cleaning and preparing data

In many cases, the data set that we have imported may not be in exactly the form we want for our analysis. There are a huge number of things that you may want or need to do to tidy up, or ‘clean’, your data, as this process is commonly known: for example, deal with missing data (as above), remove unimportant rows/columns, rename columns, delete outliers, create new variables, etc.

As an example, let us have a look at the data sets we created. We would like to remove the last two rows of the Male_UK_Death data set, as they do not contain ‘raw’ data.

library(DT)  # datatable() comes from the DT package
Male_UK_Death_New <- Male_UK_Death[-c(107, 108),]
datatable(Male_UK_Death_New, options=list(scrollX = TRUE))

We can do the same with the rest of the data sets, as they have the same issue:

Female_UK_Death_New <- Female_UK_Death[-c(107, 108),]
Male_UK_Pop_New <- Male_UK_Pop[-c(107, 108),]
Female_UK_Pop_New <- Female_UK_Pop[-c(107, 108),]

We have now tidied up our data into a format more useful to us, and we could start to analyse these data sets individually, i.e. create plots, calculate statistics, fit statistical models, etc. For example:

Aux <- Male_UK_Death_New[-106,]
summary(lm(Aux$Age ~ Aux$`1961`))
## 
## Call:
## lm(formula = Aux$Age ~ Aux$`1961`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -81.594 -17.102  -2.754   6.391  62.101 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.189e+01  3.686e+00  11.367  < 2e-16 ***
## Aux$`1961`  3.296e-03  7.947e-04   4.147 6.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.33 on 103 degrees of freedom
## Multiple R-squared:  0.1431, Adjusted R-squared:  0.1348 
## F-statistic:  17.2 on 1 and 103 DF,  p-value: 6.937e-05

What if we were not interested in the individual data sets separated by sex, and only cared about the overall deaths and population? That is, we want to combine the data sets. In this case, we need to create a new data frame containing the relevant data:

Overall_Deaths <- data.frame(Male_UK_Death_New$Age, Male_UK_Death_New[,-1] + Female_UK_Death_New[,-1])
colnames(Overall_Deaths) <- colnames(Male_UK_Death_New)
datatable(Overall_Deaths, options=list(scrollX=TRUE))
Overall_Pop <- data.frame(Male_UK_Pop_New$Age, Male_UK_Pop_New[,-1]+Female_UK_Pop_New[,-1])
colnames(Overall_Pop) <- colnames(Male_UK_Pop_New)
datatable(Overall_Pop, options=list(scrollX=TRUE))
Mortality <- data.frame(Male_UK_Death_New$Age, Mortality = round(Overall_Deaths[,-1]/Overall_Pop[,-1], digits = 5))
datatable(Mortality, options=list(scrollX=TRUE))

Note - we could instead have calculated the mortality rates on the individual data sets for each sex, to obtain more accurate sex-specific estimates of mortality.
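For instance, a minimal sketch of the male-specific rates (the female version is analogous):

Male_Mortality <- data.frame(Male_UK_Death_New$Age,
                             Mortality = round(Male_UK_Death_New[,-1]/Male_UK_Pop_New[,-1], digits = 5))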

Before we carry on, let us have a look at this data visually now that we have created it:

plot(0:105, Overall_Pop$`1961`, type = "l", main = "UK Population by Age 1961", xlab = "Age", ylab = "Population")

Exercise 7.1 As actuaries, you need to be able to critically analyse data, not just use it. What do you think is the reason for these spikes and falls in the data and why would these spikes be of such importance to us as actuaries?

plot(0:105, Overall_Pop$`1961`, type = "l", main = "UK Population by Age 1961", xlab = "Age", ylab = "Population")
abline(v = 14, col = "red", lty = 2)
abline(v = 40, col = "red", lty = 2)
abline(v = 43, col = "blue", lty = 2)

Let us also look at the overall mortality rates by age:

plot(0:105, Mortality[,2], type = "l", xlab = "Age", ylab = "Mortality Rates", main = "UK Mortality by Age")

for(i in 3:(ncol(Mortality)-2)){
  lines(0:105, Mortality[,i], type = "l", col = i)
}

7.3.1 Validating your data

Cleaning your data is not quite enough. You must also check whether the data you actually have is valid and makes sense. For example, you may want to ask validation questions, such as:

  • Are there negative values where none should exist?
  • Are ages realistic?
  • Are probabilities between 0 and 1?
  • Are there duplicates?
sum(Overall_Deaths$Age < 0)
## [1] 0
range(Overall_Deaths$Age)
## [1] "0"  "99"
duplicated(Overall_Deaths)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
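The remaining questions can be checked in the same spirit. As a sketch, for the mortality rates created earlier:

rates <- as.matrix(Mortality[,-1])
sum(rates < 0 | rates > 1, na.rm = TRUE)  # count of impossible (out-of-range) rates
sum(duplicated(Overall_Deaths))           # a compact version of the duplicate check above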

These checks are simple but powerful. Notice, for instance, that range(Overall_Deaths$Age) returned the quoted values “0” and “99” rather than numbers: Age is still stored as character, so the ‘maximum’ is found alphabetically rather than numerically. This is exactly the kind of subtle type problem that validation is designed to catch. Keep in mind that many modelling errors come from unnoticed data problems; actuaries must always validate before trusting results.


7.4 The apply family

The final tool I want to talk about in this R module is the family of functions known as the apply() functions. Put simply, the apply() function, along with its counterparts lapply(), sapply() and vapply(), allows us to ‘apply’ a particular function to each row and/or column of a data frame (and other objects) without using loops.

The apply() function is essentially a quicker and more convenient version of a for() loop and should always be considered before writing a loop: it is easier to write and read, and is often quicker to execute, which can make a big difference when working with larger data sets.

The apply() function takes three main inputs (it can take more, as we will discuss shortly). The first is the object you want to ‘apply’ the function to, e.g. a data frame. The second is one of the following: 1, 2 or c(1,2), where 1 indicates that the function should be applied to the rows of the object, 2 to the columns, and c(1,2) to the individual elements. Finally, the third input is the function you want to apply.

As an example, let us look at finding the means for each row of the Mortality data frame created earlier:

apply(Mortality[,-1], 1, mean)
##   [1] 0.0210698246 0.0014724561 0.0008921053 0.0006880702 0.0005698246
##   [6] 0.0005021053 0.0004552632 0.0004131579 0.0003922807 0.0003787719
##  [11] 0.0003708772 0.0003808772 0.0004050877 0.0004466667 0.0005147368
##  [16] 0.0006096491 0.0008249123 0.0011131579 0.0012385965 0.0012468421
##  [21] 0.0012501754 0.0012307018 0.0012184211 0.0012080702 0.0011992982
##  [26] 0.0012268421 0.0012624561 0.0012863158 0.0013300000 0.0013929825
##  [31] 0.0014724561 0.0015310526 0.0016566667 0.0017384211 0.0018745614
##  [36] 0.0020342105 0.0021854386 0.0023766667 0.0025982456 0.0028605263
##  [41] 0.0031626316 0.0034724561 0.0038600000 0.0042473684 0.0046714035
##  [46] 0.0052145614 0.0057436842 0.0063966667 0.0069812281 0.0077947368
##  [51] 0.0086407018 0.0094077193 0.0104771930 0.0115415789 0.0126954386
##  [56] 0.0138889474 0.0153401754 0.0169214035 0.0185714035 0.0204166667
##  [61] 0.0224610526 0.0244631579 0.0269040351 0.0292515789 0.0320694737
##  [66] 0.0347478947 0.0374401754 0.0411029825 0.0443675439 0.0482424561
##  [71] 0.0522892982 0.0562059649 0.0618628070 0.0669026316 0.0726143860
##  [76] 0.0780640351 0.0847368421 0.0907712281 0.0983756140 0.1064989474
##  [81] 0.1156568421 0.1241424561 0.1359135088 0.1472559649 0.1599692982
##  [86] 0.1729105263 0.1879768421 0.2014992982 0.2162587719 0.2339608772
##  [91] 0.2491621053 0.2658356140 0.2889261404 0.3105215789 0.3314791228
##  [96] 0.3535989474 0.3806291228 0.3973307018 0.4223912281 0.4378859649
## [101] 0.4747017544 0.5070512281 0.5217501754 0.5588196491 0.5916705263
## [106] 0.6339664912

Using this simple calculation, we could add the mean column to our original data set with the following lines of code:

Mortality <- cbind(Mortality, round(apply(Mortality[,-1], 1, mean), digits = 5))
colnames(Mortality)[ncol(Mortality)] <- "Mean"
datatable(Mortality, options=list(scrollX=TRUE))

Exercise 7.2 With this in mind, can you now add the standard deviation for each age to the end of Mortality, using the apply() function?

Just so you can see how this works, let us look at the same function applied to the columns:

apply(Mortality[,-1], 2, mean)
## Mortality.1961 Mortality.1962 Mortality.1963 Mortality.1964 Mortality.1965 
##     0.11996943     0.11418189     0.11808283     0.10151491     0.10242660 
## Mortality.1966 Mortality.1967 Mortality.1968 Mortality.1969 Mortality.1970 
##     0.10994462     0.09871443     0.11012717     0.10943538     0.10920491 
## Mortality.1971 Mortality.1972 Mortality.1973 Mortality.1974 Mortality.1975 
##     0.10152500     0.10327151     0.10089660     0.10494651     0.09937877 
## Mortality.1976 Mortality.1977 Mortality.1978 Mortality.1979 Mortality.1980 
##     0.10209642     0.09486764     0.09701868     0.09346330     0.09503462 
## Mortality.1981 Mortality.1982 Mortality.1983 Mortality.1984 Mortality.1985 
##     0.09516453     0.09313057     0.09219396     0.09105132     0.09378877 
## Mortality.1986 Mortality.1987 Mortality.1988 Mortality.1989 Mortality.1990 
##     0.08755594     0.08727443     0.08945642     0.09088208     0.08819132 
## Mortality.1991 Mortality.1992 Mortality.1993 Mortality.1994 Mortality.1995 
##     0.08853321     0.08496670     0.08968311     0.08458425     0.08674368 
## Mortality.1996 Mortality.1997 Mortality.1998 Mortality.1999 Mortality.2000 
##     0.08781708     0.08768066     0.08755953     0.08945094     0.08555462 
## Mortality.2001 Mortality.2002 Mortality.2003 Mortality.2004 Mortality.2005 
##     0.08435991     0.08575500     0.08772736     0.08253538     0.08436764 
## Mortality.2006 Mortality.2007 Mortality.2008 Mortality.2009 Mortality.2010 
##     0.08100934     0.08207085     0.08487500     0.07927896     0.07943726 
## Mortality.2011 Mortality.2012 Mortality.2013 Mortality.2014 Mortality.2015 
##     0.07738623     0.08101057     0.08180283     0.07833453     0.08415330 
## Mortality.2016 Mortality.2017           Mean 
##     0.07963415     0.08171896     0.09233019

Finally, let us try it on the individual elements (obviously we will have to use something other than mean here):

head(apply(Mortality[,-1], c(1,2), is.na))
##      Mortality.1961 Mortality.1962 Mortality.1963 Mortality.1964 Mortality.1965
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.1966 Mortality.1967 Mortality.1968 Mortality.1969 Mortality.1970
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.1971 Mortality.1972 Mortality.1973 Mortality.1974 Mortality.1975
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.1976 Mortality.1977 Mortality.1978 Mortality.1979 Mortality.1980
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.1981 Mortality.1982 Mortality.1983 Mortality.1984 Mortality.1985
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.1986 Mortality.1987 Mortality.1988 Mortality.1989 Mortality.1990
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.1991 Mortality.1992 Mortality.1993 Mortality.1994 Mortality.1995
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.1996 Mortality.1997 Mortality.1998 Mortality.1999 Mortality.2000
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.2001 Mortality.2002 Mortality.2003 Mortality.2004 Mortality.2005
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.2006 Mortality.2007 Mortality.2008 Mortality.2009 Mortality.2010
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.2011 Mortality.2012 Mortality.2013 Mortality.2014 Mortality.2015
## [1,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [2,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [3,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [4,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [5,]          FALSE          FALSE          FALSE          FALSE          FALSE
## [6,]          FALSE          FALSE          FALSE          FALSE          FALSE
##      Mortality.2016 Mortality.2017  Mean
## [1,]          FALSE          FALSE FALSE
## [2,]          FALSE          FALSE FALSE
## [3,]          FALSE          FALSE FALSE
## [4,]          FALSE          FALSE FALSE
## [5,]          FALSE          FALSE FALSE
## [6,]          FALSE          FALSE FALSE

You can use ANY function in the apply() command, including functions that take multiple inputs (in this case, the additional inputs are passed as extra arguments to apply() itself) and custom functions you have written yourself. As a final example, let us create a function and use it within apply() on the Mortality data:

min_max <- function(x, type){
  ordered <- sort(x)                    # sort the values in increasing order
  if(type == "min"){
    return(ordered[1])                  # smallest value
  } else if(type == "max"){
    return(ordered[length(ordered)])    # largest value
  }
}

apply(Mortality[,-1], 1, min_max, type = "min")
##   [1] 0.00785 0.00042 0.00029 0.00021 0.00015 0.00014 0.00015 0.00013 0.00010
##  [10] 0.00013 0.00010 0.00015 0.00016 0.00017 0.00021 0.00026 0.00033 0.00045
##  [19] 0.00060 0.00063 0.00066 0.00066 0.00065 0.00074 0.00070 0.00074 0.00081
##  [28] 0.00086 0.00093 0.00095 0.00104 0.00108 0.00116 0.00130 0.00141 0.00151
##  [37] 0.00153 0.00176 0.00189 0.00211 0.00231 0.00238 0.00261 0.00291 0.00321
##  [46] 0.00352 0.00367 0.00385 0.00427 0.00468 0.00515 0.00565 0.00606 0.00649
##  [55] 0.00725 0.00770 0.00852 0.00940 0.01030 0.01130 0.01249 0.01360 0.01495
##  [64] 0.01644 0.01786 0.01883 0.02048 0.02189 0.02463 0.02695 0.02853 0.03255
##  [73] 0.03683 0.04035 0.04411 0.04971 0.05459 0.05976 0.06545 0.07277 0.08134
##  [82] 0.08918 0.09897 0.11005 0.12088 0.13334 0.14586 0.15333 0.16419 0.17963
##  [91] 0.19210 0.20916 0.22064 0.25244 0.27843 0.29249 0.32928 0.34647 0.37149
## [100] 0.37686 0.42035 0.44813 0.40351 0.44118 0.46429 0.30769
apply(Mortality[,-1], 1, min_max, type = "max")
##   [1] 0.04606 0.00316 0.00200 0.00140 0.00126 0.00106 0.00090 0.00084 0.00079
##  [10] 0.00074 0.00074 0.00071 0.00070 0.00077 0.00085 0.00101 0.00143 0.00164
##  [19] 0.00173 0.00176 0.00171 0.00164 0.00171 0.00161 0.00162 0.00167 0.00171
##  [28] 0.00167 0.00170 0.00189 0.00203 0.00201 0.00229 0.00230 0.00251 0.00278
##  [37] 0.00294 0.00325 0.00374 0.00407 0.00437 0.00497 0.00561 0.00591 0.00660
##  [46] 0.00743 0.00817 0.00927 0.00993 0.01121 0.01223 0.01332 0.01461 0.01628
##  [55] 0.01767 0.01912 0.02179 0.02400 0.02639 0.02973 0.03260 0.03507 0.03961
##  [64] 0.04175 0.04412 0.04736 0.05014 0.05585 0.06048 0.06440 0.06996 0.07345
##  [73] 0.08260 0.09095 0.09703 0.10561 0.11713 0.12549 0.13782 0.15004 0.16189
##  [82] 0.17491 0.19869 0.21315 0.23074 0.24395 0.26576 0.28506 0.30330 0.32258
##  [91] 0.33400 0.35220 0.38738 0.40812 0.43428 0.44918 0.48300 0.49248 0.53333
## [100] 0.51724 0.60373 0.61491 0.66279 0.89286 0.90476 0.85714

7.4.1 lapply(), sapply() and vapply()

There are some very subtle differences between the members of the apply family, e.g. lapply(), sapply() and vapply(), and these differences relate to how the final output is stored.

  • lapply() applies a function and always returns a list.
  • sapply() simplifies the structure of the output where possible, e.g. to a vector or matrix.
  • vapply() is stricter and safer than sapply() and requires you to specify the output structure in advance.

Do not worry too much about these differences at this point, you will encounter them the more you use R. For now, just make sure you understand how the apply() function works.
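To make these differences concrete, here is a minimal sketch applying mean to each column of the Mortality data frame from above:

means_list <- lapply(Mortality[,-1], mean)              # always returns a list
means_vec  <- sapply(Mortality[,-1], mean)              # simplified here to a named numeric vector
means_safe <- vapply(Mortality[,-1], mean, numeric(1))  # errors unless each result is a single number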

7.5 Exercises

Exercise 7.3 You will find a CSV file named PolicyHolders.csv on the VLE. This file contains information on a small portfolio of insurance policyholders.

  1. Import the data into R and save it as a data frame called PolicyHolders.

  2. Use str(), head() and summary() to inspect the dataset.

  3. Identify:

  • which variables are numeric,
  • which variables are categorical,
  • whether any variables appear to have been incorrectly classified.
  4. How many missing values are there in the dataset in total?

  5. How many missing values are there in each column?


Exercise 7.4 Using the PolicyHolders dataset from the previous exercise:

  1. Remove any policyholders with missing ages.

  2. Check whether any ages are below 0 or above 110. How many such observations are there?

  3. Check whether any premium values are negative. If so, remove those observations.

  4. Create a new data frame called PolicyHolders_Clean containing only valid observations.

  5. Calculate the mean and standard deviation of premiums for the cleaned dataset using apply().


Exercise 7.5 You will find an Excel file named StockPrices.xlsx on the VLE which contains the daily closing prices of five common stocks from 1994 - 2015.

  1. Import the ‘Stock Price’ data from the Excel file on the VLE and save it as a data frame.

  2. Create a function and use this within the apply() function to add new columns to this data frame containing the relative returns of each stock. Label the new columns appropriately.

  3. Create a plot which compares the mean relative returns per year for each of the stocks. [Hint: You can extract the year from the date column of the data frame using the following code - format(date, format = "%Y").]

  4. Using the apply() function, calculate the standard deviations of the returns for the different stocks within each year.


7.6 Applied exercises

Exercise 7.6 You are given a CSV file named MotorInsurancePortfolio.csv on the VLE. The dataset contains information on a portfolio of motor insurance policies.

The variables within the dataset are:

  • PolicyID – unique policy identifier
  • Age – age of the policyholder
  • VehicleValue – value of the insured vehicle (£)
  • Exposure – proportion of the year insured
  • Claims – number of claims reported during the year
  • ClaimCost – total claim cost during the year (£)
  1. Import the dataset into R and inspect its structure using str(), head() and summary().

  2. Check for missing values in each variable. Remove any observations with missing exposure or claim cost.

  3. Validate the data by checking:

  • Age is between 17 and 100,
  • Exposure lies between 0 and 1,
  • ClaimCost is non-negative.

Remove any observations that fail these checks.

  4. Create a new variable called ClaimFrequency defined as:

\[ \text{ClaimFrequency} = \frac{\text{Claims}}{\text{Exposure}} \]

  5. Using the apply() function, calculate:

  • the mean claim cost,
  • the mean claim frequency,
  • the standard deviation of claim costs

for each age group, where age groups are:

  • Under 25
  • 25–44
  • 45–64
  • 65+
  6. Produce a plot comparing mean claim frequency across age groups.