Chapter 3 Object Extraction

Object extraction is one of the most important practical skills in R. Almost all real data analysis involves selecting specific elements from vectors, matrices, data frames, or lists. In this chapter, we introduce explicit object extraction using square brackets and related operators. We will consider a more advanced version, known as conditional extraction, in the following chapters.

3.1 Vector extraction

Let us consider the vector of values \(\{0, 0.1, 0.2, \ldots, 10\}\). Now assume that you want to ‘extract’ the first 11 values from this vector, i.e. the values \(0\) to \(1\). To extract values from a vector, you can use square brackets [] immediately after the vector object to inform R which elements you want to extract. Inside the [] brackets you should indicate the position of the elements you want to extract:

x <- seq(from = 0, to = 10, by = 0.1)
x[1:11] # Extracts the first 11 elements
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
x[c(1,3,5,7,9)] # This extracts the 1st, 3rd, 5th, 7th and 9th elements
## [1] 0.0 0.2 0.4 0.6 0.8

To extract everything except certain elements, we can use a negative index:

x[-(1:11)] # The negative sign means extract everything except the specified elements
##  [1]  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5
## [16]  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0
## [31]  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
## [46]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9  7.0
## [61]  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4  8.5
## [76]  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9 10.0
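The parentheses in x[-(1:11)] are doing real work. The short sketch below, which reuses the vector x defined earlier, compares the two spellings; the names dropped and mixed are introduced here purely for illustration, and tryCatch() is used only so that the failed attempt does not stop the script.

```r
x <- seq(from = 0, to = 10, by = 0.1)

# With parentheses, -(1:11) is the vector -1, -2, ..., -11
dropped <- x[-(1:11)]
length(dropped)   # 101 - 11 = 90 elements remain

# Without parentheses, -1:11 is the sequence -1, 0, 1, ..., 11.
# This mixes negative and positive positions, which R rejects with an error.
mixed <- tryCatch(x[-1:11], error = function(e) "error")
mixed
```

If mixed prints "error", the second form was rejected, so it is safest to always wrap the range in parentheses before negating it.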

Notice the comments in the code above? A comment starts with the hash symbol (#), and R ignores everything after it on that line. Commenting your code is a habit I strongly recommend you adopt. I have given more information about this in the supplementary chapter (Additional Tips) at the end of these notes.

To give you a little more context on how and where this might be helpful, take a look at the following simple example concerning the heights of individuals in a given classroom:

Example 3.1 Assume that the heights (in cm) of 80 individuals in a given classroom were measured and recorded in the variable height_data given below:

##  [1] 155.5811 162.0333 168.9776 163.6076 172.7571 180.7491 170.2831 161.2642
##  [9] 161.8752 187.9392 148.7173 152.8617 182.0168 171.4847 209.7244 181.5340
## [17] 156.3352 162.7277 180.8836 204.8227 179.5608 180.2525 195.4984 154.2021
## [25] 166.0309 206.2509 208.2360 163.7382 181.6033 186.9539 164.7755 161.2731
## [33] 165.1258 188.2063 170.5172 175.2336 176.5116 161.0646 185.4073 173.8013
## [41] 170.5533 168.1974 199.7604 150.8012 200.6002 158.4969 162.2994 176.7569
## [49] 174.5441 192.6645 141.5822 162.5169 157.3972 146.4777 159.1934 170.6703
## [57] 183.2897 157.2363 177.2096 167.2124 176.7870 157.4096 178.6680 186.3012
## [65] 165.8819 166.9002 157.1681 159.1650 172.1028 168.0411 187.8667 168.8181
## [73] 161.7007 153.6000 174.1937 171.3589 191.7009 151.1041 159.8565 192.5666

Now assume that we want to find the average height of the 20 shortest individuals in the classroom:

(height_sorted <- sort(height_data))
##  [1] 141.5822 146.4777 148.7173 150.8012 151.1041 152.8617 153.6000 154.2021
##  [9] 155.5811 156.3352 157.1681 157.2363 157.3972 157.4096 158.4969 159.1650
## [17] 159.1934 159.8565 161.0646 161.2642 161.2731 161.7007 161.8752 162.0333
## [25] 162.2994 162.5169 162.7277 163.6076 163.7382 164.7755 165.1258 165.8819
## [33] 166.0309 166.9002 167.2124 168.0411 168.1974 168.8181 168.9776 170.2831
## [41] 170.5172 170.5533 170.6703 171.3589 171.4847 172.1028 172.7571 173.8013
## [49] 174.1937 174.5441 175.2336 176.5116 176.7569 176.7870 177.2096 178.6680
## [57] 179.5608 180.2525 180.7491 180.8836 181.5340 181.6033 182.0168 183.2897
## [65] 185.4073 186.3012 186.9539 187.8667 187.9392 188.2063 191.7009 192.5666
## [73] 192.6645 195.4984 199.7604 200.6002 204.8227 206.2509 208.2360 209.7244
smallest.20 <- height_sorted[1:20]
mean(smallest.20) # Note I could have done all of this in one line. 
## [1] 154.9757

This example illustrates how extraction is often combined with other functions to answer practical questions.
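As the comment in the code above hints, the sort, extract and average steps can be chained into a single line. The sketch below is a minimal illustration under one assumption: height_small is a hypothetical shortened vector holding only the first 8 recorded heights from Example 3.1, so that the code is self-contained, and it averages the 3 shortest rather than the 20 shortest.

```r
# Hypothetical shortened data: the first 8 heights listed in Example 3.1
height_small <- c(155.5811, 162.0333, 168.9776, 163.6076,
                  172.7571, 180.7491, 170.2831, 161.2642)

# Sort, extract and average in a single line
mean_shortest_3 <- mean(sort(height_small)[1:3])
mean_shortest_3

# Equivalent step-by-step version, which gives the same value
sorted_small <- sort(height_small)
mean(sorted_small[1:3])
```

The one-line version is more compact, while the step-by-step version keeps the intermediate sorted vector available for inspection; which is preferable is largely a matter of readability.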

3.2 Matrix extraction

Just as we can extract values from vectors, we can extract values from matrices, again using square brackets []. However, matrices require two indices: the first for the row(s) and the second for the column(s) you would like to extract:

A <- matrix(c(3, 6, 4, 2), nrow = 2, byrow = TRUE)
A
##      [,1] [,2]
## [1,]    3    6
## [2,]    4    2

Extracting a single element:

A[1,1]
## [1] 3
A[2,1]
## [1] 4

Extracting multiple rows from a column:

A[c(1,2), 1]
## [1] 3 4

Extracting an entire row or column is done by leaving one index blank:

A[,1] 
## [1] 3 4
A[1,]
## [1] 3 6
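The examples above all return vectors. The following sketch shows when a matrix comes back instead. It uses a hypothetical 3 x 3 matrix M (not part of the original example), along with the drop = FALSE argument of the [] operator, which asks R not to simplify the result.

```r
M <- matrix(1:9, nrow = 3)   # hypothetical 3 x 3 matrix, filled column by column

sub <- M[1:2, 2:3]     # rows 1-2 and columns 2-3: the result is still a matrix
dim(sub)

row_vec <- M[1, ]      # a single row is simplified to a plain vector
dim(row_vec)           # NULL, because plain vectors have no dimensions

row_mat <- M[1, , drop = FALSE]   # drop = FALSE keeps a 1 x 3 matrix
dim(row_mat)
```

Keeping the matrix structure can matter when the result is passed on to matrix operations that expect two dimensions.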

3.2.1 Matrices with named rows and columns

In many cases, your matrix will have row and column names to help identify the meaning behind the values. Row and column names can make extraction clearer and less error-prone:

B <- matrix(1:9, nrow = 3)
rownames(B) <- c("r1", "r2", "r3")
colnames(B) <- c("c1", "c2", "c3")
B
##    c1 c2 c3
## r1  1  4  7
## r2  2  5  8
## r3  3  6  9

Extracting using names:

B["r1", "c2"]
## [1] 4
B[, "c3"]
## r1 r2 r3 
##  7  8  9
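Names can also be supplied as vectors, in the same way as positions. The sketch below reuses the matrix B from above; the name corners is introduced only for this illustration.

```r
B <- matrix(1:9, nrow = 3)
rownames(B) <- c("r1", "r2", "r3")
colnames(B) <- c("c1", "c2", "c3")

# Several row names and several column names at once: a 2 x 2 named sub-matrix
corners <- B[c("r1", "r3"), c("c1", "c3")]
corners

# Names and positions can be mixed across the two indices
B["r2", 2:3]
```

Extracting by name keeps working if rows or columns are later reordered, which is one reason it tends to be less error-prone than extracting by position.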

3.3 Data frame extraction

Data frames are similar to matrices but allow different data types in each column. Extraction rules are therefore slightly richer and extraction via column (variable) names becomes much simpler:

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(21, 22, 20),
  Score = c(78, 85, 90)
)
df
##      Name Age Score
## 1   Alice  21    78
## 2     Bob  22    85
## 3 Charlie  20    90

Data frames still support extraction via matrix-style indexing (see above):

df[1, 2] # First row, second column
## [1] 21
df[, 1] # Entire first column
## [1] "Alice"   "Bob"     "Charlie"
df[2, ] # Entire second row
##   Name Age Score
## 2  Bob  22    85

Extraction by column (variable) name can now be done in two different ways: using named matrix-style indexing as above, or using the $ operator, which extracts a single named column and returns it as a vector:

df[, "Age"]
## [1] 21 22 20
score <- df$Score
str(score)
##  num [1:3] 78 85 90

Note: By default, extracting a single column with df[, j] returns a vector. To keep the result as a data frame, include drop = FALSE:

str(df[ , "Age", drop = FALSE])
## 'data.frame':    3 obs. of  1 variable:
##  $ Age: num  21 22 20
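A few further variations on column extraction are sketched below, reusing the df defined above. Two forms go slightly beyond the text so far: double brackets [[ ]], which return the column itself, and a single index without a comma, which keeps the data frame structure. The names two_cols, age_vec and age_df are hypothetical.

```r
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(21, 22, 20),
  Score = c(78, 85, 90)
)

two_cols <- df[, c("Name", "Score")]   # two columns: the result is a data frame
is.data.frame(two_cols)

age_vec <- df[["Age"]]    # double brackets return the column itself, like $
identical(age_vec, df$Age)

age_df <- df["Age"]       # a single index (no comma) keeps a one-column data frame
is.data.frame(age_df)
```

Each of the three checks should print TRUE.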

3.4 List extraction

As already discussed, lists are the most flexible object type in R. They can contain elements of different types and structures. As such, extraction is also more flexible:

my_list <- list(
  numbers = c(1, 2, 3),
  matrix = matrix(1:4, nrow = 2),
  info = "Sample list"
)
my_list
## $numbers
## [1] 1 2 3
## 
## $matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $info
## [1] "Sample list"

For a list, extraction using single [] brackets returns a sub-list:

my_list[1]
## $numbers
## [1] 1 2 3
my_list["numbers"]
## $numbers
## [1] 1 2 3

If you want to explicitly extract the contents of a particular list element, you have to use double brackets [[]]:

my_list[[1]]
## [1] 1 2 3
my_list[["numbers"]]
## [1] 1 2 3

As with data frames, when the elements of a list are named we can use the $ operator as a shortcut:

my_list$numbers
## [1] 1 2 3
my_list$info
## [1] "Sample list"
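Once the contents of a list element have been extracted, the usual vector or matrix extraction can be applied straight away. The following sketch reuses my_list from above; second_number and top_right are illustrative names only.

```r
my_list <- list(
  numbers = c(1, 2, 3),
  matrix = matrix(1:4, nrow = 2),
  info = "Sample list"
)

class(my_list[1])      # "list": single brackets keep the list wrapper
class(my_list[[1]])    # "numeric": double brackets return the contents

# Chained extraction: first pull out the element, then index into it
second_number <- my_list$numbers[2]        # 2nd element of the numbers vector
top_right <- my_list[["matrix"]][1, 2]     # row 1, column 2 of the matrix
second_number
top_right
```

Note that this chaining does not work after single brackets in the same way: my_list[1] is still a list, so a further [2] would look for a second list element rather than the second number.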

3.5 Summary of extraction methods

Extraction is always about structure. Understanding the type of object you are working with tells you which extraction method to use and what kind of object will be returned. Below is a table to help you:

| Object Type | Extraction Method | Description | Returned Object |
|---|---|---|---|
| Vector | x[i] | Extracts element(s) at position(s) i | Vector |
| Matrix | A[i, j] | Extracts element(s) from row(s) i and column(s) j | Vector or Matrix |
| Matrix | A[, j] | Extracts entire column j | Vector |
| Matrix | A[i, ] | Extracts entire row i | Vector |
| Data Frame | df[i, j] | Extracts rows i and columns j | Data frame or vector |
| Data Frame | df[, j] | Extracts column j | Vector |
| Data Frame | df[, j, drop = FALSE] | Extracts column j and preserves data frame structure | Data frame |
| Data Frame | df$col | Extracts column named col | Vector |
| List | lst[i] | Extracts element(s) as a sub-list | List |
| List | lst[[i]] | Extracts the contents of element i | Any |
| List | lst$name | Extracts named element name | Any |

3.6 Exercises

Exercise 3.1 The daily closing prices (in GBP) of a stock over 12 trading days were recorded as follows:

prices <- c(102.5, 103.1, 104.0, 103.8, 104.5, 105.2, 104.9, 105.6, 106.1, 105.8, 106.4, 107.0)
  1. Copy and paste this code into R and run it to create a vector of prices.

  2. Extract the prices from days 1 to 5.

  3. Extract the prices recorded on days 2, 4, 6, and 8.

  4. Extract all prices except those from the first 3 days.

  5. Compute the mean of the last 5 prices using extraction.

  6. Extract the lowest and highest closing price over these 12 days using the sort() function and extraction.


Exercise 3.2 The matrix below contains annual spot interest rates (in %) for three maturities across four years:

rates <- matrix(
  c(1.2, 1.5, 1.8,
    1.4, 1.7, 2.0,
    1.6, 1.9, 2.2,
    1.8, 2.1, 2.4),
  nrow = 4,
  byrow = TRUE
)

colnames(rates) <- c("1Y", "5Y", "10Y")
rownames(rates) <- c("Year1", "Year2", "Year3", "Year4")
rates
##        1Y  5Y 10Y
## Year1 1.2 1.5 1.8
## Year2 1.4 1.7 2.0
## Year3 1.6 1.9 2.2
## Year4 1.8 2.1 2.4
  1. Copy, paste and run the code above to create a matrix of rates.

  2. Extract the 5-year rate in Year 2.

  3. Extract all rates for Year 3.

  4. Extract the full column corresponding to the 10-year rate.

  5. Compute the average rate for each maturity using extraction.


Exercise 3.3 An insurance company records information on a small portfolio of policies:

portfolio <- data.frame(
  PolicyID = 1:6,
  Age = c(34, 45, 29, 52, 41, 37),
  SumAssured = c(100000, 150000, 80000, 200000, 120000, 110000),
  Premium = c(520, 780, 430, 1100, 640, 590)
)

portfolio
##   PolicyID Age SumAssured Premium
## 1        1  34     100000     520
## 2        2  45     150000     780
## 3        3  29      80000     430
## 4        4  52     200000    1100
## 5        5  41     120000     640
## 6        6  37     110000     590
  1. Copy, paste and run the code above to create a data frame for this insurance portfolio.

  2. Extract the Age column using two different methods.

  3. Extract the data corresponding to the third policyholder.

  4. Extract the SumAssured AND Premium columns as a data frame.

  5. Compute the variance of the premiums using extraction.


Exercise 3.4 A pension fund summary is stored below:

fund <- list(
  returns = c(0.04, 0.06, 0.02, 0.05, 0.03),
  weights = c(0.3, 0.25, 0.2, 0.15, 0.1),
  assets = c("Equities", "Bonds", "Property", "Infrastructure", "Cash")
)
fund
## $returns
## [1] 0.04 0.06 0.02 0.05 0.03
## 
## $weights
## [1] 0.30 0.25 0.20 0.15 0.10
## 
## $assets
## [1] "Equities"       "Bonds"          "Property"       "Infrastructure"
## [5] "Cash"
  1. Copy, paste and run the code above to create a list containing the summary for this pension fund.

  2. Extract the returns vector.

  3. Extract the asset names.

  4. Extract the first three portfolio weights.

  5. Compute the weighted return of the first three assets using extraction and vectorised multiplication.

3.7 Applied exercises

Exercise 3.5 The table below shows the daily log-returns of four different assets: Google, Apple, Sony and Samsung.

| Google | Apple | Sony | Samsung |
|---|---|---|---|
| 0.007 | 0.004 | 0.019 | 0.015 |
| 0.017 | 0.005 | -0.014 | -0.019 |
| -0.011 | -0.005 | 0.019 | -0.005 |
| -0.003 | 0.011 | -0.009 | -0.008 |
| 0.024 | 0.004 | -0.006 | -0.016 |
| 0.025 | 0.001 | 0.004 | -0.007 |
| -0.004 | -0.001 | 0.002 | 0.001 |
  1. Create 4 separate vectors containing the daily log-returns for each asset.

  2. Calculate the corresponding (true) daily returns of each asset and save these as new vectors.

  3. Plot the daily returns of all 4 assets on the same plot. Add a title, axis labels and plot each asset in different colours. You may have to manually set the y-axis limits. Try it without and see why.

  4. Create a character vector containing the 4 asset names: Google, Apple, Sony and Samsung.

  5. Create a matrix of the true daily returns, with each column corresponding to a different asset, and include the asset names at the top of each column.

  6. Assume you have a portfolio consisting of 20% Google, 30% Apple, 40% Sony and 10% Samsung assets. Using matrices, calculate the daily returns of your portfolio.

  7. Add the portfolio returns to your previous plot. Make this a different colour and use a different line type. Comment on your findings.

  8. Finally, extract the first row of the matrix corresponding to the returns from day 1. Calculate the min and max returns on this day.