Chapter 4 Conditionals and IF Statements

In R, conditional statements are used to compare or analyse values/data based on certain conditions. In general, this is done using ‘relational operators’ (==, >, <, >=, <=, !=) and ‘logical operators’ (AND, OR, NOT).

4.1 Relational operators

The most basic of the ‘relational operators’ is the equality operator (==), which can be used to check if two objects (values, vectors, matrices etc.) are equal:

4 == 3+1
## [1] TRUE
5^2 == 25
## [1] TRUE
8 %% 5 == 3  # %% is the modulo (remainder) operator, i.e. 8 mod 5
## [1] TRUE
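
One caveat worth keeping in mind when using == on non-integer numbers: decimal values are stored with finite precision, so two quantities that are mathematically equal may not compare as exactly equal. The following is a minimal sketch of the issue, together with one common workaround (all.equal() wrapped in isTRUE()); the variable names are purely illustrative.

```r
# Decimal fractions are stored approximately, so an exact comparison can fail
exact_check <- 0.1 + 0.2 == 0.3
exact_check
## [1] FALSE

# all.equal() compares up to a small numerical tolerance;
# isTRUE() turns its result into a single TRUE/FALSE value
tolerant_check <- isTRUE(all.equal(0.1 + 0.2, 0.3))
tolerant_check
## [1] TRUE
```

For whole numbers, such as the examples above, == behaves exactly as expected.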

This can also be performed on vectors on an element by element basis (as usual):

1:10 == c(1,2,3,4,5,6,7,8,9,10)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
1:10 == c(0,2,3,4,5,6,7,8,9,10)
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Unsurprisingly, it also works on matrices on an element by element basis as well:

matrix(5, nrow = 3, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    5    5    5
## [2,]    5    5    5
## [3,]    5    5    5
matrix(1:9, nrow = 3) == matrix(5, nrow = 3, ncol = 3)
##       [,1]  [,2]  [,3]
## [1,] FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE
## [3,] FALSE FALSE FALSE
diag(5, nrow = 3, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    5    0    0
## [2,]    0    5    0
## [3,]    0    0    5
diag(5, nrow = 3, ncol = 3) == 5 * diag(1, nrow =3)
##      [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
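
An element-wise comparison returns one TRUE/FALSE per element. If a single overall answer is wanted instead, the base functions all() and any() can collapse the result. A short sketch using the same matrices as above (the names A, B and M are illustrative only):

```r
A <- diag(5, nrow = 3, ncol = 3)
B <- 5 * diag(1, nrow = 3)
M <- matrix(1:9, nrow = 3)

# Are ALL elements of A equal to the corresponding elements of B?
all_equal <- all(A == B)
all_equal
## [1] TRUE

# Is AT LEAST ONE element of M equal to 5?
any_five <- any(M == 5)
any_five
## [1] TRUE

# Are all elements of M equal to 5?
all_five <- all(M == 5)
all_five
## [1] FALSE
```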

Notice that the equality operator uses a double equals sign (==) rather than a single =. This is because the single equals sign is already used for assignment (similar to <-). This can be confusing and can easily cause errors, and it is the main reason I always suggest using <- for variable assignment.

Conversely, you can use the not equal operator (!=) in a similar way:

3 != 5
## [1] TRUE
seq(1, 10, by = 1) != 1:10
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Note - In general, the (!) symbol negates a Boolean value in R, including the result of any relational comparison, e.g.

!TRUE
## [1] FALSE
!FALSE
## [1] TRUE
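
As a small illustration of the note above, ! can also be applied to a whole logical vector or to a comparison wrapped in brackets. The vector v below is a made-up example.

```r
v <- c(TRUE, FALSE, TRUE)

# Negation works element by element on a logical vector
negated <- !v
negated
## [1] FALSE  TRUE FALSE

# Negating a comparison: "not (3 greater than 5)" is the same as "3 at most 5"
not_greater <- !(3 > 5)
not_greater
## [1] TRUE
3 <= 5
## [1] TRUE
```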

In a similar way, you should easily be able to understand how the rest of the relational operators work, i.e. <, >, <=, >=. In the following example(s), I will introduce you to one of the many pre-loaded data sets that form part of base R, namely mtcars; we will discuss data sets in more detail in the next few weeks.

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
(HP <- mtcars$hp)
##  [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
## [20]  65  97 150 150 245 175  66  91 113 264 175 335 109
HP > 200
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [25] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

What do you think will happen if we execute the code sum(HP>200) and mean(HP>200)? Have a think about this then check out the solution when you’re ready.

Solution
sum(HP > 200)
## [1] 7
mean(HP > 200)
## [1] 0.21875
In both of these cases, the conditional statement has produced a vector of TRUE and FALSE Boolean values. In R, these are treated as the values 1 and 0, respectively. Hence, sum() counts the number of TRUE values and mean() gives the proportion of TRUE values.

The above gives an example of how R understands the Boolean values (TRUE/FALSE) as 1 and 0, respectively, and also gives you an idea of how powerful such simple lines of conditional code can be when used in the right way.
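
To make the TRUE = 1 and FALSE = 0 convention concrete, here is a small sketch on a made-up vector (scores and the pass mark of 50 are hypothetical, chosen only for illustration). It also shows which(), a base function that returns the positions of the TRUE values.

```r
scores <- c(45, 72, 58, 90, 33)   # hypothetical values
passed <- scores >= 50

# Explicit conversion shows the underlying 1/0 coding
as.numeric(passed)
## [1] 0 1 1 1 0

# Number of TRUE values and proportion of TRUE values
n_passed <- sum(passed)
n_passed
## [1] 3
prop_passed <- mean(passed)
prop_passed
## [1] 0.6

# Positions (indices) of the TRUE values
pass_positions <- which(passed)
pass_positions
## [1] 2 3 4
```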

Exercise 4.1 Can you create a vector of all square numbers from 1 to 100 and count how many of these values are divisible by 3? Moreover, can you determine what percentage of them are NOT divisible by 5?

In Section 4.3 and over the next few weeks, we will look in more detail at how we can use these relational operators (along with the logical operators discussed below) to conditionally extract data/values from a data.frame. This is a very helpful skill to learn for data handling and manipulation.

4.2 Logical operators

‘Logical operators’ are used to check whether multiple conditions have been satisfied at the same time (AND) or whether at least one of them has been satisfied (OR). The key to understanding how these work in R is understanding how logical operators work in theory.

Let us begin with the logical operator ‘AND’ which, in R, is denoted by & or && (I will explain the difference later). For an AND statement/condition to evaluate to TRUE, both conditions in the statement must be TRUE. That is, the condition on the left is TRUE ‘AND’ the condition on the right is TRUE:

TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
FALSE & FALSE
## [1] FALSE
pi
## [1] 3.141593
pi > 3
## [1] TRUE
pi < 4
## [1] TRUE
pi > 3 & pi < 4
## [1] TRUE
5 < 10 & 5 < 3
## [1] FALSE

It is actually possible to have more than two conditions and to include different relational operators as well. What do you think the following expression will evaluate to, TRUE or FALSE?

pi > 0 & pi < 5 & !(pi %% 2 == 0)
## [1] TRUE

As with relational operators, logical operators can also be used in vector form, where the & operator evaluates on a term by term basis, e.g.

c(1,2,3) < c(2,3,4) & c(2,3,4) < c(3,4,5) # Think about this one a little!
## [1] TRUE TRUE TRUE

In fact, this sort of logical/relational operation can also be applied to objects other than numerical values, e.g. ‘character strings’:

"Red" == "Red"
## [1] TRUE
"Red" == "Blue"
## [1] FALSE
"Red" == "red"
## [1] FALSE
c(1, 2, 3) < c(2, 3, 4) & "Red" == "Blue" # How has this worked? The left hand side is a 3 element vector but the right is a single logical element?
## [1] FALSE FALSE FALSE
c(1, 2, 3) < c(2, 1, 4) & "Red" == "Red"
## [1]  TRUE FALSE  TRUE
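
The answer to the question posed in the comment above is R's recycling rule: when the two sides of & have different lengths, the shorter one is repeated until it matches the longer one. The sketch below (with illustrative names lhs and rhs) shows that the recycled result matches what we get if we repeat the single value by hand.

```r
lhs <- c(1, 2, 3) < c(2, 1, 4)   # TRUE FALSE TRUE
rhs <- "Red" == "Red"            # a single TRUE

# R silently recycles the single value on the right
recycled <- lhs & rhs
recycled
## [1]  TRUE FALSE  TRUE

# Writing out the repetition explicitly gives the same answer
explicit <- lhs & rep(rhs, times = length(lhs))
identical(recycled, explicit)
## [1] TRUE
```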

In contrast to &, which evaluates on a term by term basis, the double && requires single values only (from R 4.3.0 onwards, supplying a vector of length greater than one is an error). As such, it can catch some common errors.

5 > 1 && 5 < 3
## [1] FALSE

It is good practice to use double && unless you specifically want all elements considered element-wise.
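
A further difference, not mentioned so far, is that && and || evaluate from left to right and stop as soon as the answer is known, whereas & and | always evaluate both sides. The sketch below illustrates this using stop(), which would raise an error if it were ever evaluated; the variable names are illustrative only.

```r
# && stops as soon as the left-hand side is FALSE, so stop() is never called
and_result <- FALSE && stop("this is never evaluated")
and_result
## [1] FALSE

# || stops as soon as the left-hand side is TRUE
or_result <- TRUE || stop("this is never evaluated")
or_result
## [1] TRUE

# The single & evaluates both sides, so here the error IS raised;
# tryCatch() is used only to capture it and keep the script running
single_result <- tryCatch(FALSE & stop("evaluated"),
                          error = function(e) "error raised")
single_result
## [1] "error raised"
```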

The second logical operator is the so-called OR operator, denoted by | and ||, which evaluates to TRUE as long as ‘at least one statement is TRUE’, e.g.

TRUE | TRUE
## [1] TRUE
TRUE | FALSE
## [1] TRUE
FALSE | TRUE
## [1] TRUE
FALSE | FALSE
## [1] FALSE
F | F | T | F #etc.
## [1] TRUE
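
When AND and OR are mixed in one expression, the order of evaluation matters: in R, & is evaluated before | (much like multiplication before addition). A brief sketch, where brackets are used to make the intended grouping explicit:

```r
# & binds more tightly than |, so this is read as TRUE | (FALSE & FALSE)
no_brackets <- TRUE | FALSE & FALSE
no_brackets
## [1] TRUE

# Brackets change the grouping, and therefore the answer
with_brackets <- (TRUE | FALSE) & FALSE
with_brackets
## [1] FALSE
```

If in doubt, adding brackets costs nothing and makes the condition easier to read.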

The same ideas as were discussed above for & work also for |, i.e. | evaluates element-wise, whilst || only works for single values.

Exercise 4.2 With all this in mind, how can we calculate the number of cars in the mtcars data set that have horsepower greater than 200, mpg at most 30, are automatic but do not have 6 cylinders?

Exercise 4.3 The VADeaths data set contains the death rates (measured per 1000 population per year) in Virginia, USA, in 1940. This data set is stored as a matrix (not a data frame), with the rows denoting age ranges and the columns denoting sex/area.

  1. How can we find out this information (and possibly more) about the data set?
  2. Extract the two columns containing the female data, either together or separately.
  3. Using conditional arguments, determine how many age groups have a death rate larger than 20 for rural females and a death rate less than 30 for urban females.

4.3 Conditional extraction

So far in this chapter, we have seen how relational and logical operators return vectors of TRUE and FALSE values. One of the most powerful ideas in R is that these logical vectors can be used directly to extract data.

Recall from earlier chapters that elements of vectors, matrices and data frames can be extracted using square brackets [ ]. If the index inside the brackets is a logical vector of the same length, R will return only those elements corresponding to TRUE values.

4.3.1 Conditional extraction from vectors

Consider the following simple example:

x <- c(3, -1, 5, 0, -4, 7)
x > 0
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE

The condition x > 0 produces a logical vector. We can now use this condition directly inside square brackets:

x[x > 0]
## [1] 3 5 7

This extracts only the positive values of the vector x. In a similar way, we can extract negative values, values above a threshold, or values satisfying multiple conditions:

x[x < 0]
## [1] -1 -4
x[x >= 2 & x <= 6]
## [1] 3 5

This idea avoids the need for loops or IF statements (see below) and is one of the most important techniques for efficient data manipulation in R.
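
The same logical index can also appear on the left of an assignment, which replaces only the selected elements. The following is a minimal sketch using the vector x from above; the copy y is introduced here just so that x itself is left unchanged.

```r
x <- c(3, -1, 5, 0, -4, 7)

# Work on a copy so the original vector is preserved
y <- x

# Replace every negative value by 0, leaving the other values untouched
y[y < 0] <- 0
y
## [1] 3 0 5 0 0 7

# Counting the extracted values agrees with summing the logical vector
n_positive <- length(x[x > 0])
n_positive
## [1] 3
sum(x > 0)
## [1] 3
```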

4.3.2 Conditional extraction from data frames

The same principle applies to data frames. Suppose we want to extract only certain rows of a data frame based on conditions applied to one or more columns. Returning to the mtcars data set:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We can extract all cars with more than 200 horsepower as follows:

mtcars[mtcars$hp > 200, ]
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Ford Pantera L      15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
## Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8

Here:

  • mtcars$hp > 200 creates a logical vector
  • placing this in the row position extracts only rows where the condition is TRUE
  • leaving the column position empty (,) keeps all variables
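
Building on the points above, the column position does not have to be left empty. As a short sketch, the condition can be stored in a variable (here called powerful, a name chosen only for illustration) and combined with a vector of column names:

```r
# Store the logical vector once so it can be reused
powerful <- mtcars$hp > 200

# Keep only the rows where the condition is TRUE and only two of the columns
powerful_cars <- mtcars[powerful, c("mpg", "hp")]
powerful_cars

# The number of rows extracted matches the number of TRUE values
nrow(powerful_cars)
## [1] 7
sum(powerful)
## [1] 7
```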

We can also combine multiple conditions:

mtcars[mtcars$hp > 200 & mtcars$am == 1, ]
##                 mpg cyl disp  hp drat   wt qsec vs am gear carb
## Ford Pantera L 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
## Maserati Bora  15.0   8  301 335 3.54 3.57 14.6  0  1    5    8

This extracts cars that:

  • have horsepower greater than 200 and
  • have a manual transmission (am == 1)

Conditional extraction is a key building block for data analysis and will be used extensively in later chapters.
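
As an aside, base R also provides the subset() function, which performs the same kind of row extraction but lets the column names be written without the mtcars$ prefix. This is offered only as an alternative sketch; the square-bracket approach above is the one used in this chapter.

```r
# Square-bracket extraction, as above
by_brackets <- mtcars[mtcars$hp > 200 & mtcars$am == 1, ]

# The same rows using subset(); hp and am are looked up inside mtcars
by_subset <- subset(mtcars, hp > 200 & am == 1)

# Both approaches select the same cars
rownames(by_brackets)
## [1] "Ford Pantera L" "Maserati Bora"
rownames(by_subset)
## [1] "Ford Pantera L" "Maserati Bora"
```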

4.4 IF statements

‘IF’ statements are extremely popular and powerful tools in programming that are used to execute certain commands based on given conditions. In most cases, the conditions used within IF statements are built up from combinations of the relational and logical operators seen above.

In general, an IF statement has the following form:

if (condition) {
  command
} else {
  command
}
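
Two practical remarks on this general form. First, the else part is optional: if it is left out, nothing happens when the condition is FALSE. Second, when typing code at the top level of a script, else should sit on the same line as the closing bracket } of the if block, as in the form above. A minimal sketch of an IF statement without else, using a made-up value:

```r
value <- -3   # hypothetical input

# Only act when the condition is TRUE; otherwise leave value unchanged
if (value < 0) {
  value <- -value
}
value
## [1] 3
```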

To see how an IF statement works in practice, let us look at a simple example to check if a number is odd or even:

x <- 8

if (x %% 2 == 0){
  print("This number is even")
} else {
  print("This number is odd")
}
## [1] "This number is even"

You can make the output even better in this example by printing the value of \(x\) that has been given, using the paste() function. Notice that the variable \(x\) is not in quotation marks but the ‘words’ are.

x <- 14

if (x %% 2 == 0){
  print(paste(x, "is an even number"))
} else {
  print(paste(x,"is an odd number"))
}
## [1] "14 is an even number"

This is quite a simple example, but it is very possible to have longer, more complicated IF statements that contain more conditional possibilities. If this is the case, you can simply extend the IF statement by adding else if (two separate words) instead of just else. Finally, once you have listed all conditions, you finish with else. For example:

x <- 7

if (x < 0) {
  print(paste(x, "is a negative number"))
} else if (x > 0) {
  print(paste(x, "is a positive number"))
} else {
  print(paste(x, "is zero"))
}
## [1] "7 is a positive number"
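
It can also be useful to know that an IF statement in R returns the value of the branch that was executed, so the result can be stored in a variable rather than printed inside each branch. A short sketch of the same example written this way (sign_label is an illustrative name):

```r
x <- 7

# The value of the executed branch is assigned to sign_label
sign_label <- if (x < 0) {
  "negative"
} else if (x > 0) {
  "positive"
} else {
  "zero"
}

message_text <- paste(x, "is a", sign_label, "number")
print(message_text)
## [1] "7 is a positive number"
```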

Exercise 4.4 Can you create an IF statement which tells you whether a number (x) is divisible by another number (y), where both x and y can be changed (not fixed)? Hint: Use the modulus operator %%.

Looking back at the previous two examples regarding even/odd and positive/negative numbers, we can actually combine these two statements by using logical operators within the IF conditions:

x <- 4

if (x < 0 && x %% 2 == 0) {
  print(paste(x, "is a negative even number"))
} else if (x < 0) {
  print(paste(x, "is a negative odd number"))
} else if (x > 0 && x %% 2 == 0) {
  print(paste(x, "is a positive even number"))
} else if (x > 0) {
  print(paste(x, "is a positive odd number"))
} else {
  print(paste(x, "is zero"))
}
## [1] "4 is a positive even number"

In fact, you could do this an alternative way by ‘nesting’ IF statements inside one another to make several ‘layers’. There is no right or wrong way to do these but through experience you will see either can be used depending on the situation.

x <- 3

if (x < 0) {
  if (x %% 2 == 0) {
    print(paste(x, "is a negative even number"))
  } else {
    print(paste(x, "is a negative odd number"))
  }
} else if (x > 0) {
  if (x %% 2 == 0) {
    print(paste(x, "is a positive even number"))
  } else {
    print(paste(x, "is a positive odd number"))
  }
} else {
  print(paste(x, "is zero"))
}
## [1] "3 is a positive odd number"

What happens if we let \(x\) be a vector?

Note - An IF statement expects its condition (the ‘test statement’) to be a single TRUE or FALSE value, so passing a vector will not do what we expect. In versions of R before 4.2.0, only the first element of the vector was used (with a warning); from R 4.2.0 onwards, a condition of length greater than one is an error. With this in mind, if you use a logical operator in an IF statement, it is always best to use the double version, i.e. && or ||.
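
If the question being asked really concerns the vector as a whole, one option is to reduce the logical vector to a single value with any() or all() before it reaches the IF statement. A minimal sketch, with a made-up vector:

```r
x <- c(4, -2, 9)   # hypothetical values

# any() and all() each return a single TRUE/FALSE, which is what if() needs
has_negative <- any(x < 0)
all_even <- all(x %% 2 == 0)

if (has_negative) {
  print("At least one value is negative")
} else {
  print("No values are negative")
}
## [1] "At least one value is negative"

if (all_even) {
  print("Every value is even")
} else {
  print("Not every value is even")
}
## [1] "Not every value is even"
```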

That being said, it is possible to bypass such a problem using the ifelse() function. The ifelse() function allows us to create an IF statement which only has one condition but can be applied to a vector element-wise.

x <- c(1, 2, 3)
ifelse(x %% 2 == 0, "Even", "Odd")
## [1] "Odd"  "Even" "Odd"

Note - ifelse() is best suited to quite simple statements, such as a choice between two outcomes; more involved logic quickly becomes hard to read.
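
For reference, ifelse() calls can be placed inside one another to handle a third outcome, although readability suffers as more layers are added. The sketch below applies the negative/positive/zero classification from earlier to a made-up vector:

```r
x <- c(-2, 0, 5)   # hypothetical values

# The inner ifelse() is only used where the outer condition is FALSE
labels <- ifelse(x < 0, "Negative", ifelse(x > 0, "Positive", "Zero"))
labels
## [1] "Negative" "Zero"     "Positive"
```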

It is possible to use a more complicated IF statement on a vector as we tried above but to do so we have to introduce the idea of FOR loops, which we will discuss next week!

4.5 Exercises

Exercise 4.5 Create an R script that calculates the square root of a value, x. If the value contained in x is negative it should return NA as output.


Exercise 4.6 Create an R script that returns the maximum value out of the elements of a numeric vector of length 2 (two elements), without using the min, max or sort functions.


Exercise 4.7 Use the command x <- rexp(20, rate = 0.5) to create a vector containing 20 simulations of an exponential random variable with mean 2. Return the number of values that are larger than the mean of the vector x. You are allowed to use the mean() function.


Exercise 4.8 Create a vector containing the integers from 1 to 200.

  1. Extract all values that are divisible by 6.

  2. Extract all values that are divisible by 6 but not divisible by 4.

  3. Calculate the proportion of values in the original vector that satisfy the condition in part 2.

Hint: Use the modulus operator %% and logical operators.


Exercise 4.9 The built-in data set airquality contains daily air quality measurements in New York.

  1. Use ?airquality to inspect the structure and variables in the data set.

  2. Extract all rows where the temperature (Temp) is above 85 degrees.

  3. From this subset, extract only the ozone (Ozone) values that are not missing (NA).

  4. Compute the mean ozone level for these hot days (excluding missing values).

  5. Briefly comment on what this suggests about the relationship between temperature and ozone levels.

Hint: Missing values can be identified using the function is.na().

4.6 Applied exercises

Exercise 4.10 Before proceeding with this exercise, you first need to generate 1,000 random values which will represent your data throughout the questions. To do this, use the code yearly.returns <- rbeta(1000, 5, 2) - 0.7.

The values you have generated represent 1000 yearly returns from an asset. Using this data:

  1. Plot a histogram of the yearly returns for this asset.

  2. Calculate the sample mean and sample standard deviation (s.d.) for the yearly returns.

The Sharpe Ratio is a measure of risk for a given asset, calculated by comparing the mean return to the risk-free rate of interest, per unit of standard deviation. That is, if we denote the mean return from an asset by \(r_A\), the standard deviation by \(\sigma_A\) and the risk-free rate of interest by \(r_f\), then the Sharpe Ratio is given by \[SR = \frac{r_A-r_f}{\sigma_A}.\]

  3. Given that the risk-free rate of interest is \(r_f=4\%\), calculate the Sharpe Ratio for this asset. Comment on your result.

  4. Calculate the proportion of positive (gains) and negative (losses) yearly returns, respectively.

  5. Calculate the proportion of yearly returns that are larger than 2 s.d. away from the mean.

  6. Calculate the mean yearly losses. HINT: You can extract elements from vectors/matrices using Boolean values, e.g. if x is a 2 element vector, then x[c(TRUE, FALSE)] will extract the first element but not the second.

  7. Calculate the s.d. of the losses (downside risk) of the yearly returns. Given your answer in part 2, comment on this result.

The Sortino Ratio is another measure of risk for an asset but only takes into account the downside risk of an investment. That is, if we denote the downside risk (deviation) by \(\sigma_A^-\), then the Sortino Ratio is given by \[SorR = \frac{r_A-r_f}{\sigma_A^-}.\]

  8. Given that the risk-free rate of interest is \(r_f=4\%\), calculate the Sortino Ratio for this asset. Comment on the difference between this measure and the Sharpe Ratio.