Chapter 4 Conditionals and IF Statements
In R, conditional statements or arguments are used to compare or analyse values/data based on certain conditions. In general, this is done with the use of ‘relational operators’ (==, >, <, >=, <=, !=) and ‘logical operators’ (AND, OR, NOT).
4.1 Relational operators
The most basic of the ‘relational operators’ is the equality operator (==), which can be used to check if two objects (values, vectors, matrices etc.) are equal:
## [1] TRUE
## [1] TRUE
## [1] TRUE
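As a minimal sketch of this kind of check (the values are illustrative assumptions, not necessarily the inputs that produced the output above), equality comparisons on single values might look like this:

```r
# Illustrative values only (assumed for this sketch)
3 == 3               # TRUE: two equal numbers
(2 + 2) == 4         # TRUE: the left-hand expression is evaluated first
"apple" == "apple"   # TRUE: character strings can be compared as well
```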
This can also be performed on vectors on an element by element basis (as usual):
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
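A short sketch of the element-by-element comparison, assuming two hypothetical vectors v and w that differ only in their first entry:

```r
# v and w are assumed example vectors
v <- 1:10
w <- c(0, 2:10)   # same as v except for the first element

v == v   # TRUE in all ten positions
v == w   # FALSE in position 1, TRUE in positions 2 to 10
```

Each position of the result reports whether the two elements in that position are equal.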
Unsurprisingly, it also works on matrices on an element by element basis as well:
## [,1] [,2] [,3]
## [1,] 5 5 5
## [2,] 5 5 5
## [3,] 5 5 5
## [,1] [,2] [,3]
## [1,] FALSE FALSE FALSE
## [2,] FALSE TRUE FALSE
## [3,] FALSE FALSE FALSE
## [,1] [,2] [,3]
## [1,] 5 0 0
## [2,] 0 5 0
## [3,] 0 0 5
## [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
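The matrix case can be sketched in the same way. The matrices A, B and D below are assumptions chosen for illustration:

```r
# A, B and D are assumed example matrices
A <- matrix(5, nrow = 3, ncol = 3)     # every entry equals 5
B <- matrix(1:9, nrow = 3, ncol = 3)   # filled column by column, so B[2, 2] is 5

A == B      # TRUE only in position [2, 2]

D <- diag(5, 3)   # 5 on the diagonal, 0 elsewhere
D == t(D)         # all TRUE, since D is symmetric
```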
Notice that this equality operator uses a double equal sign (==) rather than a single =. This is due to the fact the single equality sign is already used for assignments (similar to <-). This can be confusing, can easily cause errors and is the main reason I always suggest using <- for variable assignment.
Conversely, you can use the not equal operator (!=) in a similar way:
## [1] TRUE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
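For instance, a minimal sketch with assumed values:

```r
# Illustrative values only (assumed for this sketch)
3 != 4    # TRUE: the two numbers differ

v <- 1:10
v != v    # FALSE in every position, since each element equals itself
```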
Note - In general, the (!) symbol negates any type of relational operator or Boolean value in R, e.g.
## [1] FALSE
## [1] TRUE
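A brief sketch of negation, using illustrative values, shows that ! flips both a plain Boolean value and the result of a comparison:

```r
!TRUE       # FALSE
!(3 > 5)    # TRUE, because 3 > 5 is FALSE

v <- c(1, 5, 10)   # assumed example vector
!(v > 4)           # TRUE FALSE FALSE: negation is applied element by element
```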
In a similar way, you should easily be able to understand how the rest of the relational operators work, i.e. (<, >, <=, >=). In the following example(s), I will introduce you to one of the many built-in data sets that come with R, namely mtcars; we will discuss data sets in more detail in the next few weeks.
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
| Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
| Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
| Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
| Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
| Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
| Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
| Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
| Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
| Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
| Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
| AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
| Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
| Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
| Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
| Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
| Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
| Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
| Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
## [20] 65 97 150 150 245 175 66 91 113 264 175 335 109
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [25] FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
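The two outputs above are consistent with the following sketch, which assumes that HP is simply the hp (horsepower) column of mtcars stored as a vector:

```r
# Assumption: HP holds the horsepower column of the built-in mtcars data set
HP <- mtcars$hp
HP          # the 32 horsepower values

HP > 200    # TRUE for each car with more than 200 horsepower
```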
What do you think will happen if we execute the code sum(HP>200) and mean(HP>200)? Have a think about this then check out the solution when you’re ready.
Solution
## [1] 7
## [1] 0.21875
In both of these cases, the conditional statement has produced a vector of TRUE and FALSE Boolean values. In R, these are treated as the values 1 and 0 respectively in arithmetic. Hence, sum() counts the number of TRUE values and mean() gives the proportion of TRUE values. This also gives you an idea of how powerful such simple lines of conditional code can be when used in the right way.
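To make the conversion explicit, here is a small sketch using a hypothetical logical vector called flags:

```r
# flags is an assumed example vector
flags <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(flags)   # 1 0 1 1
sum(flags)          # 3: the number of TRUE values
mean(flags)         # 0.75: the proportion of TRUE values
```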
Exercise 4.1 Can you create a vector of all square numbers from 1 to 100 and count how many of these values are divisible by 3? Moreover, can you determine what percentage of them are NOT divisible by 5?
In the next few weeks, we will look in more detail at how we can use these relational operators (along with the logical operators discussed below) to conditionally extract data/values from a data.frame. This is a very helpful skill to learn for data handling and manipulation.
4.2 Logical operators
‘Logical operators’ are used to check whether multiple conditions are all satisfied at the same time (AND) or whether at least one of them is satisfied (OR). The key to understanding how these work in R is understanding how logical operators work in theory.
Let us begin with the logical operator ‘AND’ which, in R, is denoted via & or && (I will explain the difference later). For an AND statement/condition to evaluate to TRUE, both conditions in the statement must be TRUE. That is, the condition on the left is TRUE ‘AND’ the condition on the right is TRUE
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] 3.141593
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
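A minimal sketch of AND, first on plain Boolean values and then on conditions built from relational operators (the specific conditions on pi are assumptions chosen for illustration):

```r
TRUE & TRUE     # TRUE: both sides are TRUE
TRUE & FALSE    # FALSE: the right side is FALSE
FALSE & FALSE   # FALSE

pi                  # 3.141593
pi > 3 & pi < 4     # TRUE: both conditions hold
pi > 3 & pi < 3.1   # FALSE: the second condition fails
```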
It is actually possible to have more than two arguments and include different relational operators as well. What do we think the following expression will evaluate to, TRUE or FALSE?
## [1] TRUE
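One expression of this kind, given purely as an assumed illustration, chains three conditions with different relational operators:

```r
# All three conditions must hold for the whole expression to be TRUE
pi > 3 & pi < 4 & pi != 3.14   # TRUE
```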
As with relational operators, logical operators can also be used in vector form, where the & operator evaluates on a term by term basis, e.g.
## [1] TRUE TRUE TRUE
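For example, a sketch with assumed vectors in which every pair of conditions is satisfied:

```r
# a is an assumed example vector
a <- c(1, 2, 3)

a < c(2, 3, 4) & a > 0   # TRUE TRUE TRUE: both conditions hold in each position
a < c(2, 3, 4) & a > 1   # FALSE TRUE TRUE: the second condition fails for 1
```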
In fact, this sort of logical/relational operation can also be computed on objects other than numerical values, e.g. ‘character strings’:
## [1] TRUE
## [1] FALSE
## [1] FALSE
c(1, 2, 3) < c(2, 3, 4) & "Red" == "Blue" # How has this worked? The left hand side is a 3 element vector but the right is a single logical element?
## [1] FALSE FALSE FALSE
## [1] TRUE FALSE TRUE
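A sketch of string comparison and of what happens when a length-3 logical vector meets a single logical value. The strings are illustrative assumptions, and lhs and rhs are hypothetical names:

```r
"Red" == "Red"     # TRUE
"Red" == "red"     # FALSE: string comparison is case-sensitive
"Red" == "Blue"    # FALSE

lhs <- c(1, 2, 3) < c(2, 3, 4)   # TRUE TRUE TRUE
rhs <- "Red" == "Blue"           # a single FALSE
lhs & rhs                        # FALSE FALSE FALSE
```

The single value on the right is recycled, i.e. reused for every element of the longer vector on the left, which is why the result has three elements.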
In contrast to & which evaluates on a term by term basis, the double && requires single values only! As such, it can catch some common errors.
## [1] FALSE
It is good practice to use double && unless you specifically want all elements considered element-wise.
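The difference can be sketched as follows, with assumed values for x and v. The behaviour of && on longer vectors depends on the R version: to the best of my knowledge it is an error from R 4.3.0 onwards, whereas older versions use only the first element, so the last call is wrapped in tryCatch() and its result is not shown.

```r
# x and v are assumed example values
x <- 5
x > 0 && x < 3      # FALSE: a single TRUE/FALSE answer

v <- c(1, 5, 10)
v > 0 & v < 6       # TRUE TRUE FALSE: one answer per element

# Version-dependent: an error in recent R, first element only in older R
attempt <- tryCatch(v > 0 && v < 6,
                    error = function(e) "error: && needs single values")
attempt
```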
The second logical operator is the so called OR operator, denoted by | and ||, which evaluates to TRUE as long as ‘at least one statement is TRUE’, e.g.
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
## [1] TRUE
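A minimal sketch of OR, again using illustrative values:

```r
TRUE | TRUE      # TRUE
TRUE | FALSE     # TRUE: one TRUE side is enough
FALSE | FALSE    # FALSE: neither side is TRUE

pi < 3 | pi > 3.1    # TRUE: the second condition holds

v <- c(1, 5, 10)     # assumed example vector
v < 2 | v > 8        # TRUE FALSE TRUE
```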
The same ideas as were discussed above for & work also for |, i.e. | evaluates element-wise, whilst || only works for single values.
Exercise 4.2 With all this in mind, how can we calculate the number of cars in the mtcars data set that have horsepower greater than 200, mpg at most 30, are automatic but do not have 6 cylinders?
Exercise 4.3 The set of data VADeaths contains the death rates (measured per 1000 population per year), in Virginia, USA, in 1940. The structure of this data set is a matrix (not a data frame) with the rows denoting age ranges and the columns sex/area.
- How can we find out this information (and possibly more) about the data set?
- Extract the two columns containing the female data, either together or separately.
- Using conditional arguments, determine how many age groups have a death rate larger than 20 for rural females and a death rate less than 30 for urban females.
4.3 Conditional extraction
So far in this chapter, we have seen how relational and logical operators return vectors of TRUE and FALSE values. One of the most powerful ideas in R is that these logical vectors can be used directly to extract data.
Recall from earlier chapters that elements of vectors, matrices and data frames can be extracted using square brackets [ ]. If the index inside the brackets is a logical vector of the same length, R will return only those elements corresponding to TRUE values.
4.3.1 Conditional extraction from vectors
Consider the following simple example:
## [1] TRUE FALSE TRUE FALSE FALSE TRUE
The condition x > 0 produces a logical vector. We can now use this condition directly inside square brackets:
## [1] 3 5 7
This extracts only the positive values of the vector x. In a similar way, we can extract negative values, values above a threshold, or values satisfying multiple conditions:
## [1] -1 -4
## [1] 3 5
This idea avoids the need for loops or IF statements (see below) and is one of the most important techniques for efficient data manipulation in R.
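The steps in this subsection can be sketched as follows. The vector x is an assumption chosen to be consistent with the outputs shown above, and the second condition in the last line (x < 6) is likewise assumed:

```r
# x is an assumed example vector
x <- c(3, -1, 5, 0, -4, 7)

x > 0                # TRUE FALSE TRUE FALSE FALSE TRUE
x[x > 0]             # 3 5 7: only the positive values
x[x < 0]             # -1 -4: only the negative values
x[x > 0 & x < 6]     # 3 5: positive values below 6
```

Note that 0 is returned by neither x[x > 0] nor x[x < 0], since both conditions are FALSE for it.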
4.3.2 Conditional extraction from data frames
The same principle applies to data frames. Suppose we want to extract only certain rows of a data frame based on conditions applied to one or more columns. Returning to the mtcars data set:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We can extract all cars with more than 200 horsepower as follows:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
Here:
- mtcars$hp > 200 creates a logical vector
- placing this in the row position extracts only rows where the condition is TRUE
- leaving the column position empty (,) keeps all variables
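A sketch of this extraction, storing the result in a hypothetical object called powerful:

```r
# Row position: logical condition; column position: left empty to keep all columns
powerful <- mtcars[mtcars$hp > 200, ]

nrow(powerful)       # 7 cars have more than 200 horsepower
rownames(powerful)   # the names of those cars
```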
We can also combine multiple conditions:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
This extracts cars that:
- have horsepower greater than 200 and
- have a manual transmission (am == 1)
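Combining the two conditions with & might look like the sketch below, where keep and fast_manual are hypothetical names. The last line additionally restricts the columns, which is an optional extra not used in the output above.

```r
# Both conditions must hold for a row to be kept
keep <- mtcars$hp > 200 & mtcars$am == 1
fast_manual <- mtcars[keep, ]

rownames(fast_manual)          # "Ford Pantera L" "Maserati Bora"

# Optional: keep only selected columns for the same rows
mtcars[keep, c("mpg", "hp")]
```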
Conditional extraction is a key building block for data analysis and will be used extensively in later chapters.
4.4 IF statements
‘IF’ Statements are extremely popular and powerful tools in programming that are used to execute certain commands, based on given conditions. In most cases, the conditions used within IF statements are built up from combinations of the relational and logical operators seen above.
In general, an IF statement has the following form:
if ( condition ){
command
} else {
command
}
To see how an IF statement works in practice, let us look at a simple example to check if a number is odd or even:
## [1] "This number is even"
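A minimal sketch of such a check, assuming x has been given an even value (the value 14 and the wording of the odd message are assumptions):

```r
x <- 14   # assumed example value

if (x %% 2 == 0) {
  # The remainder after dividing by 2 is zero, so x is even
  print("This number is even")
} else {
  print("This number is odd")
}
```

Changing x to an odd value such as 7 makes the condition FALSE, so the else branch runs instead.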
You can actually make the output even better in this example by asking it to print out the value of \(x\) that has been given by using the paste function paste(). Notice the variable \(x\) is not in quotation marks but the ‘words’ are.
x <- 14
if (x %% 2 == 0) {
  print(paste(x, "is an even number"))
} else {
  print(paste(x, "is an odd number"))
}
## [1] "14 is an even number"
This is quite a simple example but it is very possible to have more complicated and longer IF statements that contain more conditional possibilities. If this is the case, you can simply extend the IF statement by adding else if (two separate words in R) instead of just else. Finally, once you have covered all the conditions, you finish with else. For example:
x <- 7
if (x < 0) {
  print(paste(x, "is a negative number"))
} else if (x > 0) {
  print(paste(x, "is a positive number"))
} else {
  print(paste(x, "is zero"))
}
## [1] "7 is a positive number"
Exercise 4.4 Can you create an IF statement which tells you whether a number (x) is divisible by another number (y), where both x and y can be changed (not fixed)? Hint: Use the modulus operator %%.
Looking back at the previous two examples regarding even/odd and positive/negative numbers, we can actually combine these two statements by using logical operators within the IF conditions:
x <- 4
if (x < 0 & x %% 2 == 0) {
  print(paste(x, "is a negative even number"))
} else if (x < 0) {
  print(paste(x, "is a negative odd number"))
} else if (x > 0 & x %% 2 == 0) {
  print(paste(x, "is a positive even number"))
} else if (x > 0) {
  print(paste(x, "is a positive odd number"))
} else {
  print(paste(x, "is Zero"))
}
## [1] "4 is a positive even number"
In fact, you could do this in an alternative way by ‘nesting’ IF statements inside one another to make several ‘layers’. There is no right or wrong way to do this, but through experience you will see that either can be used depending on the situation.
x <- 3
if (x < 0) {
  if (x %% 2 == 0) {
    print(paste(x, "is a negative even number"))
  } else {
    print(paste(x, "is a negative odd number"))
  }
} else if (x > 0) {
  if (x %% 2 == 0) {
    print(paste(x, "is a positive even number"))
  } else {
    print(paste(x, "is a positive odd number"))
  }
} else {
  print(paste(x, "is Zero"))
}
## [1] "3 is a positive odd number"
What happens if we let \(x\) be a vector?
Note - What happens depends on your version of R. In versions before 4.2.0, the IF statement will technically work in the sense it will print something out (along with a warning), but it will not do quite what we expect. This is because in an IF statement, the condition or ‘test statement’ can only be a single element and thus R will only consider the first element of the vector. From R 4.2.0 onwards, a condition of length greater than one produces an error instead. With this in mind, it is important to note that if you use a logical operator in an IF statement, it is always best to use the double version, i.e. && or ||.
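The issue can be sketched as follows, with an assumed vector x. Because the outcome differs between R versions (to my knowledge, an error from R 4.2.0 onwards and a warning before that), the IF statement is wrapped in tryCatch() and no particular result is claimed:

```r
x <- c(1, 2, 3)        # assumed example vector
cond <- x %% 2 == 0    # a logical vector, not a single value
length(cond)           # 3

# Version-dependent: recent R stops with an error, older R uses cond[1] only
result <- tryCatch(
  if (cond) "Even" else "Odd",
  error = function(e) "error: the condition has length > 1"
)
result
```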
That being said, it is possible to bypass such a problem using the ifelse() function. The ifelse() function allows us to create an IF statement which only has one condition but can be applied to a vector element-wise.
## [1] "Odd" "Even" "Odd"
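A minimal sketch of ifelse(), assuming the vector x below. The first argument is the condition, the second is the value used where the condition is TRUE and the third is the value used where it is FALSE:

```r
x <- c(1, 2, 3)   # assumed example vector

labels <- ifelse(x %% 2 == 0, "Even", "Odd")
labels            # "Odd" "Even" "Odd"
```

The result has the same length as the condition, with one label per element of x.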
Note - This only works for quite simple statements.
It is possible to use a more complicated IF statement on a vector as we tried above but to do so we have to introduce the idea of FOR loops, which we will discuss next week!
4.5 Exercises
Exercise 4.5 Create an R script that calculates the square root of a value, x. If the value contained in x is negative it should return NA as output.
Exercise 4.6 Create an R script that returns the maximum value out of the elements of a numeric vector of length 2 (two elements), without using the min, max or sort functions.
Exercise 4.7 Use the command x <- rexp(20, rate = 0.5) to create a vector containing 20 simulations of an exponential random variable with mean 2. Return the number of values that are larger than the mean of the vector x. You are allowed to use the mean() function.
Exercise 4.8 Create a vector containing the integers from 1 to 200.
(i) Extract all values that are divisible by 6.
(ii) Extract all values that are divisible by 6 but not divisible by 4.
(iii) Calculate the proportion of values in the original vector that satisfy the condition in part (ii).
Hint: Use the modulus operator %% and logical operators.
Exercise 4.9 The built-in data set airquality contains daily air quality measurements in New York.
Use ?airquality to inspect the structure and variables in the data set.
Extract all rows where the temperature (Temp) is above 85 degrees.
From this subset, extract only the ozone (Ozone) values that are not missing (NA).
Compute the mean ozone level for these hot days (excluding missing values).
Briefly comment on what this suggests about the relationship between temperature and ozone levels.
Hint: Missing values can be identified using the function is.na().
4.6 Applied exercises
Exercise 4.10 Before proceeding with this exercise, you first need to generate 1,000 random values, which will represent your data throughout the questions. To do this, use the code yearly.returns <- rbeta(1000, 5, 2) - 0.7.
The values you have generated represent 1000 yearly returns from an asset. Using this data:
Plot a histogram of the yearly returns for this asset.
Calculate the sample mean and sample standard deviation (s.d.) for the yearly returns.
The Sharpe Ratio is a measure of risk for a given asset calculated by comparing the mean returns to the risk-free rate of interest. That is, if we denote the mean return from an asset by \(r_A\), the standard deviation by \(\sigma_A\) and the risk-free rate of interest by \(r_f\), then the Sharpe ratio is given by \[SR = \frac{r_A-r_f}{\sigma_A}.\]
Given that the risk-free rate of interest is \(r_f=4\%\), calculate the Sharpe Ratio for this asset. Comment on your result.
Calculate the proportion of positive (gains) and negative (losses) yearly returns, respectively.
Calculate the proportion of yearly returns that are more than 2 s.d. away from the mean.
Calculate the mean yearly losses. HINT: You can extract elements from vectors/matrices using Boolean values, e.g. if x is a 2 element vector, then x[c(TRUE, FALSE)] will extract the first element but not the second.
Calculate the s.d. of the losses (downside risk) of the yearly returns. Given your answer in part 2., comment on this result.
The Sortino Ratio is another measure of risk for an asset but only takes into account the downside risk of an investment. That is, if we denote the downside risk (deviation) by \(\sigma_A^-\), then the Sortino ratio is given by \[SorR = \frac{r_A-r_f}{\sigma_A^-}.\]
- Given that the risk-free rate of interest \(r_f=4\%\), calculate the Sortino Ratio for this asset. Comment on the difference between this measure and the Sharpe Ratio.