10 Transform: Logical Vectors and Numbers

What are the types of variables we see in data frames, and what are the different tools we can use to work with them?

10.1 Learning Objectives

By the end of this chapter, you should be able to:

Understand how logical vectors work in R
Use logical conditions to filter and manipulate data

10.2 Logical Vectors

Logical vectors contain only TRUE, FALSE, or NA.

x <- c(TRUE, FALSE, TRUE, NA)
x

[1]  TRUE FALSE  TRUE    NA

10.2.1 Logical comparisons create logical vectors:

nums <- c(2, 5, 8, 1)
nums > 4

[1] FALSE  TRUE  TRUE FALSE

You can use these directly with functions like sum() and mean():

sum(nums > 4)   # Count how many values are > 4

[1] 2

mean(nums > 4)  # Proportion of values > 4

[1] 0.5
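This works because R converts logical values to numbers in arithmetic: TRUE becomes 1 and FALSE becomes 0. The sketch below reuses the nums vector from above; the name above_four is introduced here only for illustration.

```r
nums <- c(2, 5, 8, 1)
above_four <- nums > 4      # hypothetical name for the logical vector

as.integer(above_four)      # 0 1 1 0
sum(above_four)             # adds up the 1s: 2
mean(above_four)            # 2 out of 4 values: 0.5
```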

10.2.2 In-Class Exercise 1 – Logical Conditions

Create a numeric vector with 10 random values.
How many values are greater than the mean?
What proportion is above the mean?

# use sample() to create a random vector (here, a shuffle of the numbers 1 to 10):
rands <- sample(x = c(1:10), size = 10)
sum(rands > mean(rands)) # number of values above the mean

[1] 5

mean(rands > mean(rands)) # proportion of values above the mean

[1] 0.5

10.3 Logical Operations

Combine logical vectors with & (and), | (or), and ! (not):

a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b # and

[1]  TRUE FALSE FALSE

a | b # or

[1] TRUE TRUE TRUE

!a # not

[1] FALSE  TRUE FALSE
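The same operators are usually applied to the results of comparisons rather than to hand-typed TRUE/FALSE values. A minimal sketch, reusing the nums vector from earlier in the chapter:

```r
nums <- c(2, 5, 8, 1)

in_range <- nums > 1 & nums < 8    # both conditions must hold
extreme  <- nums < 2 | nums > 5    # either condition may hold
not_big  <- !(nums > 4)            # flips each TRUE/FALSE

in_range   # TRUE TRUE FALSE FALSE
extreme    # FALSE FALSE TRUE TRUE
not_big    # TRUE FALSE FALSE TRUE
```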

10.3.1 In-Class Exercise 2 – Combining Conditions

Using the mpg dataset, create a logical condition for cars with hwy > 30 and cyl == 4.
How many such cars exist?

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

mpg |>
  filter(hwy > 30 & cyl == 4) |>
  nrow()

[1] 22

10.4 Comparisons

The following call to filter() finds all daytime departures that arrive roughly on time:

library(nycflights13)
flights |> 
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)

# A tibble: 172,286 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
 4  2013     1     1      606            610        -4      858            910
 5  2013     1     1      606            610        -4      837            845
 6  2013     1     1      607            607         0      858            915
 7  2013     1     1      611            600        11      945            931
 8  2013     1     1      613            610         3      925            921
 9  2013     1     1      615            615         0      833            842
10  2013     1     1      622            630        -8     1017           1014
# ℹ 172,276 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The underlying logical variables can be made visible with mutate():

flights |> 
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
    .keep = "used"
  )

# A tibble: 336,776 × 4
   dep_time arr_delay daytime approx_ontime
      <int>     <dbl> <lgl>   <lgl>        
 1      517        11 FALSE   TRUE         
 2      533        20 FALSE   FALSE        
 3      542        33 FALSE   FALSE        
 4      544       -18 FALSE   TRUE         
 5      554       -25 FALSE   FALSE        
 6      554        12 FALSE   TRUE         
 7      555        19 FALSE   TRUE         
 8      557       -14 FALSE   TRUE         
 9      557        -8 FALSE   TRUE         
10      558         8 FALSE   TRUE         
# ℹ 336,766 more rows

In other words, the first filter is equivalent to:

flights |> 
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |> 
  filter(daytime & approx_ontime)

# A tibble: 172,286 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
 4  2013     1     1      606            610        -4      858            910
 5  2013     1     1      606            610        -4      837            845
 6  2013     1     1      607            607         0      858            915
 7  2013     1     1      611            600        11      945            931
 8  2013     1     1      613            610         3      925            921
 9  2013     1     1      615            615         0      833            842
10  2013     1     1      622            630        -8     1017           1014
# ℹ 172,276 more rows
# ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, daytime <lgl>,
#   approx_ontime <lgl>

10.4.1 Floating point comparison

What happens when you use == with numbers? Consider the results of these two expressions:

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x

[1] 1 2

Looks like 1 and 2.

Now look:

x == c(1, 2)

[1] FALSE FALSE

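Both comparisons return FALSE. Computers store numbers with a fixed number of digits, so there is no way to represent 1/49 or sqrt(2) exactly, and calculations that use them carry a small rounding error. Printing more digits reveals it: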

print(x, digits = 16)

[1] 0.9999999999999999 2.0000000000000004

So == is risky with floating-point numbers. Instead, you can use dplyr::near(), which ignores small differences:

near(x, c(1, 2))

[1] TRUE TRUE
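The idea behind a tolerance-based comparison can be sketched in base R: two numbers are treated as equal when their absolute difference is below a small threshold. The threshold used here matches the documented default of near(); the names target, tol and close_enough are made up for this illustration.

```r
x <- c(1 / 49 * 49, sqrt(2)^2)
target <- c(1, 2)

tol <- .Machine$double.eps^0.5          # roughly 1.5e-8
close_enough <- abs(x - target) < tol   # compare the difference, not the values

close_enough    # TRUE TRUE
```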

10.4.2 Missing values

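Missing values spread easily in R because almost any operation involving an unknown value is itself unknown. For example:

NA > 5

[1] NA

By the same logic, comparing a known value with NA also gives NA:

10 == NA

[1] NA

Even comparing NA with itself gives NA, because two unknown values may or may not be equal: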

NA == NA

[1] NA

Suppose you want to find the flights where dep_time is missing. You might be tempted to try this:

flights |>
  filter(dep_time == NA)

# A tibble: 0 × 19
# ℹ 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
#   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

But dep_time == NA evaluates to NA for every row, and filter() keeps only the rows where the condition is TRUE, so you get nothing. This is where is.na() comes in handy:

flights |> 
  filter(is.na(dep_time))

# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
 3  2013     1     1       NA           1500        NA       NA           1825
 4  2013     1     1       NA            600        NA       NA            901
 5  2013     1     2       NA           1540        NA       NA           1747
 6  2013     1     2       NA           1620        NA       NA           1746
 7  2013     1     2       NA           1355        NA       NA           1459
 8  2013     1     2       NA           1420        NA       NA           1644
 9  2013     1     2       NA           1321        NA       NA           1536
10  2013     1     2       NA           1545        NA       NA           1910
# ℹ 8,245 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
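The difference between == NA and is.na() is easier to see on a short vector. A minimal sketch with made-up departure times (dep is a hypothetical name, not a column of flights):

```r
dep <- c(517, NA, 542, NA)   # made-up values, two of them missing

dep == NA          # NA NA NA NA: never TRUE, so filter() would keep nothing
is.na(dep)         # FALSE TRUE FALSE TRUE
sum(is.na(dep))    # 2 missing values
```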

10.5 Boolean algebra

You can combine logical vectors using Boolean algebra.

For instance, given a data frame df with a column x, this will find all rows where x is not missing.

df |>
  filter(!is.na(x))

This will find all rows where x is smaller than -10 or bigger than 0.

df |>
  filter(x < -10 | x > 0)

10.5.1 Missing values

df <- tibble(x = c(TRUE, FALSE, NA))

df |>
  mutate(
    and = x & NA,
    or = x | NA
  )

# A tibble: 3 × 3
  x     and   or   
  <lgl> <lgl> <lgl>
1 TRUE  NA    TRUE 
2 FALSE FALSE NA   
3 NA    NA    NA

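To see why, think of NA as an unknown logical value. “TRUE or NA” evaluates to TRUE because at least one side is TRUE, whatever the unknown value turns out to be. “TRUE and NA” evaluates to NA because the answer depends on the unknown value. By the same reasoning, “FALSE and NA” is FALSE, while “FALSE or NA” is NA.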
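The full set of combinations can be tabulated with outer(), which applies an operator to every pair of values. This is only a sketch; the names vals, labels, and_table and or_table are introduced here for illustration.

```r
vals   <- c(TRUE, FALSE, NA)
labels <- c("TRUE", "FALSE", "NA")

# Rows are the left-hand value, columns the right-hand value
and_table <- outer(vals, vals, "&")
or_table  <- outer(vals, vals, "|")
dimnames(and_table) <- list(labels, labels)
dimnames(or_table)  <- list(labels, labels)

and_table["NA", "FALSE"]   # FALSE: one FALSE settles an "and"
or_table["NA", "TRUE"]     # TRUE: one TRUE settles an "or"
and_table["NA", "TRUE"]    # NA: the answer depends on the unknown value
```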

10.5.2 Order of operations and `%in%`

Note the difference between these two outputs:

flights |> 
   filter(month == 11 | month == 12)

# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
 3  2013    11     1      455            500        -5      641            651
 4  2013    11     1      539            545        -6      856            827
 5  2013    11     1      542            545        -3      831            855
 6  2013    11     1      549            600       -11      912            923
 7  2013    11     1      550            600       -10      705            659
 8  2013    11     1      554            600        -6      659            701
 9  2013    11     1      554            600        -6      826            827
10  2013    11     1      554            600        -6      749            751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

flights |> 
   filter(month == 11 | 12)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The second one doesn’t work properly: it does not mean “find all flights that departed in November or December.”

R first evaluates month == 11, which creates a logical vector, and then combines that vector with 12 using |. Any non-zero number counts as TRUE in a logical operation, so the result is TRUE for every row and the whole table is returned. Here is an illustration:

flights |> 
  mutate(
    nov = month == 11,
    final = nov | 12,
    .keep = "used"
  )

# A tibble: 336,776 × 3
   month nov   final
   <int> <lgl> <lgl>
 1     1 FALSE TRUE 
 2     1 FALSE TRUE 
 3     1 FALSE TRUE 
 4     1 FALSE TRUE 
 5     1 FALSE TRUE 
 6     1 FALSE TRUE 
 7     1 FALSE TRUE 
 8     1 FALSE TRUE 
 9     1 FALSE TRUE 
10     1 FALSE TRUE 
# ℹ 336,766 more rows
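The same behaviour can be reproduced without the flights table. In this sketch, month is a small made-up vector standing in for the month column:

```r
as.logical(12)    # TRUE: any non-zero number is treated as TRUE
as.logical(0)     # FALSE

month <- c(1, 11, 12)                      # toy stand-in for flights$month

buggy    <- (month == 11) | 12             # what month == 11 | 12 computes
intended <- month == 11 | month == 12

buggy       # TRUE TRUE TRUE
intended    # FALSE TRUE TRUE
```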

This is where %in% comes in. x %in% y outputs a logical vector the same length as x that is TRUE whenever a value in x is in y.

1:12 %in% c(1,5,11)

 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

letters[1:10] %in% c("a", "e", "i", "o", "u")

 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
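One more reason to prefer %in% over == when matching several values: == compares element by element and recycles the shorter vector, so it silently tests a different thing. A small sketch with a made-up month vector:

```r
month <- c(11, 12, 11, 12, 12, 11)   # made-up values

# == lines the vectors up position by position: 11, 12, 11, 12, 11, 12
recycled <- month == c(11, 12)

# %in% asks, for each element, whether it appears anywhere in the set
matched <- month %in% c(11, 12)

recycled   # TRUE TRUE TRUE TRUE FALSE FALSE
matched    # TRUE TRUE TRUE TRUE TRUE TRUE
```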

10.5.3 In-Class Exercise

Find all the flights in November and December with %in%

flights |>
  filter(month %in% c(11,12))

# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
 3  2013    11     1      455            500        -5      641            651
 4  2013    11     1      539            545        -6      856            827
 5  2013    11     1      542            545        -3      831            855
 6  2013    11     1      549            600       -11      912            923
 7  2013    11     1      550            600       -10      705            659
 8  2013    11     1      554            600        -6      659            701
 9  2013    11     1      554            600        -6      826            827
10  2013    11     1      554            600        -6      749            751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

10.6 In-Class Challenge

Find all flights where arr_delay is missing but dep_delay is not.
Find all flights where neither arr_time nor sched_arr_time is missing, but arr_delay is.

10.7 Summaries

10.7.1 Logical summaries `any()` and `all()`

any(x) returns TRUE if there are any TRUEs in x. It is the summary equivalent of |.

all(x) returns TRUE only if all values of x are TRUE. It is the summary equivalent of &.
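A short sketch of how the two functions behave, including what happens when a value is missing. The delays vector is made up for illustration:

```r
delays <- c(5, 70, NA, -3)    # made-up delays in minutes

any_over_hour <- any(delays > 60)                  # TRUE: one TRUE is enough
all_within    <- all(delays <= 60)                 # FALSE: 70 already fails
all_unknown   <- all(delays < 100)                 # NA: the missing value might not be < 100
all_known     <- all(delays < 100, na.rm = TRUE)   # TRUE once the NA is dropped
```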

10.7.2 In-Class Exercise

Use all() and any() to find out if every flight was delayed on departure by at most an hour or if any flights were delayed on arrival by five hours or more.

Using group_by allows us to do that by day:

flights |> 
  group_by(year, month, day) |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )

# A tibble: 365 × 5
    year month   day all_delayed any_long_delay
   <int> <int> <int> <lgl>       <lgl>         
 1  2013     1     1 FALSE       TRUE          
 2  2013     1     2 FALSE       TRUE          
 3  2013     1     3 FALSE       FALSE         
 4  2013     1     4 FALSE       FALSE         
 5  2013     1     5 FALSE       TRUE          
 6  2013     1     6 FALSE       FALSE         
 7  2013     1     7 FALSE       TRUE          
 8  2013     1     8 FALSE       FALSE         
 9  2013     1     9 FALSE       TRUE          
10  2013     1    10 FALSE       TRUE          
# ℹ 355 more rows

10.7.3 Numeric summaries of logical vectors

sum(x) gives the number of TRUEs.
mean(x) gives the proportion of TRUEs.
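Missing values affect these summaries too, which is why the exercises in this chapter pass na.rm = TRUE. A minimal sketch with a made-up logical vector called late:

```r
late <- c(TRUE, NA, FALSE, TRUE)   # made-up values

sum(late)                          # NA: the count is unknown
n_late    <- sum(late, na.rm = TRUE)    # 2
prop_late <- mean(late, na.rm = TRUE)   # 2 of the 3 known values
```

Note that with na.rm = TRUE the proportion is computed over the non-missing values only.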

10.7.4 In-Class Exercise

What is the proportion of flights that were delayed on departure by at most an hour, and what is the number of flights that were delayed on arrival by five hours or more?

flights |> 
  group_by(year, month, day) |> 
  summarize(
    proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    count_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )

# A tibble: 365 × 5
    year month   day proportion_delayed count_long_delay
   <int> <int> <int>              <dbl>            <int>
 1  2013     1     1              0.939                3
 2  2013     1     2              0.914                3
 3  2013     1     3              0.941                0
 4  2013     1     4              0.953                0
 5  2013     1     5              0.964                1
 6  2013     1     6              0.959                0
 7  2013     1     7              0.956                1
 8  2013     1     8              0.975                0
 9  2013     1     9              0.986                1
10  2013     1    10              0.977                2
# ℹ 355 more rows

10.7.5 Logical subsetting

You can use a logical vector to filter a single variable to a subset of interest. We can subset using [].
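As a sketch of the idea on a plain vector (delay is a made-up example, not the arr_delay column): the logical vector inside [] keeps the elements where it is TRUE. Positions where it is NA come back as NA, which is why na.rm = TRUE is still needed afterwards.

```r
delay <- c(12, -5, NA, 30, -8)   # made-up delays in minutes

late_only <- delay[delay > 0]    # 12 NA 30

mean_late  <- mean(delay[delay > 0], na.rm = TRUE)   # (12 + 30) / 2 = 21
mean_early <- mean(delay[delay < 0], na.rm = TRUE)   # (-5 + -8) / 2 = -6.5
```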

10.7.6 In-Class Exercise

Compute, for each day, the average delay for flights that arrived late and for flights that arrived early.

flights |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

# A tibble: 365 × 6
    year month   day behind ahead     n
   <int> <int> <int>  <dbl> <dbl> <int>
 1  2013     1     1   32.5 -12.5   842
 2  2013     1     2   32.0 -14.3   943
 3  2013     1     3   27.7 -18.2   914
 4  2013     1     4   28.3 -17.0   915
 5  2013     1     5   22.6 -14.0   720
 6  2013     1     6   24.4 -13.6   832
 7  2013     1     7   27.8 -17.0   933
 8  2013     1     8   20.8 -14.3   899
 9  2013     1     9   25.6 -13.0   902
10  2013     1    10   27.3 -16.4   932
# ℹ 355 more rows

10.8 Conditional Transformations

A lot of the time, we want to do one thing for condition x, and something different for condition y.

You can use conditional transformations with logical vectors to do this.

10.8.1 `if_else()`

There are four arguments to if_else():
1. condition: a logical vector.
2. true: gives the output when the condition is true.
3. false: gives the output when the condition is false.
4. missing (optional): gives the output when the condition is NA.

# assign a vector of integers between -3 and 3 to a variable, and tack an NA on the end of that vector:
x <- c(-3:3, NA)

Let’s use if_else to label a vector as either positive (“+ve”) or negative (“-ve”):

if_else(x > 0, "+ve", "-ve", "???")

[1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"

<>: notice something weird here? We’ll come back to it.

The true and false arguments do not have to be single values; they can also be vectors:

# If x is negative, return -x; otherwise return x.
# This behaves like abs().
if_else(x < 0, -x, x)

[1]  3  2  1  0  1  2  3 NA

You can also mix and match different vectors:

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)

[1] 3 1 2 6
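Filling in missing values from a second vector is common enough that dplyr also provides coalesce(), which returns the first non-missing value at each position. This sketch assumes dplyr is installed and reuses the x1 and y1 vectors from above:

```r
library(dplyr)

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)

filled <- coalesce(x1, y1)   # take x1 where available, otherwise y1
filled                       # 3 1 2 6
```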

10.8.2 `case_when`

This is a flexible way of performing different computations for different conditions.

The syntax is not something we are very used to in the tidyverse. It contains pairs of condition ~ output, where condition is a logical vector. When condition == TRUE, the output is used.

We use it because if_else can get ugly quickly. For instance, if we want to resolve the fact that zero is neither positive nor negative in our x vector above (the oddity flagged earlier), we have to do some fancy nesting:

if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")

[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

But case_when can resolve this quite easily:

x <- c(-3:3, NA)
case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)

[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"
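Two details of case_when are worth a sketch. Conditions are checked from top to bottom and the first one that is TRUE wins, so their order matters. Values that match no condition become NA unless a fallback is supplied; the .default argument used below assumes dplyr 1.1.0 or later. The vector sizes and the result names are made up for illustration.

```r
library(dplyr)

sizes <- c(5, 50, 500, NA)   # made-up values

# Broadest condition first: it matches everything, so later lines never apply
wrong_order <- case_when(
  sizes > 1   ~ "small",
  sizes > 10  ~ "medium",
  sizes > 100 ~ "large"
)

# Most specific condition first, plus a fallback for unmatched values
right_order <- case_when(
  sizes > 100 ~ "large",
  sizes > 10  ~ "medium",
  sizes > 1   ~ "small",
  .default = "unknown"
)

wrong_order   # "small" "small" "small" NA
right_order   # "small" "medium" "large" "unknown"
```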

10.8.3 In-Class Challenge

Use case_when to add a status column to the flights table containing human-readable labels that describe the arrival delays in the arr_delay column.
1. When arr_delay is missing, the flight is “cancelled”.
2. When arr_delay is more than 30 minutes early, it is “very early”.
3. When arr_delay is more than 15 minutes early, it is “early”.
4. When arr_delay is within 15 minutes of the scheduled time (early or late), it is “on time”.
5. When arr_delay is less than 60 minutes late, it is “late”.
6. When arr_delay is an hour or more but less than Inf (infinity) late, it is “very late”.

Hint: keep the “used” columns.

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used"
  )

# A tibble: 336,776 × 2
   arr_delay status 
       <dbl> <chr>  
 1        11 on time
 2        20 late   
 3        33 late   
 4       -18 early  
 5       -25 early  
 6        12 on time
 7        19 late   
 8       -14 on time
 9        -8 on time
10         8 on time
# ℹ 336,766 more rows

11 Next Steps

Next, you’ll learn how to manipulate and clean strings using the stringr package.