What are the types of variables we see in data frames, and what are the different tools we can use to work with them?
By the end of this chapter, you should be able to:
Logical vectors contain only TRUE, FALSE, or NA.
x <- c(TRUE, FALSE, TRUE, NA)
x
[1] TRUE FALSE TRUE NA
nums <- c(2, 5, 8, 1)
nums > 4
[1] FALSE TRUE TRUE FALSE
You can use these directly with functions like sum() and mean():
sum(nums > 4) # Count how many values are > 4
[1] 2
mean(nums > 4) # Proportion of values > 4
[1] 0.5
# use sample() to shuffle the integers 1 to 10 into a random order:
rands <- sample(x = c(1:10), size = 10)
sum(rands > mean(rands)) # number of values above the mean
[1] 5
mean(rands > mean(rands)) # proportion of values above the mean
[1] 0.5
Combine logical vectors with & (and), | (or), and ! (not):
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
a & b # and
[1] TRUE FALSE FALSE
a | b # or
[1] TRUE TRUE TRUE
!a # not
[1] FALSE TRUE FALSE
Using the mpg dataset, create a logical condition for cars with hwy > 30 and cyl == 4.
[1] 22
The following call to filter() finds all daytime departures that arrive roughly on time:
library(nycflights13)
flights |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
# A tibble: 172,286 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
4 2013 1 1 606 610 -4 858 910
5 2013 1 1 606 610 -4 837 845
6 2013 1 1 607 607 0 858 915
7 2013 1 1 611 600 11 945 931
8 2013 1 1 613 610 3 925 921
9 2013 1 1 615 615 0 833 842
10 2013 1 1 622 630 -8 1017 1014
# ℹ 172,276 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The underlying logical variables can be made visible with mutate():
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
    .keep = "used"
  )
# A tibble: 336,776 × 4
dep_time arr_delay daytime approx_ontime
<int> <dbl> <lgl> <lgl>
1 517 11 FALSE TRUE
2 533 20 FALSE FALSE
3 542 33 FALSE FALSE
4 544 -18 FALSE TRUE
5 554 -25 FALSE FALSE
6 554 12 FALSE TRUE
7 555 19 FALSE TRUE
8 557 -14 FALSE TRUE
9 557 -8 FALSE TRUE
10 558 8 FALSE TRUE
# ℹ 336,766 more rows
In other words, the first filter is equivalent to:
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |>
  filter(daytime & approx_ontime)
# A tibble: 172,286 × 21
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
4 2013 1 1 606 610 -4 858 910
5 2013 1 1 606 610 -4 837 845
6 2013 1 1 607 607 0 858 915
7 2013 1 1 611 600 11 945 931
8 2013 1 1 613 610 3 925 921
9 2013 1 1 615 615 0 833 842
10 2013 1 1 622 630 -8 1017 1014
# ℹ 172,276 more rows
# ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, daytime <lgl>,
# approx_ontime <lgl>
What happens when you use == with numbers? Check it out:
What are the outcomes of these two equations?
Looks like 1 and 2.
Now look:
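x == c(1, 2)
[1] FALSE FALSE
Both comparisons return FALSE. Computers store numbers with a fixed number of digits, so there is no way to represent 1/49 or sqrt(2) exactly, and calculations with them are slightly rounded:
print(x, digits = 16)
[1] 0.9999999999999999 2.0000000000000004
So == is risky with floating-point numbers. Instead, use dplyr::near(), which ignores small differences.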
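The definition of x is not shown above. The sketch below assumes it was built from the two calculations mentioned in the text, 1/49 * 49 and sqrt(2)^2, and compares with a tolerance in base R. This is the same idea as dplyr::near(), whose default tolerance is the square root of machine epsilon.

```r
# Assumed definition of x (not shown in the text):
# two results that "should" be exactly 1 and 2
x <- c(1 / 49 * 49, sqrt(2)^2)

# Exact comparison fails because of rounding error
exact <- x == c(1, 2)

# Tolerance-based comparison: values closer than `tol` count as equal
tol <- sqrt(.Machine$double.eps)
close_enough <- abs(x - c(1, 2)) < tol

print(exact)
print(close_enough)
```

With dplyr loaded, near(x, c(1, 2)) expresses the same check in one call.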
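Missing values are contagious in comparisons: almost any operation involving an unknown value is itself unknown. For example:
NA > 5
[1] NA
The same holds for equality:
10 == NA
[1] NA
Even comparing NA with itself gives NA, because two unknown values may or may not be equal:
NA == NA
[1] NA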
Suppose you want to print the flights where dep_time is missing. You might be tempted to write:
flights |>
  filter(dep_time == NA)
# A tibble: 0 × 19
# ℹ 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
# sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
But dep_time == NA gives NA for every row, and filter() drops rows where the condition is NA, so you get nothing. This is where is.na() comes in handy:
flights |>
  filter(is.na(dep_time))
# A tibble: 8,255 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 NA 1630 NA NA 1815
2 2013 1 1 NA 1935 NA NA 2240
3 2013 1 1 NA 1500 NA NA 1825
4 2013 1 1 NA 600 NA NA 901
5 2013 1 2 NA 1540 NA NA 1747
6 2013 1 2 NA 1620 NA NA 1746
7 2013 1 2 NA 1355 NA NA 1459
8 2013 1 2 NA 1420 NA NA 1644
9 2013 1 2 NA 1321 NA NA 1536
10 2013 1 2 NA 1545 NA NA 1910
# ℹ 8,245 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
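The difference between == NA and is.na() is easy to check on a small vector. The sketch below uses a made-up dep_time vector (hypothetical values, not taken from flights) to count and drop missing values.

```r
# Hypothetical departure times with two missing values
dep_time <- c(517, NA, 542, NA, 554)

# Comparing with == gives NA everywhere, so it cannot locate missing values
eq_na <- dep_time == NA

# is.na() returns TRUE exactly where a value is missing
missing <- is.na(dep_time)

n_missing <- sum(missing)       # how many values are missing
observed <- dep_time[!missing]  # keep only the non-missing values

print(eq_na)
print(missing)
print(n_missing)
print(observed)
```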
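You can combine logical vectors using Boolean algebra.
For instance, !is.na(x) is TRUE for every row where x is not missing.
This will find all rows where x is smaller than -10 or bigger than 0:
df |>
  filter(x < -10 | x > 0)
Missing values follow their own rules in Boolean algebra. In the table below, the and column is x & NA and the or column is x | NA:
# A tibble: 3 × 3
  x     and   or
  <lgl> <lgl> <lgl>
1 TRUE  NA    TRUE
2 FALSE FALSE NA
3 NA    NA    NA
This is because “TRUE and NA” evaluates to NA (the answer depends on the unknown value), while “TRUE or NA” evaluates to TRUE (because at least one of them is TRUE). By the same reasoning, “FALSE and NA” is FALSE, and “FALSE or NA” is NA.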
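These rules can be reproduced without a data frame. A minimal base R sketch, using the same three values of x as the table above:

```r
# The three possible logical values
x <- c(TRUE, FALSE, NA)

# "and" with an unknown: only FALSE is decided regardless of the unknown
and_na <- x & NA

# "or" with an unknown: only TRUE is decided regardless of the unknown
or_na <- x | NA

print(data.frame(x = x, and = and_na, or = or_na))
```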
%in%
Note the difference between these two outputs:
flights |>
  filter(month == 11 | month == 12)
# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
3 2013 11 1 455 500 -5 641 651
4 2013 11 1 539 545 -6 856 827
5 2013 11 1 542 545 -3 831 855
6 2013 11 1 549 600 -11 912 923
7 2013 11 1 550 600 -10 705 659
8 2013 11 1 554 600 -6 659 701
9 2013 11 1 554 600 -6 826 827
10 2013 11 1 554 600 -6 749 751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
flights |>
  filter(month == 11 | 12)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The second version does not do what it appears to. It does not mean “find all flights that departed in November or December.”
This is because month == 11 creates a logical vector, which is then combined with 12 using |. Any non-zero number counts as TRUE in a logical operation, so the result is TRUE for every row and the whole table is returned. Here is an illustration:
flights |>
  mutate(
    nov = month == 11,
    final = nov | 12,
    .keep = "used"
  )
# A tibble: 336,776 × 3
month nov final
<int> <lgl> <lgl>
1 1 FALSE TRUE
2 1 FALSE TRUE
3 1 FALSE TRUE
4 1 FALSE TRUE
5 1 FALSE TRUE
6 1 FALSE TRUE
7 1 FALSE TRUE
8 1 FALSE TRUE
9 1 FALSE TRUE
10 1 FALSE TRUE
# ℹ 336,766 more rows
This is where %in% comes in. x %in% y outputs a logical vector the same length as x that is TRUE whenever a value in x is in y.
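A small base R sketch of the three variants, using a made-up month vector (hypothetical values) that includes an NA to show how each form treats missing values:

```r
# Hypothetical month values, including one missing value
month <- c(1, 11, 12, 6, NA)

# %in% is TRUE where the value appears in the set, and never returns NA
in_nov_dec <- month %in% c(11, 12)

# The spelled-out comparison gives the same answer, except NA stays NA
or_version <- month == 11 | month == 12

# The mistaken form: 12 is treated as TRUE, so every element is TRUE
mistaken <- month == 11 | 12

print(in_nov_dec)
print(or_version)
print(mistaken)
```

Because %in% never returns NA, it also behaves differently from == when the set itself contains NA: c(1, NA) %in% NA is FALSE TRUE, while c(1, NA) == NA is NA NA.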
Find all the flights in November and December with %in%:
flights |>
  filter(month %in% c(11, 12))
# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
3 2013 11 1 455 500 -5 641 651
4 2013 11 1 539 545 -6 856 827
5 2013 11 1 542 545 -3 831 855
6 2013 11 1 549 600 -11 912 923
7 2013 11 1 550 600 -10 705 659
8 2013 11 1 554 600 -6 659 701
9 2013 11 1 554 600 -6 826 827
10 2013 11 1 554 600 -6 749 751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Find all flights where arr_delay is missing but dep_delay is not.
Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.
any() and all()
any(x) returns TRUE if any value in x is TRUE (the summary equivalent of |).
all(x) returns TRUE only if every value in x is TRUE (the summary equivalent of &).
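Like other summaries, any() and all() are affected by missing values, which is why na.rm = TRUE is often needed. A minimal base R sketch with a made-up delay vector (hypothetical values):

```r
# Hypothetical departure delays in minutes, one of them unknown
delays <- c(5, 30, NA)

# Without na.rm, the unknown value could change the answer, so the result is NA
all_short <- all(delays <= 60)
any_long <- any(delays >= 300)

# With na.rm = TRUE, missing values are dropped before summarising
all_short_known <- all(delays <= 60, na.rm = TRUE)
any_long_known <- any(delays >= 300, na.rm = TRUE)

print(c(all_short, any_long))
print(c(all_short_known, any_long_known))
```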
Use all() and any() to find out if every flight was delayed on departure by at most an hour or if any flights were delayed on arrival by five hours or more.
Using group_by allows us to do that by day:
flights |>
  group_by(year, month, day) |>
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 365 × 5
year month day all_delayed any_long_delay
<int> <int> <int> <lgl> <lgl>
1 2013 1 1 FALSE TRUE
2 2013 1 2 FALSE TRUE
3 2013 1 3 FALSE FALSE
4 2013 1 4 FALSE FALSE
5 2013 1 5 FALSE TRUE
6 2013 1 6 FALSE FALSE
7 2013 1 7 FALSE TRUE
8 2013 1 8 FALSE FALSE
9 2013 1 9 FALSE TRUE
10 2013 1 10 FALSE TRUE
# ℹ 355 more rows
sum(x) gives the number of TRUEs.
mean(x) gives the proportion of TRUEs.
What proportion of flights were delayed on departure by at most an hour, and how many flights were delayed on arrival by five hours or more?
flights |>
  group_by(year, month, day) |>
  summarize(
    proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    count_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 365 × 5
year month day proportion_delayed count_long_delay
<int> <int> <int> <dbl> <int>
1 2013 1 1 0.939 3
2 2013 1 2 0.914 3
3 2013 1 3 0.941 0
4 2013 1 4 0.953 0
5 2013 1 5 0.964 1
6 2013 1 6 0.959 0
7 2013 1 7 0.956 1
8 2013 1 8 0.975 0
9 2013 1 9 0.986 1
10 2013 1 10 0.977 2
# ℹ 355 more rows
You can use a logical vector to filter a single variable to a subset of interest. We can subset using [].
Compute the average delay for flights that arrived late (behind) and the average for flights that arrived early (ahead).
flights |>
  group_by(year, month, day) |>
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
# A tibble: 365 × 6
year month day behind ahead n
<int> <int> <int> <dbl> <dbl> <int>
1 2013 1 1 32.5 -12.5 842
2 2013 1 2 32.0 -14.3 943
3 2013 1 3 27.7 -18.2 914
4 2013 1 4 28.3 -17.0 915
5 2013 1 5 22.6 -14.0 720
6 2013 1 6 24.4 -13.6 832
7 2013 1 7 27.8 -17.0 933
8 2013 1 8 20.8 -14.3 899
9 2013 1 9 25.6 -13.0 902
10 2013 1 10 27.3 -16.4 932
# ℹ 355 more rows
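The same logical subsetting can be tried on a small vector. In this base R sketch (made-up values), note that subsetting with a condition that is NA keeps an NA in the result, which is why na.rm = TRUE is still needed:

```r
# Hypothetical arrival delays in minutes, one of them unknown
arr_delay <- c(10, -5, NA, 30, -20)

# A condition that is NA yields an NA element in the subset
late <- arr_delay[arr_delay > 0]

# Average delay of late flights and of early flights, ignoring the unknown value
behind <- mean(arr_delay[arr_delay > 0], na.rm = TRUE)
ahead <- mean(arr_delay[arr_delay < 0], na.rm = TRUE)

print(late)
print(behind)
print(ahead)
```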
A lot of the time, we want to do one thing for condition x, and something different for condition y.
You can use conditional transformations with logical vectors to do this.
if_else()
There are four arguments to if_else:
1. condition: a logical vector.
2. true: gives the output when the condition is true.
3. false: gives the output when the condition is false.
4. missing (optional): gives the output when the condition is NA.
# assign a vector of integers between -3 and 3 to a variable, and tack an NA on the end of that vector:
x <- c(-3:3, NA)
Let’s use if_else to label a vector as either positive (“+ve”) or negative (“-ve”):
if_else(x > 0, "+ve", "-ve", "???")
[1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"
The true and false arguments do not have to be single values; they can also be vectors:
# If x is negative, return -x; otherwise return x.
# This is kind of like abs().
if_else(x < 0, -x, x)
[1] 3 2 1 0 1 2 3 NA
You can also mix and match different vectors:
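The example for this step is not shown, so here is a minimal sketch with two made-up vectors, primary and backup (hypothetical names and values). It uses if_else() to take a value from backup wherever primary is missing.

```r
library(dplyr)

# Hypothetical vectors: each has gaps that the other can fill
primary <- c(NA, 1, 2, NA)
backup <- c(3, NA, 4, 6)

# Where primary is missing, take the value from backup; otherwise keep primary
filled <- if_else(is.na(primary), backup, primary)

print(filled)
```

The condition, true, and false arguments are matched element by element, so all three vectors need compatible lengths and types.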
case_when
This is a flexible way of performing different computations for different conditions.
The syntax looks different from most of the tidyverse. It consists of pairs of condition ~ output, where condition is a logical vector. The conditions are checked in order, and the output of the first condition that is TRUE is used.
The reason we use it is that nested if_else() calls get ugly quickly. For instance, zero is neither positive nor negative, yet the if_else() call above labels the 0 in our x vector as “-ve”. Handling it with if_else() requires nesting one call inside another, which gives:
[1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"
But case_when can resolve this quite easily:
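The code for this comparison is not shown, so the following is a sketch of how both versions might look, using the x vector defined earlier. The exact calls are an assumption, chosen so that both produce the output printed above.

```r
library(dplyr)

x <- c(-3:3, NA)

# Nested if_else(): handle zero first, then fall back to a second if_else()
nested <- if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")

# case_when(): one condition ~ output pair per case, checked top to bottom
labelled <- case_when(
  x == 0 ~ "0",
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  is.na(x) ~ "???"
)

print(nested)
print(labelled)
```

Both give the same labels, but each case_when() pair reads as a separate rule, so adding another case does not add another level of nesting.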
Use case_when to add a status column containing human-readable labels to the flights table that describe arrival delays from the arr_delay column.
1. When arr_delay is missing, the flight is “cancelled”.
2. When arr_delay is more than 30 minutes early, it is “very early”.
3. When arr_delay is more than 15 minutes early, it is “early”.
4. When arr_delay is within 15 minutes of schedule, early or late, it is “on time”.
5. When arr_delay is less than 60 minutes late, it is “late”.
6. When arr_delay is an hour or more but less than Inf (infinity) late, it is “very late”.
Hint: use .keep = "used" to keep only the columns that were used.
flights |>
  mutate(
    status = case_when(
      is.na(arr_delay) ~ "cancelled",
      arr_delay < -30 ~ "very early",
      arr_delay < -15 ~ "early",
      abs(arr_delay) <= 15 ~ "on time",
      arr_delay < 60 ~ "late",
      arr_delay < Inf ~ "very late",
    ),
    .keep = "used"
  )
# A tibble: 336,776 × 2
arr_delay status
<dbl> <chr>
1 11 on time
2 20 late
3 33 late
4 -18 early
5 -25 early
6 12 on time
7 19 late
8 -14 on time
9 -8 on time
10 8 on time
# ℹ 336,766 more rows
Next, you’ll learn how to manipulate and clean strings using the stringr package.