5 Data Transformation with dplyr (Part 2)

5.1 Learning Objectives

By the end of this chapter, you should be able to:

Group data with group_by()
Compute summary statistics with summarize()
Use multiple summaries with grouped data
Combine multiple datasets using join functions
Practice chaining multiple verbs with the pipe |>

5.2 Grouped Summaries

Grouping allows you to calculate statistics per group.
We will use the nycflights13::flights dataset.

5.2.1 `group_by()` and `summarize()`

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(nycflights13)

flights |>
  group_by(carrier) |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE)
  )

# A tibble: 16 × 2
   carrier delay
   <chr>   <dbl>
 1 9E      16.7 
 2 AA       8.59
 3 AS       5.80
 4 B6      13.0 
 5 DL       9.26
 6 EV      20.0 
 7 F9      20.2 
 8 FL      18.7 
 9 HA       4.90
10 MQ      10.6 
11 OO      12.6 
12 UA      12.1 
13 US       3.78
14 VX      12.9 
15 WN      17.7 
16 YV      19.0

5.3 Multiple Summaries

flights |>
  group_by(dest) |>
  summarize(
    count = n(),
    avg_delay = mean(arr_delay, na.rm = TRUE),
    .groups = "drop"
  )

# A tibble: 105 × 3
   dest  count avg_delay
   <chr> <int>     <dbl>
 1 ABQ     254      4.38
 2 ACK     265      4.85
 3 ALB     439     14.4 
 4 ANC       8     -2.5 
 5 ATL   17215     11.3 
 6 AUS    2439      6.02
 7 AVL     275      8.00
 8 BDL     443      7.05
 9 BGR     375      8.03
10 BHM     297     16.9 
# ℹ 95 more rows

5.3.1 In-Class Exercise 1 – Grouped Summaries

Using flights:

Group by origin and calculate the average departure delay.
Group by carrier and find the number of flights and average arrival delay.
Which carrier has the highest average arrival delay?

5.4 Grouping with Multiple Variables

You can group by multiple columns at once.

flights |>
  group_by(origin, month) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    .groups = "drop_last"
  )

# A tibble: 36 × 3
# Groups:   origin [3]
   origin month avg_delay
   <chr>  <int>     <dbl>
 1 EWR        1     14.9 
 2 EWR        2     13.1 
 3 EWR        3     18.1 
 4 EWR        4     17.4 
 5 EWR        5     15.4 
 6 EWR        6     22.5 
 7 EWR        7     22.0 
 8 EWR        8     13.5 
 9 EWR        9      7.29
10 EWR       10      8.64
# ℹ 26 more rows

5.4.1 In-Class Exercise 2 – Multiple Grouping

Group by origin and carrier.
Summarize with the average air_time.
Arrange results to see which origin-carrier combination has the longest average flights.

5.5 Joining Datasets

dplyr provides functions to join tables by a common key:

Example using flights and airlines:

flights |>
  left_join(airlines, by = "carrier") |>
  select(name, carrier, flight) |>
  head()

# A tibble: 6 × 3
  name                   carrier flight
  <chr>                  <chr>    <int>
1 United Air Lines Inc.  UA        1545
2 United Air Lines Inc.  UA        1714
3 American Airlines Inc. AA        1141
4 JetBlue Airways        B6         725
5 Delta Air Lines Inc.   DL         461
6 United Air Lines Inc.  UA        1696

5.5.1 In-Class Exercise 3 – Joins

Use left_join() to add airline names to the flights dataset.
Use count() to find how many flights each airline operates.
Arrange results by the number of flights.

5.6 Chaining with Pipes

We can combine group_by(), summarize(), and joins in a single pipeline.

flights |>
  left_join(airlines, by = "carrier") |>
  group_by(name) |>
  summarize(
    flights = n(),
    avg_delay = mean(dep_delay, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_delay))

# A tibble: 16 × 3
   name                        flights avg_delay
   <chr>                         <int>     <dbl>
 1 Frontier Airlines Inc.          685     20.2 
 2 ExpressJet Airlines Inc.      54173     20.0 
 3 Mesa Airlines Inc.              601     19.0 
 4 AirTran Airways Corporation    3260     18.7 
 5 Southwest Airlines Co.        12275     17.7 
 6 Endeavor Air Inc.             18460     16.7 
 7 JetBlue Airways               54635     13.0 
 8 Virgin America                 5162     12.9 
 9 SkyWest Airlines Inc.            32     12.6 
10 United Air Lines Inc.         58665     12.1 
11 Envoy Air                     26397     10.6 
12 Delta Air Lines Inc.          48110      9.26
13 American Airlines Inc.        32729      8.59
14 Alaska Airlines Inc.            714      5.80
15 Hawaiian Airlines Inc.          342      4.90
16 US Airways Inc.               20536      3.78

5.7 In-Class Challenge

Using the flights dataset:

Join airline names
Group by airline name
Summarize number of flights, average departure delay, and average arrival delay
Arrange by average arrival delay
Identify the airline with the longest delays

5.8 Homework Preview

For homework, extend your data transformation by:

Grouping data by at least one variable
Calculating at least two summary statistics
Joining an additional dataset (e.g., airlines, airports)
Rendering your results as a table in your PDF

5.9 Next Steps

Next week, we will explore tidy data principles and learn how to reshape datasets using tidyr.