5  Data Transformation with dplyr (Part 2)

5.1 Learning Objectives

By the end of this chapter, you should be able to:

  • Group data with group_by()
  • Compute summary statistics with summarize()
  • Use multiple summaries with grouped data
  • Combine multiple datasets using join functions
  • Practice chaining multiple verbs with the pipe |>

5.2 Grouped Summaries

Grouping allows you to calculate statistics per group.
We will use the nycflights13::flights dataset.

5.2.1 group_by() and summarize()

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)

flights |>
  group_by(carrier) |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE)
  )
# A tibble: 16 × 2
   carrier delay
   <chr>   <dbl>
 1 9E      16.7 
 2 AA       8.59
 3 AS       5.80
 4 B6      13.0 
 5 DL       9.26
 6 EV      20.0 
 7 F9      20.2 
 8 FL      18.7 
 9 HA       4.90
10 MQ      10.6 
11 OO      12.6 
12 UA      12.1 
13 US       3.78
14 VX      12.9 
15 WN      17.7 
16 YV      19.0 

5.3 Multiple Summaries

flights |>
  group_by(dest) |>
  summarize(
    count = n(),
    avg_delay = mean(arr_delay, na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 105 × 3
   dest  count avg_delay
   <chr> <int>     <dbl>
 1 ABQ     254      4.38
 2 ACK     265      4.85
 3 ALB     439     14.4 
 4 ANC       8     -2.5 
 5 ATL   17215     11.3 
 6 AUS    2439      6.02
 7 AVL     275      8.00
 8 BDL     443      7.05
 9 BGR     375      8.03
10 BHM     297     16.9 
# ℹ 95 more rows

5.3.1 In-Class Exercise 1 – Grouped Summaries

Using flights:

  1. Group by origin and calculate the average departure delay.
  2. Group by carrier and find the number of flights and average arrival delay.
  3. Which carrier has the highest average arrival delay?

5.4 Grouping with Multiple Variables

You can group by multiple columns at once.

flights |>
  group_by(origin, month) |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    .groups = "drop_last"
  )
# A tibble: 36 × 3
# Groups:   origin [3]
   origin month avg_delay
   <chr>  <int>     <dbl>
 1 EWR        1     14.9 
 2 EWR        2     13.1 
 3 EWR        3     18.1 
 4 EWR        4     17.4 
 5 EWR        5     15.4 
 6 EWR        6     22.5 
 7 EWR        7     22.0 
 8 EWR        8     13.5 
 9 EWR        9      7.29
10 EWR       10      8.64
# ℹ 26 more rows

5.4.1 In-Class Exercise 2 – Multiple Grouping

  1. Group by origin and carrier.
  2. Summarize with the average air_time.
  3. Arrange results to see which origin-carrier combination has the longest average flights.

5.5 Joining Datasets

dplyr provides functions to join tables by a common key:

Example using flights and airlines:

flights |>
  left_join(airlines, by = "carrier") |>
  select(name, carrier, flight) |>
  head()
# A tibble: 6 × 3
  name                   carrier flight
  <chr>                  <chr>    <int>
1 United Air Lines Inc.  UA        1545
2 United Air Lines Inc.  UA        1714
3 American Airlines Inc. AA        1141
4 JetBlue Airways        B6         725
5 Delta Air Lines Inc.   DL         461
6 United Air Lines Inc.  UA        1696

5.5.1 In-Class Exercise 3 – Joins

  1. Use left_join() to add airline names to the flights dataset.
  2. Use count() to find how many flights each airline operates.
  3. Arrange results by the number of flights.

5.6 Chaining with Pipes

We can combine group_by(), summarize(), and joins in a single pipeline.

flights |>
  left_join(airlines, by = "carrier") |>
  group_by(name) |>
  summarize(
    flights = n(),
    avg_delay = mean(dep_delay, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_delay))
# A tibble: 16 × 3
   name                        flights avg_delay
   <chr>                         <int>     <dbl>
 1 Frontier Airlines Inc.          685     20.2 
 2 ExpressJet Airlines Inc.      54173     20.0 
 3 Mesa Airlines Inc.              601     19.0 
 4 AirTran Airways Corporation    3260     18.7 
 5 Southwest Airlines Co.        12275     17.7 
 6 Endeavor Air Inc.             18460     16.7 
 7 JetBlue Airways               54635     13.0 
 8 Virgin America                 5162     12.9 
 9 SkyWest Airlines Inc.            32     12.6 
10 United Air Lines Inc.         58665     12.1 
11 Envoy Air                     26397     10.6 
12 Delta Air Lines Inc.          48110      9.26
13 American Airlines Inc.        32729      8.59
14 Alaska Airlines Inc.            714      5.80
15 Hawaiian Airlines Inc.          342      4.90
16 US Airways Inc.               20536      3.78

5.7 In-Class Challenge

Using the flights dataset:

  • Join airline names
  • Group by airline name
  • Summarize number of flights, average departure delay, and average arrival delay
  • Arrange by average arrival delay
  • Identify the airline with the longest delays

5.8 Homework Preview

For homework, extend your data transformation by:

  • Grouping data by at least one variable
  • Calculating at least two summary statistics
  • Joining an additional dataset (e.g., airlines, airports)
  • Rendering your results as a table in your PDF

5.9 Next Steps

Next week, we will explore tidy data principles and learn how to reshape datasets using tidyr.