4  Data Transformation with dplyr (Part 1)

4.1 Learning Objectives

By the end of this chapter, you should be able to:

  • Filter rows using filter()
  • Sort rows using arrange()
  • Select columns using select()
  • Create or modify columns using mutate()
  • Combine multiple transformations using the base R pipe |>

4.2 Introduction

This chapter follows R for Data Science (Ch. 3) and introduces dplyr, a tidyverse package for data transformation.
We will use the nycflights13::flights dataset for examples.


4.3 Working with Rows

4.3.1 filter()

filter() keeps rows that match given conditions.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)

flights |>
  filter(month == 1, day == 1)
# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 832 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

What’s a tibble? See Appendix C: Tidyverse and Tibbles

4.3.2 arrange()

arrange() orders rows by a column.

flights |>
  arrange(desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

4.3.3 In-Class Exercise 1 – Rows

Using the flights dataset:

  1. Filter for flights departing from JFK in July.
  2. Arrange by arrival delay (largest to smallest).
  3. Identify the flight with the worst delay.

4.4 Working with Columns

4.4.1 select()

select() chooses columns.

flights |>
  select(year, month, day, dep_delay, arr_delay)
# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows

4.4.2 mutate()

mutate() creates or modifies columns.

flights |>
  mutate(speed = distance / air_time * 60) |>
  select(tailnum, distance, air_time, speed)
# A tibble: 336,776 × 4
   tailnum distance air_time speed
   <chr>      <dbl>    <dbl> <dbl>
 1 N14228      1400      227  370.
 2 N24211      1416      227  374.
 3 N619AA      1089      160  408.
 4 N804JB      1576      183  517.
 5 N668DN       762      116  394.
 6 N39463       719      150  288.
 7 N516JB      1065      158  404.
 8 N829AS       229       53  259.
 9 N593JB       944      140  405.
10 N3ALAA       733      138  319.
# ℹ 336,766 more rows

4.4.3 In-Class Exercise 2 – Columns

  1. Select carrier, flight, dep_delay, and arr_delay.
  2. Create a column gain = arr_delay - dep_delay.
  3. Display the first 10 rows.

4.5 Using Pipes to Combine Steps

The base R pipe |> passes results from one function to the next, making code easier to read.

flights |>
  filter(month == 6, origin == "JFK") |>
  select(carrier, flight, dep_delay, arr_delay) |>
  mutate(gain = arr_delay - dep_delay) |>
  arrange(desc(gain)) |>
  head()
# A tibble: 6 × 5
  carrier flight dep_delay arr_delay  gain
  <chr>    <int>     <dbl>     <dbl> <dbl>
1 B6        2402        -2       142   144
2 DL         706        -3       138   141
3 AA         181        -2       132   134
4 DL        1394       224       350   126
5 B6          83        36       160   124
6 DL         161       278       400   122

4.5.1 In-Class Exercise 3 – Pipes

Chain these steps using |>:

  1. Filter flights from JFK in June.
  2. Select carrier, flight, dep_delay, arr_delay.
  3. Create a column gain.
  4. Arrange by largest gain and show the top 5.

4.6 Homework Preview

For Homework, you will:

  • Use flights or another dataset.
  • Filter for a subset of interest.
  • Create at least two new variables with mutate().
  • Sort using arrange().
  • Save the transformed dataset and inspect it with glimpse() and summary().

Render to PDF and submit on Canvas.


4.7 Next Steps

Next week, we will extend these skills with group_by() and summarize() to calculate grouped summaries.