13  Relational Data with dplyr Joins

13.1 Learning Objectives

By the end of this chapter, you should be able to:

  • Understand the concept of relational data and keys
  • Combine multiple datasets using different join functions
  • Use left_join(), inner_join(), full_join(), and semi_join()
  • Diagnose and handle join problems (missing keys, duplicates)
  • Apply joins in analysis workflows

13.2 What is Relational Data?

Relational data consists of multiple tables that can be linked by keys.

Example tables from nycflights13:

  • flights: flight information
  • airlines: airline names
  • airports: airport locations
  • planes: plane details
  • weather: weather data

13.3 Keys

  • Primary key: uniquely identifies each row in a table
    • Compound key: when multiple variables are needed.
  • Foreign key: column that matches a primary key in another table

Example: flights$carrier matches airlines$carrier.

13.3.1 Surrogate Keys

Sometimes primary keys may contain unique information but are not actually useful airports$altitude or airports$latitude may be unique, but that is not useful for data analysis of airline flights (for most people).

A good surrogate key would be to add rows names as unique identifiers:

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
flights2 <- flights |>
  mutate(id = row_number(), .before = 1)
flights2
# A tibble: 336,776 × 20
      id  year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1     1  2013     1     1      517            515         2      830
 2     2  2013     1     1      533            529         4      850
 3     3  2013     1     1      542            540         2      923
 4     4  2013     1     1      544            545        -1     1004
 5     5  2013     1     1      554            600        -6      812
 6     6  2013     1     1      554            558        -4      740
 7     7  2013     1     1      555            600        -5      913
 8     8  2013     1     1      557            600        -3      709
 9     9  2013     1     1      557            600        -3      838
10    10  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

13.4 Mutating joins with dplyr

dplyr join functions merge tables by keys.

13.4.1 left_join()

Keeps all rows from the first table:

library(tidyverse)
library(nycflights13)

flights |>
  left_join(airlines, by = "carrier") |>
  rename(airline_name = name) |>
  select(airline_name, carrier, flight) |>
  head()
# A tibble: 6 × 3
  airline_name           carrier flight
  <chr>                  <chr>    <int>
1 United Air Lines Inc.  UA        1545
2 United Air Lines Inc.  UA        1714
3 American Airlines Inc. AA        1141
4 JetBlue Airways        B6         725
5 Delta Air Lines Inc.   DL         461
6 United Air Lines Inc.  UA        1696

13.4.1.1 inner_join() keeps only matching rows.

13.4.1.2 full_join() keeps all rows from both tables.


13.5 Filtering joins

13.5.0.1 semi_join() and anti_join():


13.6 Joining Multiple Tables

You can chain joins to combine several datasets:

flights |>
  left_join(airlines, by = "carrier") |>
  rename(airline_name = name) |>
  left_join(airports, by = c("dest" = "faa")) |>
  select(airline_name, dest, arr_delay) |>
  head()
# A tibble: 6 × 3
  airline_name           dest  arr_delay
  <chr>                  <chr>     <dbl>
1 United Air Lines Inc.  IAH          11
2 United Air Lines Inc.  IAH          20
3 American Airlines Inc. MIA          33
4 JetBlue Airways        BQN         -18
5 Delta Air Lines Inc.   ATL         -25
6 United Air Lines Inc.  ORD          12

13.6.1 In-Class Exercise 2 – Multi-Table Joins

  1. Join flights with airports to add destination airport names.
  2. Summarize average arrival delay by airport.
  3. Which airport has the longest average delay?

13.7 Handling Join Problems

  • Missing keys → results in NA values
  • Duplicated keys → may create duplicate rows
  • Always check results with count() or distinct()

Example:

flights |>
  left_join(airlines, by = "carrier") |>
  rename(airline_name = name) |>
  count(carrier, airline_name)
# A tibble: 16 × 3
   carrier airline_name                    n
   <chr>   <chr>                       <int>
 1 9E      Endeavor Air Inc.           18460
 2 AA      American Airlines Inc.      32729
 3 AS      Alaska Airlines Inc.          714
 4 B6      JetBlue Airways             54635
 5 DL      Delta Air Lines Inc.        48110
 6 EV      ExpressJet Airlines Inc.    54173
 7 F9      Frontier Airlines Inc.        685
 8 FL      AirTran Airways Corporation  3260
 9 HA      Hawaiian Airlines Inc.        342
10 MQ      Envoy Air                   26397
11 OO      SkyWest Airlines Inc.          32
12 UA      United Air Lines Inc.       58665
13 US      US Airways Inc.             20536
14 VX      Virgin America               5162
15 WN      Southwest Airlines Co.      12275
16 YV      Mesa Airlines Inc.            601

13.8 In-Class Challenge – Join Workflow

  • Join flights with airlines and airports
  • Calculate average arrival delay by airline and destination
  • Arrange by delay and identify the worst-performing routes

13.9 Homework Preview

For the next homework, you will:

  • Combine at least two datasets using joins
  • Use at least two different join types (left_join(), inner_join(), etc.)
  • Handle missing data or duplicates appropriately
  • Produce a summary table and one visualization based on the joined data
  • Render to PDF and submit on Canvas

13.10 Next Steps

Next, we will introduce accessing data using spreadsheets, SQL databases, JSON, and web scraping.