Sometimes primary keys may contain unique information but are not actually useful airports$altitude or airports$latitude may be unique, but that is not useful for data analysis of airline flights (for most people).
A good surrogate key would be to add rows names as unique identifiers:
# A tibble: 6 × 3
airline_name carrier flight
<chr> <chr> <int>
1 United Air Lines Inc. UA 1545
2 United Air Lines Inc. UA 1714
3 American Airlines Inc. AA 1141
4 JetBlue Airways B6 725
5 Delta Air Lines Inc. DL 461
6 United Air Lines Inc. UA 1696
13.4.1.1inner_join() keeps only matching rows.
13.4.1.2full_join() keeps all rows from both tables.
13.5 Filtering joins
13.5.0.1semi_join() and anti_join():
semi_join(): keeps rows in first table with matches in second
flights|>left_join(airlines, by ="carrier")|>rename(airline_name =name)|>left_join(airports, by =c("dest"="faa"))|>select(airline_name, dest, arr_delay)|>head()
# A tibble: 6 × 3
airline_name dest arr_delay
<chr> <chr> <dbl>
1 United Air Lines Inc. IAH 11
2 United Air Lines Inc. IAH 20
3 American Airlines Inc. MIA 33
4 JetBlue Airways BQN -18
5 Delta Air Lines Inc. ATL -25
6 United Air Lines Inc. ORD 12
13.6.1 In-Class Exercise 2 – Multi-Table Joins
Join flights with airports to add destination airport names.
flights|>left_join(airlines, by ="carrier")|>rename(airline_name =name)|>count(carrier, airline_name)
# A tibble: 16 × 3
carrier airline_name n
<chr> <chr> <int>
1 9E Endeavor Air Inc. 18460
2 AA American Airlines Inc. 32729
3 AS Alaska Airlines Inc. 714
4 B6 JetBlue Airways 54635
5 DL Delta Air Lines Inc. 48110
6 EV ExpressJet Airlines Inc. 54173
7 F9 Frontier Airlines Inc. 685
8 FL AirTran Airways Corporation 3260
9 HA Hawaiian Airlines Inc. 342
10 MQ Envoy Air 26397
11 OO SkyWest Airlines Inc. 32
12 UA United Air Lines Inc. 58665
13 US US Airways Inc. 20536
14 VX Virgin America 5162
15 WN Southwest Airlines Co. 12275
16 YV Mesa Airlines Inc. 601
13.8 In-Class Challenge – Join Workflow
Join flights with airlines and airports
Calculate average arrival delay by airline and destination
Arrange by delay and identify the worst-performing routes
---title: "Relational Data with dplyr Joins"---## Learning ObjectivesBy the end of this chapter, you should be able to:- Understand the concept of relational data and keys- Combine multiple datasets using different join functions- Use `left_join()`, `inner_join()`, `full_join()`, and `semi_join()`- Diagnose and handle join problems (missing keys, duplicates)- Apply joins in analysis workflows------------------------------------------------------------------------## What is Relational Data?Relational data consists of **multiple tables** that can be linked by **keys**.Example tables from `nycflights13`:- `flights`: flight information- `airlines`: airline names- `airports`: airport locations- `planes`: plane details- `weather`: weather data------------------------------------------------------------------------## Keys- **Primary key**: uniquely identifies each row in a table - **Compound key:** when multiple variables are needed.\- **Foreign key**: column that matches a primary key in another tableExample: `flights$carrier` matches `airlines$carrier`.### Surrogate KeysSometimes primary keys may contain unique information but are not actually useful `airports$altitude` or `airports$latitude` may be unique, but that is not useful for data analysis of airline flights (for most people).A good surrogate key would be to add rows names as unique identifiers:```{r}library(tidyverse)library(nycflights13)flights2 <- flights |>mutate(id =row_number(), .before =1)flights2```------------------------------------------------------------------------## Mutating joins with dplyr`dplyr` join functions merge tables by keys.### `left_join()`Keeps all rows from the first table:```{r}library(tidyverse)library(nycflights13)flights |>left_join(airlines, by ="carrier") |>rename(airline_name = name) |>select(airline_name, carrier, flight) |>head()```------------------------------------------------------------------------#### `inner_join()` keeps only matching rows.#### `full_join()` keeps all rows from both tables.------------------------------------------------------------------------## Filtering joins#### `semi_join()` and `anti_join()`:- `semi_join()`: keeps rows in first table with matches in second\- `anti_join()`: keeps rows with no matches------------------------------------------------------------------------## Joining Multiple TablesYou can chain joins to combine several datasets:```{r}flights |>left_join(airlines, by ="carrier") |>rename(airline_name = name) |>left_join(airports, by =c("dest"="faa")) |>select(airline_name, dest, arr_delay) |>head()```------------------------------------------------------------------------### In-Class Exercise 2 – Multi-Table Joins1. Join `flights` with `airports` to add destination airport names.\2. Summarize average arrival delay by airport.\3. Which airport has the longest average delay?------------------------------------------------------------------------## Handling Join Problems- Missing keys → results in `NA` values\- Duplicated keys → may create duplicate rows\- Always check results with `count()` or `distinct()`Example:```{r}flights |>left_join(airlines, by ="carrier") |>rename(airline_name = name) |>count(carrier, airline_name)```------------------------------------------------------------------------## In-Class Challenge – Join Workflow- Join flights with airlines and airports\- Calculate average arrival delay by airline and destination\- Arrange by delay and identify the worst-performing routes------------------------------------------------------------------------## Homework PreviewFor the next homework, you will:- Combine at least two datasets using joins- Use at least two different join types (`left_join()`, `inner_join()`, etc.)- Handle missing data or duplicates appropriately- Produce a summary table and one visualization based on the joined data- Render to PDF and submit on Canvas------------------------------------------------------------------------## Next StepsNext, we will introduce **accessing data** using spreadsheets, SQL databases, JSON, and web scraping.