flights|>left_join(airlines, by ="carrier")|>select(name, carrier, flight)|>head()
# A tibble: 6 × 3
name carrier flight
<chr> <chr> <int>
1 United Air Lines Inc. UA 1545
2 United Air Lines Inc. UA 1714
3 American Airlines Inc. AA 1141
4 JetBlue Airways B6 725
5 Delta Air Lines Inc. DL 461
6 United Air Lines Inc. UA 1696
5.5.1 In-Class Exercise 3 – Joins
Use left_join() to add airline names to the flights dataset.
Use count() to find how many flights each airline operates.
---title: "Data Transformation with dplyr (Part 2)"---## Learning ObjectivesBy the end of this chapter, you should be able to:- Group data with `group_by()`- Compute summary statistics with `summarize()`- Use multiple summaries with grouped data- Combine multiple datasets using `join` functions- Practice chaining multiple verbs with the pipe `|>`------------------------------------------------------------------------## Grouped SummariesGrouping allows you to calculate statistics **per group**.\We will use the `nycflights13::flights` dataset.### `group_by()` and `summarize()````{r}library(tidyverse)library(nycflights13)flights |>group_by(carrier) |>summarize(delay =mean(dep_delay, na.rm =TRUE) )```## Multiple Summaries```{r}flights |>group_by(dest) |>summarize(count =n(),avg_delay =mean(arr_delay, na.rm =TRUE),.groups ="drop" )```------------------------------------------------------------------------### In-Class Exercise 1 – Grouped SummariesUsing `flights`:1. Group by `origin` and calculate the average departure delay.\2. Group by `carrier` and find the number of flights and average arrival delay.\3. Which carrier has the highest average arrival delay?------------------------------------------------------------------------## Grouping with Multiple VariablesYou can group by multiple columns at once.```{r}flights |>group_by(origin, month) |>summarize(avg_delay =mean(dep_delay, na.rm =TRUE),.groups ="drop_last" )```------------------------------------------------------------------------### In-Class Exercise 2 – Multiple Grouping1. Group by `origin` and `carrier`.\2. Summarize with the average `air_time`.\3. Arrange results to see which origin-carrier combination has the longest average flights.------------------------------------------------------------------------## Joining Datasets`dplyr` provides functions to join tables by a common key:- `left_join()`\- `inner_join()`\- `right_join()`\- `full_join()`Example using `flights` and `airlines`:```{r}flights |>left_join(airlines, by ="carrier") |>select(name, carrier, flight) |>head()```------------------------------------------------------------------------### In-Class Exercise 3 – Joins1. Use `left_join()` to add airline names to the flights dataset.\2. Use `count()` to find how many flights each airline operates.\3. Arrange results by the number of flights.------------------------------------------------------------------------## Chaining with PipesWe can combine `group_by()`, `summarize()`, and joins in a single pipeline.```{r}flights |>left_join(airlines, by ="carrier") |>group_by(name) |>summarize(flights =n(),avg_delay =mean(dep_delay, na.rm =TRUE),.groups ="drop" ) |>arrange(desc(avg_delay))```------------------------------------------------------------------------## In-Class ChallengeUsing the flights dataset:- Join airline names\- Group by airline name\- Summarize number of flights, average departure delay, and average arrival delay\- Arrange by average arrival delay\- Identify the airline with the longest delays------------------------------------------------------------------------## Homework PreviewFor homework, extend your data transformation by:- Grouping data by at least one variable\- Calculating at least two summary statistics\- Joining an additional dataset (e.g., airlines, airports)\- Rendering your results as a table in your PDF------------------------------------------------------------------------## Next StepsNext week, we will explore **tidy data principles** and learn how to reshape datasets using `tidyr`.