6 Tidy Data with tidyr

6.1 Learning Objectives

By the end of this chapter, you should be able to:

Explain why tidy data improves analysis and visualization
Reshape data between wide and long formats using pivot_longer() and pivot_wider()
Separate and unite columns using separate() and unite()
Apply tidying techniques to messy real-world datasets
Prepare datasets for use with dplyr and ggplot2

6.2 Why Tidy Data?

In Week 6, you performed EDA on datasets that were already in a usable format.
Real datasets are often messy. Tidy data makes it easy to:

Use ggplot2 for visualization
Use dplyr for summaries and transformations
Combine datasets with joins

Principles of Tidy Data (Hadley Wickham):
1. Each variable is a column
2. Each observation is a row
3. Each value is a cell

6.3 Pivoting: Long vs Wide

6.3.1 `pivot_longer()`

Converts wide data into long (tidy) format.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

table4a |>
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases")

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

6.3.2 `pivot_wider()`

Converts long data back into wide format.

table2 |>
  pivot_wider(names_from = type, values_from = count)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

6.3.3 In-Class Exercise 1 – Pivoting

Use pivot_longer() to convert table4a to long format.
Use pivot_wider() on table2 to create separate columns for type.
Which format is easier to use with ggplot2 and dplyr?

6.4 Separating and Uniting Columns

6.4.1 `separate()`

Splits a column into multiple columns.

table3 |>
  separate(rate, into = c("cases", "population"), sep = "/")

# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

6.4.2 `unite()`

Combines multiple columns into one.

table5 |>
  unite(new, century, year, sep = "")

# A tibble: 6 × 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

6.4.3 In-Class Exercise 2 – Separate and Unite

Use separate() to split the rate column in table3.
Use unite() to combine century and year into one column.

6.5 Tidying a Real Dataset

The who dataset is messy: column names encode multiple variables.

Example tidying workflow:

who |>
  pivot_longer(cols = starts_with("new"),
               names_to = "key",
               values_to = "cases",
               values_drop_na = TRUE) |>
  separate(key, into = c("type", "sex_age"), sep = "_") |>
  separate(sex_age, into = c("sex", "age"), sep = 1)

Warning: Expected 2 pieces. Additional pieces discarded in 73466 rows [1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

# A tibble: 76,046 × 8
   country     iso2  iso3   year type  sex   age   cases
   <chr>       <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
 1 Afghanistan AF    AFG    1997 new   s     p         0
 2 Afghanistan AF    AFG    1997 new   s     p        10
 3 Afghanistan AF    AFG    1997 new   s     p         6
 4 Afghanistan AF    AFG    1997 new   s     p         3
 5 Afghanistan AF    AFG    1997 new   s     p         5
 6 Afghanistan AF    AFG    1997 new   s     p         2
 7 Afghanistan AF    AFG    1997 new   s     p         0
 8 Afghanistan AF    AFG    1997 new   s     p         5
 9 Afghanistan AF    AFG    1997 new   s     p        38
10 Afghanistan AF    AFG    1997 new   s     p        36
# ℹ 76,036 more rows

6.5.1 In-Class Exercise 3 – WHO Dataset

Pivot who longer to create key and cases.
Separate key into multiple components.
Count total cases by country.
Which country has the highest reported cases?

6.6 Tidy Data Workflow

After tidying, you can:

Use ggplot2 for visualizations
Use group_by() and summarize() for summaries
Join with other datasets

6.7 Homework Preview

For homework, you will:

Take a messy dataset (e.g., table4a, table5, or your own)
Use pivot_longer() and/or pivot_wider() to reshape it
Use separate() and unite() as needed
Produce a tidy dataset and create one visualization and one grouped summary
Render to PDF and submit on Canvas