7 Exploratory Data Analysis (EDA)

7.1 Learning Objectives

By the end of this chapter, you should be able to:

- Understand the purpose of exploratory data analysis (EDA)
- Visualize distributions of single variables
- Examine relationships between variables
- Detect patterns, clusters, and outliers
- Use transformations to clarify patterns

7.2 Introduction to EDA

Exploratory Data Analysis (EDA) is about looking at your data to find patterns, spot anomalies, and guide your next steps.
We use ggplot2 to visualize both univariate distributions and bivariate relationships.

We will use the diamonds dataset.

7.3 Visualizing Single Variables

7.3.1 Categorical Variables

Use a bar chart (geom_bar()):

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
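
The bar heights are simply the number of rows in each category. As a quick check (a sketch added for illustration; the name cut_counts is not from the original notes), the same numbers can be computed with dplyr's count():

```r
library(tidyverse)

# Count the rows in each cut category; these counts are the bar heights
cut_counts <- diamonds |>
  count(cut)

cut_counts
```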

7.3.2 Continuous Variables

Use a histogram (geom_histogram()):

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

You can also use geom_freqpoly(), which draws the same bin counts as a line instead of bars. For a smoothed density curve, use geom_density().
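
As an illustration, the sketch below draws the carat distribution twice: once as a frequency polygon, which connects bin counts with a line, and once as a smoothed density estimate with geom_density(). The binwidth of 0.1 and the object names p_freq and p_dens are choices made for this example only.

```r
library(tidyverse)

# Frequency polygon: same binning as a histogram, drawn as a line
p_freq <- ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = carat), binwidth = 0.1)

# Kernel density estimate: a smoothed curve whose area integrates to 1
p_dens <- ggplot(data = diamonds) +
  geom_density(mapping = aes(x = carat))

p_freq
p_dens
```

The frequency polygon keeps the y-axis in counts, while the density curve rescales it, which can make groups of very different sizes easier to compare.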

7.3.3 In-Class Exercise 1 – Single Variables

1. Plot the distribution of color using a bar chart.
2. Plot a histogram of price with a binwidth of 1000.
3. What patterns or anomalies do you see?

7.4 Visualizing Relationships

7.4.1 Two Continuous Variables

Scatterplots show relationships:

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 0.3)

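Setting alpha below 1 makes the points semi-transparent, which reduces overplotting: regions where many points overlap appear darker.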

7.4.2 Categorical vs. Continuous

Boxplots compare the distribution of a continuous variable across the levels of a categorical one:

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))
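
A boxplot draws the median and quartiles of each group. As a hedged companion to the plot, roughly the same summaries can be computed directly with dplyr (geom_boxplot() may compute its hinges slightly differently). The column names q1, med, q3, and iqr are chosen for this sketch.

```r
library(tidyverse)

# Median and quartiles of price within each cut:
# approximately the numbers that the boxplot draws
price_by_cut <- diamonds |>
  group_by(cut) |>
  summarize(
    q1  = quantile(price, 0.25),
    med = median(price),
    q3  = quantile(price, 0.75),
    iqr = IQR(price)
  )

price_by_cut
```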

7.4.3 In-Class Exercise 2 – Relationships

1. Create a scatterplot of carat vs price.
2. Color the points by cut.
3. Make a boxplot of price across diamond color categories.

7.5 Patterns and Outliers

Look for clusters, gaps, and unusual observations.
You can filter or highlight outliers.

Example: keep only the diamonds priced above 15000, sort them from most to least expensive, and show the first six:

diamonds |>
  filter(price > 15000) |>
  arrange(desc(price)) |>
  head()

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
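
Filtering drops every other row, whereas highlighting keeps them for context. A minimal sketch of the highlighting approach, assuming the same 15000 threshold as above and a hypothetical helper column named high_price:

```r
library(tidyverse)

# Flag expensive diamonds instead of dropping everything else
flagged <- diamonds |>
  mutate(high_price = price > 15000)

# Color the flagged points so they stand out against the rest
p_flag <- ggplot(data = flagged) +
  geom_point(
    mapping = aes(x = carat, y = price, color = high_price),
    alpha = 0.3
  )

p_flag
```

The threshold is arbitrary; in practice you would pick it after looking at the distribution of price.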

7.5.1 Transformations

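Log transformations can reveal patterns in skewed data. Price is strongly right-skewed, so a log scale spreads out the crowded low values and compresses the long upper tail.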

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  scale_y_log10()
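
Carat is right-skewed as well, so one option worth trying (an addition to the example above, not a required step) is to put both axes on a log scale. The sketch also shows the alternative of transforming the columns with mutate(); the names log_carat and log_price are hypothetical.

```r
library(tidyverse)

# Log scale on both axes; tick labels stay in the original units
p_loglog <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 0.3) +
  scale_x_log10() +
  scale_y_log10()

# Alternative: transform the columns first, then plot the new columns
diamonds_log <- diamonds |>
  mutate(
    log_carat = log10(carat),
    log_price = log10(price)
  )

p_loglog
```

Scale functions keep the axis readable in dollars and carats, while transformed columns are convenient if you later want to fit a model to the logged values.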

7.5.2 In-Class Exercise 3 – Patterns and Transformations

1. Identify any outliers in the diamonds dataset using filters.
2. Apply a log transformation to price.
3. Does the relationship between carat and price become clearer?

7.6 Combining EDA with dplyr

Use dplyr verbs such as filter(), mutate(), and group_by() to subset, transform, or summarize the data before plotting it.

Example: average price per cut:

diamonds |>
  group_by(cut) |>
  summarize(mean_price = mean(price))

# A tibble: 5 × 2
  cut       mean_price
  <ord>          <dbl>
1 Fair           4359.
2 Good           3929.
3 Very Good      3982.
4 Premium        4584.
5 Ideal          3458.
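
A summary like this can be piped straight into a plot. A minimal sketch that draws the group means as bars with geom_col(); the object names mean_by_cut and p_mean are illustrative:

```r
library(tidyverse)

# Summarize first: one row per cut with its mean price
mean_by_cut <- diamonds |>
  group_by(cut) |>
  summarize(mean_price = mean(price))

# geom_col() uses the y values as given, unlike geom_bar(), which counts rows
p_mean <- ggplot(data = mean_by_cut) +
  geom_col(mapping = aes(x = cut, y = mean_price))

p_mean
```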

7.6.1 In-Class Challenge – EDA Workflow

Explore diamonds by:
- Visualizing distributions of at least two variables
- Plotting relationships between two variables
- Detecting outliers
- Applying a transformation to clarify a pattern

7.7 Homework Preview

For homework, you will:

- Choose a dataset (e.g., diamonds or your own)
- Create at least two univariate visualizations (bar chart, histogram)
- Create at least two bivariate visualizations (scatterplot, boxplot)
- Identify any patterns or outliers and describe them in text
- Apply at least one transformation to improve visualization
- Render to PDF and submit

7.8 Next Steps

Next week, we will dive into Tidy Data and learn how to reshape messy datasets using tidyr.
