7  Exploratory Data Analysis (EDA)

7.1 Learning Objectives

By the end of this chapter, you should be able to:

  • Understand the purpose of exploratory data analysis (EDA)
  • Visualize distributions of single variables
  • Examine relationships between variables
  • Detect patterns, clusters, and outliers
  • Use transformations to clarify patterns

7.2 Introduction to EDA

Exploratory Data Analysis (EDA) is about looking at your data to find patterns, spot anomalies, and guide your next steps.
We use ggplot2 to visualize both univariate distributions and bivariate relationships.

We will use the diamonds dataset, which ships with ggplot2 and records the price and attributes of roughly 54,000 diamonds.


7.3 Visualizing Single Variables

7.3.1 Categorical Variables

Use a bar chart (geom_bar()):

library(tidyverse)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
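
The bar heights are simply row counts per category. As a minimal sketch, the same numbers can be computed directly with dplyr's count(); the object name cut_counts is only illustrative:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Count the rows in each cut category; these counts are the bar heights
cut_counts <- diamonds |>
  count(cut)

cut_counts
```

Comparing this table with the chart is a quick way to confirm you are reading the bars correctly.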


7.3.2 Continuous Variables

Use a histogram (geom_histogram()):

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

You can also use geom_freqpoly(), which draws the same binned counts as a line instead of bars. For a smoothed density curve, use geom_density().
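
To see how a frequency polygon differs from a smoothed density curve, the two can be drawn side by side. This is a sketch only; the object names p_freq and p_dens are illustrative, and the binwidth simply reuses the value from the histogram above:

```r
library(ggplot2)

# Frequency polygon: binned counts joined by a line (same bins as the histogram)
p_freq <- ggplot(data = diamonds) +
  geom_freqpoly(mapping = aes(x = carat), binwidth = 0.5)

# Kernel density estimate: a smoothed curve scaled so the area under it is 1
p_dens <- ggplot(data = diamonds) +
  geom_density(mapping = aes(x = carat))

p_freq
p_dens
```

Note the different y-axes: the frequency polygon shows counts, while the density plot shows density.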


7.3.3 In-Class Exercise 1 – Single Variables

  1. Plot the distribution of color using a bar chart.
  2. Plot a histogram of price with a binwidth of 1000.
  3. What patterns or anomalies do you see?

7.4 Visualizing Relationships

7.4.1 Two Continuous Variables

A scatterplot shows how two continuous variables vary together:

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 0.3)

Setting alpha below 1 makes the points semi-transparent, which reduces overplotting where many points overlap.


7.4.2 Categorical vs. Continuous

A boxplot summarizes the distribution of a continuous variable within each level of a categorical variable:

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))
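
Each box is drawn from the quartiles of price within a cut. As a rough sketch, the same summary statistics can be computed with dplyr; the names price_by_cut, q1, q3, and iqr are chosen here for illustration:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Quartiles and interquartile range of price within each cut
price_by_cut <- diamonds |>
  group_by(cut) |>
  summarize(
    q1     = unname(quantile(price, 0.25)),  # lower edge of the box
    median = median(price),                  # line inside the box
    q3     = unname(quantile(price, 0.75)),  # upper edge of the box
    iqr    = IQR(price)                      # height of the box
  )

price_by_cut
```

Points that lie more than 1.5 times the IQR beyond a box edge are the ones geom_boxplot() draws individually.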


7.4.3 In-Class Exercise 2 – Relationships

  1. Create a scatterplot of carat vs price.
  2. Color the points by cut.
  3. Make a boxplot of price across diamond color categories.

7.5 Patterns and Outliers

Look for clusters, gaps, and unusual observations.
You can filter or highlight outliers.

Example: filter for diamonds priced above $15,000 and list the six most expensive:

diamonds |>
  filter(price > 15000) |>
  arrange(desc(price)) |>
  head()
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
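
Filtering removes the context around unusual observations, so it can also help to highlight them in place. The following is one possible sketch, reusing the 15000 threshold from the example above; the column name high_price and the colors are assumptions made for illustration:

```r
library(ggplot2)
library(dplyr)

# Flag high-priced diamonds instead of dropping everything else
flagged <- diamonds |>
  mutate(high_price = price > 15000)

# Draw all points, coloring the flagged ones differently
p_flagged <- ggplot(data = flagged) +
  geom_point(mapping = aes(x = carat, y = price, color = high_price), alpha = 0.3) +
  scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "red"))

p_flagged
```

Whether 15000 is a sensible cutoff depends on the question being asked; it is a threshold for exploration, not a formal outlier rule.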

7.5.1 Transformations

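Log transformations can reveal patterns in right-skewed data such as price. Here, scale_y_log10() puts the y-axis on a log scale without changing the underlying values.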

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  scale_y_log10()
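
Carat is also right-skewed, so a natural variation is to log-transform both axes. The sketch below does that and computes the correlation on the raw and log10 scales as a rough numeric check; the names p_loglog, cor_raw, and cor_log are illustrative:

```r
library(ggplot2)

# Log scale on both axes
p_loglog <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 0.3) +
  scale_x_log10() +
  scale_y_log10()

p_loglog

# Linear correlation before and after transforming both variables
cor_raw <- cor(diamonds$carat, diamonds$price)
cor_log <- cor(log10(diamonds$carat), log10(diamonds$price))

c(raw = cor_raw, log10 = cor_log)
```

Correlation only measures linear association, so treat it as a complement to the plot rather than a replacement for looking at it.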


7.5.2 In-Class Exercise 3 – Patterns and Transformations

  1. Identify any outliers in the diamonds dataset using filters.
  2. Apply a log transformation to price.
  3. Does the relationship between carat and price become clearer?

7.6 Combining EDA with dplyr

Use dplyr verbs such as filter(), mutate(), group_by(), and summarize() to prepare data for your plots and to compute summaries alongside them.

Example: average price per cut:

diamonds |>
  group_by(cut) |>
  summarize(mean_price = mean(price))
# A tibble: 5 × 2
  cut       mean_price
  <ord>          <dbl>
1 Fair           4359.
2 Good           3929.
3 Very Good      3982.
4 Premium        4584.
5 Ideal          3458.
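
A summary table like this can be piped straight into a plot. The sketch below also uses mutate() to derive a price-per-carat column before summarizing, since mean price alone does not account for diamond size; the names price_per_carat, cut_summary, and p_summary are hypothetical:

```r
library(ggplot2)
library(dplyr)

# Derive price per carat, then average it within each cut
cut_summary <- diamonds |>
  mutate(price_per_carat = price / carat) |>
  group_by(cut) |>
  summarize(
    mean_price           = mean(price),
    mean_carat           = mean(carat),
    mean_price_per_carat = mean(price_per_carat)
  )

cut_summary

# geom_col() plots the summarized values as given (no counting)
p_summary <- ggplot(data = cut_summary) +
  geom_col(mapping = aes(x = cut, y = mean_price_per_carat))

p_summary
```

Looking at mean_carat next to mean_price is one way to start investigating why the cut categories do not rank by price in the order you might expect.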

7.6.1 In-Class Challenge – EDA Workflow

  • Explore diamonds by:
    • Visualizing distributions of at least two variables
    • Plotting relationships between two variables
    • Detecting outliers
    • Applying a transformation to clarify a pattern

7.7 Homework Preview

For homework, you will:

  • Choose a dataset (e.g., diamonds or your own)
  • Create at least two univariate visualizations (bar chart, histogram)
  • Create at least two bivariate visualizations (scatterplot, boxplot)
  • Identify any patterns or outliers and describe them in text
  • Apply at least one transformation to improve visualization
  • Render to PDF and submit

7.8 Next Steps

Next week, we will dive into Tidy Data and learn how to reshape messy datasets using tidyr.