14  Accessing Data: Spreadsheets, Databases, Arrow, JSON, and Web Scraping

14.1 Learning Objectives

By the end of this lecture, you should be able to:

  • Import and work with data from Excel and Google Sheets
  • Connect to and query relational databases from R
  • Use Arrow to work efficiently with parquet files and large datasets
  • Access and tidy hierarchical JSON data
  • Perform basic web scraping to extract data from web pages

14.2 Spreadsheets (R4DS Chapter 20)

R can read Excel files with the readxl package and Google Sheets with the googlesheets4 package; both return tibbles, so the downstream workflow is the same.

14.2.1 Importing Excel

library(readxl)

excel_df <- read_excel("data/example.xlsx", sheet = "Sheet1")
head(excel_df)
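readxl can also tell you which worksheets a file contains and read just part of one. A minimal sketch (the range and column types below are placeholders, not properties of the course file):

# List the worksheets in the workbook before deciding what to read
excel_sheets("data/example.xlsx")

# Read a specific cell range and declare column types up front
# (range and col_types here are illustrative; adjust to your file)
excel_subset <- read_excel(
  "data/example.xlsx",
  sheet = "Sheet1",
  range = "A1:C10",
  col_types = c("text", "numeric", "numeric")
)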

14.2.2 Importing Google Sheets

library(googlesheets4)
sheet_url <- "https://docs.google.com/spreadsheets/d/your-sheet-id/edit#gid=0"
gs_df <- read_sheet(sheet_url)
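By default, read_sheet() prompts you to authenticate with Google. If the sheet is shared publicly ("anyone with the link can view"), you can skip authentication entirely; a minimal sketch:

# Skip Google authentication for publicly shared sheets
gs4_deauth()

gs_df <- read_sheet(sheet_url)
head(gs_df)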

14.2.3 In-Class Exercise 1 – Spreadsheets

  1. Read an Excel file from the course data folder.
  2. Load a Google Sheet you create (optional, requires authentication).
  3. Summarize one numeric column (a sketch follows this list).
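A sketch for step 3, assuming the Excel file has a numeric column named value (substitute a real column from your data):

library(dplyr)

# `value` is a hypothetical column name; replace it with one of yours
excel_df |>
  summarize(
    mean_value = mean(value, na.rm = TRUE),
    sd_value   = sd(value, na.rm = TRUE)
  )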

14.3 Databases (R4DS Chapter 21)

Use DBI together with a backend package such as RSQLite to interact with relational databases. With dbplyr installed, you can also query tables lazily using familiar dplyr verbs.

14.3.1 Example: Connecting to SQLite

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "data/mydb.sqlite")

# List tables
dbListTables(con)

# Read a table into R
flights_db <- dbReadTable(con, "flights")

# Or use dplyr (via dbplyr) to build the query lazily;
# putting head() before collect() translates to a SQL LIMIT
library(dplyr)
tbl(con, "flights") |> filter(dep_delay > 60) |> head() |> collect()
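You can also send SQL to the database directly with dbGetQuery():

# Run raw SQL and get a data frame back
delayed <- dbGetQuery(con, "SELECT * FROM flights WHERE dep_delay > 60 LIMIT 10")

# When you are completely finished with the database, close the connection:
# dbDisconnect(con)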

14.3.2 In-Class Exercise 2 – Databases

  1. Connect to the provided SQLite database.
  2. List tables with dbListTables().
  3. Query the flights table for flights delayed more than 2 hours (a sketch follows this list).
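A sketch for step 3, assuming dep_delay is recorded in minutes (as in the nycflights13 data):

# 2 hours = 120 minutes
tbl(con, "flights") |>
  filter(dep_delay > 120) |>
  collect()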

14.4 Arrow (R4DS Chapter 22)

The arrow package reads Parquet files lazily, so you can query datasets larger than memory without loading everything into R.

14.4.1 Example: Reading Parquet

library(arrow)

dataset <- open_dataset("data/large.parquet")
# head() before collect() keeps the row limit inside the Arrow query
dataset |> filter(column_x > 10) |> head() |> collect()
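Aggregations are pushed down to the Arrow engine as well, so only the summary reaches R. A sketch reusing the hypothetical column_x, with a made-up grouping column group_y:

# Only the aggregated result is materialized in R
dataset |>
  group_by(group_y) |>
  summarize(mean_x = mean(column_x, na.rm = TRUE)) |>
  collect()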

14.4.2 In-Class Exercise 3 – Arrow

  1. Open a parquet dataset using arrow::open_dataset().
  2. Run a filter and select query.
  3. Compare performance to reading the equivalent CSV (a sketch follows this list).
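A sketch for step 3, assuming an equivalent CSV file exists at the hypothetical path data/large.csv:

# Time the same query against the Parquet and CSV versions
system.time(
  open_dataset("data/large.parquet") |> filter(column_x > 10) |> collect()
)

system.time(
  open_dataset("data/large.csv", format = "csv") |> filter(column_x > 10) |> collect()
)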

14.5 Hierarchical Data (R4DS Chapter 23)

Hierarchical data, such as JSON, often contains nested lists. Use jsonlite to load it, then tidyr's unnest_wider() and unnest_longer() to rectangle it into a flat tibble.

14.5.1 Example: Reading JSON

library(jsonlite)

json_data <- fromJSON("data/example.json")
str(json_data)
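Note that fromJSON() simplifies nested structures into vectors and data frames wherever it can. For the tidyr rectangling workflow below, it is often easier to keep the raw nested lists with read_json():

# read_json() returns plain nested lists, which tidyr can rectangle
json_list <- read_json("data/example.json")
str(json_list, max.level = 2)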

14.5.2 Flattening Nested Data

library(tidyr)
nested_df <- tibble(
  id = 1,
  details = list(tibble(city = "NYC", temp = 75))
)

nested_df |> unnest_wider(details)
# A tibble: 1 × 3
     id city   temp
  <dbl> <chr> <dbl>
1     1 NYC      75
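When a list-column holds several values per row rather than one set of named fields, unnest_longer() expands rows instead of columns; a minimal sketch:

# One output row per element of the list-column
tibble(
  id = 1,
  temps = list(c(70, 75, 80))
) |>
  unnest_longer(temps)
# A tibble: 3 × 2
     id temps
  <dbl> <dbl>
1     1    70
2     1    75
3     1    80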

14.5.3 In-Class Exercise 4 – JSON Rectangling

  1. Load a nested JSON file.
  2. Use unnest_wider() or unnest_longer() to flatten it.
  3. Create a tidy table with one row per observation (a sketch follows this list).
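A sketch of the full pipeline, assuming data/example.json contains an array of objects, one per observation:

# Each element of json_list becomes one row; unnest_wider() then
# spreads each object's named fields into columns
json_list <- read_json("data/example.json")

tibble(record = json_list) |>
  unnest_wider(record)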

14.6 Web Scraping (R4DS Chapter 24)

Web scraping extracts data from web pages. Use rvest to read HTML and extract tables or nodes; before scraping a site, check its terms of service and robots.txt file.

14.6.1 Example: Scraping a Table

library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
page <- read_html(url)

gdp_table <- page |> html_element("table") |> html_table()
head(gdp_table)
# A tibble: 2 × 1
  X1                                                                            
  <chr>                                                                         
1 ""                                                                            
2 "Largest economies in the world by GDP (nominal) in 2025according to Internat…

14.6.2 In-Class Exercise 5 – Web Scraping

  1. Use rvest to scrape a simple table from Wikipedia.
  2. Convert it to a tibble and clean column names.
  3. Create a plot of GDP vs. rank (a sketch follows this list).
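A sketch for step 3, assuming you have cleaned the scraped table into a tibble gdp_clean with numeric rank and gdp columns (both names are hypothetical):

library(ggplot2)

# `rank` and `gdp` are hypothetical cleaned column names
gdp_clean |>
  ggplot(aes(x = rank, y = gdp)) +
  geom_point() +
  scale_y_log10() +
  labs(x = "Rank", y = "GDP (log scale)")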

14.7 In-Class Challenge – Multiple Data Sources

  1. Import an Excel dataset, a JSON dataset, and scrape a table from the web.
  2. Clean and join at least two sources.
  3. Create one visualization combining information.

14.8 Homework Preview

For the next homework:

  • Choose two different data sources (Excel, database, parquet, JSON, web)
  • Import and tidy them
  • Join or compare across sources
  • Render a short report with one plot and one table
  • Submit the rendered PDF

14.9 Conclusion

This session completes the course by showing how to access data from multiple modern sources, preparing you to work with real-world messy data beyond flat CSV files.