14  Accessing Data: Spreadsheets, Databases, Arrow, JSON, and Web Scraping

14.1 Learning Objectives

By the end of this lecture, you should be able to:

  • Import and work with data from Excel and Google Sheets
  • Connect to and query relational databases from R
  • Use Arrow to work efficiently with parquet files and large datasets
  • Access and tidy hierarchical JSON data
  • Perform basic web scraping to extract data from web pages

14.2 Spreadsheets (R4DS Chapter 20)

R can read Excel files with the readxl package and Google Sheets with the googlesheets4 package; both return tibbles, so the downstream workflow is the same.

14.2.1 Importing Excel

library(readxl)

excel_df <- read_excel("data/example.xlsx", sheet = "Sheet1")
head(excel_df)
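readxl can also tell you which worksheets a file contains and read just part of one. A minimal sketch (the range and column types below are placeholders, not properties of the course file):

# List the worksheets in the workbook before deciding what to read
excel_sheets("data/example.xlsx")

# Read a specific cell range and declare column types up front
# (range and col_types here are illustrative; adjust to your file)
excel_subset <- read_excel(
  "data/example.xlsx",
  sheet = "Sheet1",
  range = "A1:C10",
  col_types = c("text", "numeric", "numeric")
)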

14.2.2 Importing Google Sheets

library(googlesheets4)
sheet_url <- "https://docs.google.com/spreadsheets/d/your-sheet-id/edit#gid=0"
gs_df <- read_sheet(sheet_url)
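By default, read_sheet() prompts you to authenticate with Google. If the sheet is shared publicly ("anyone with the link can view"), you can skip authentication entirely; a minimal sketch:

# Skip Google authentication for publicly shared sheets
gs4_deauth()

gs_df <- read_sheet(sheet_url)
head(gs_df)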

14.2.3 In-Class Exercise 1 – Spreadsheets

  1. Read an Excel file from the course data folder.
  2. Load a Google Sheet you create (optional, requires authentication).
  3. Summarize one numeric column (a sketch follows this list).
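A sketch for step 3, assuming the Excel file has a numeric column named value (substitute a real column from your data):

library(dplyr)

# `value` is a hypothetical column name; replace it with one of yours
excel_df |>
  summarize(
    mean_value = mean(value, na.rm = TRUE),
    sd_value   = sd(value, na.rm = TRUE)
  )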

14.3 Databases (R4DS Chapter 21)

Use DBI together with a backend package such as RSQLite to interact with relational databases. With dbplyr installed, you can also query tables lazily using familiar dplyr verbs.

14.3.1 Example: Connecting to SQLite

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "data/mydb.sqlite")

# List tables
dbListTables(con)

# Read a table into R
flights_db <- dbReadTable(con, "flights")

# Or use dplyr (via dbplyr) to build the query lazily;
# putting head() before collect() translates to a SQL LIMIT
library(dplyr)
tbl(con, "flights") |> filter(dep_delay > 60) |> head() |> collect()
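You can also send SQL to the database directly with dbGetQuery():

# Run raw SQL and get a data frame back
delayed <- dbGetQuery(con, "SELECT * FROM flights WHERE dep_delay > 60 LIMIT 10")

# When you are completely finished with the database, close the connection:
# dbDisconnect(con)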

14.3.2 In-Class Exercise 2 – Databases

  1. Connect to the provided SQLite database.
  2. List tables with dbListTables().
  3. Query the flights table for flights delayed more than 2 hours (a sketch follows this list).
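A sketch for step 3, assuming dep_delay is recorded in minutes (as in the nycflights13 data):

# 2 hours = 120 minutes
tbl(con, "flights") |>
  filter(dep_delay > 120) |>
  collect()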

14.4 Arrow (R4DS Chapter 22)

The arrow package reads Parquet files lazily, so you can query datasets larger than memory without loading everything into R.

14.4.1 Example: Reading Parquet

library(arrow)

dataset <- open_dataset("data/large.parquet")
# head() before collect() keeps the row limit inside the Arrow query
dataset |> filter(column_x > 10) |> head() |> collect()
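Aggregations are pushed down to the Arrow engine as well, so only the summary reaches R. A sketch reusing the hypothetical column_x, with a made-up grouping column group_y:

# Only the aggregated result is materialized in R
dataset |>
  group_by(group_y) |>
  summarize(mean_x = mean(column_x, na.rm = TRUE)) |>
  collect()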

14.4.2 In-Class Exercise 3 – Arrow

  1. Open a parquet dataset using arrow::open_dataset().
  2. Run a filter and select query.
  3. Compare performance to reading the equivalent CSV (a sketch follows this list).
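A sketch for step 3, assuming an equivalent CSV file exists at the hypothetical path data/large.csv:

# Time the same query against the Parquet and CSV versions
system.time(
  open_dataset("data/large.parquet") |> filter(column_x > 10) |> collect()
)

system.time(
  open_dataset("data/large.csv", format = "csv") |> filter(column_x > 10) |> collect()
)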

14.5 Hierarchical Data (R4DS Chapter 23)

Hierarchical data, such as JSON, often contains nested lists. Use jsonlite to load it, then tidyr's unnest_wider() and unnest_longer() to rectangle it into a flat tibble.

14.5.1 Example: Reading JSON

library(jsonlite)

json_data <- fromJSON("data/example.json")
str(json_data)
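Note that fromJSON() simplifies nested structures into vectors and data frames wherever it can. For the tidyr rectangling workflow below, it is often easier to keep the raw nested lists with read_json():

# read_json() returns plain nested lists, which tidyr can rectangle
json_list <- read_json("data/example.json")
str(json_list, max.level = 2)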

14.5.2 Flattening Nested Data

library(tidyr)
nested_df <- tibble(
  id = 1,
  details = list(tibble(city = "NYC", temp = 75))
)

nested_df |> unnest_wider(details)
# A tibble: 1 × 3
     id city   temp
  <dbl> <chr> <dbl>
1     1 NYC      75
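When a list-column holds several values per row rather than one set of named fields, unnest_longer() expands rows instead of columns; a minimal sketch:

# One output row per element of the list-column
tibble(
  id = 1,
  temps = list(c(70, 75, 80))
) |>
  unnest_longer(temps)
# A tibble: 3 × 2
     id temps
  <dbl> <dbl>
1     1    70
2     1    75
3     1    80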

14.5.3 In-Class Exercise 4 – JSON Rectangling

  1. Load a nested JSON file.
  2. Use unnest_wider() or unnest_longer() to flatten it.
  3. Create a tidy table with one row per observation (a sketch follows this list).
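A sketch of the full pipeline, assuming data/example.json contains an array of objects, one per observation:

# Each element of json_list becomes one row; unnest_wider() then
# spreads each object's named fields into columns
json_list <- read_json("data/example.json")

tibble(record = json_list) |>
  unnest_wider(record)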

14.6 Web Scraping (R4DS Chapter 24)

Web scraping extracts data from web pages. Use rvest to read HTML and extract tables or nodes; before scraping a site, check its terms of service and robots.txt file.

14.6.1 Example: Scraping a Table

library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
page <- read_html(url)

gdp_table <- page |> html_element("table") |> html_table()
head(gdp_table)
# A tibble: 2 × 1
  X1                                                                            
  <chr>                                                                         
1 ""                                                                            
2 "Largest economies in the world by GDP (nominal) in 2025according to Internat…

14.6.2 In-Class Exercise 5 – Web Scraping

  1. Use rvest to scrape a simple table from Wikipedia.
  2. Convert it to a tibble and clean column names.
  3. Create a plot of GDP vs. rank (a sketch follows this list).
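A sketch for step 3, assuming you have cleaned the scraped table into a tibble gdp_clean with numeric rank and gdp columns (both names are hypothetical):

library(ggplot2)

# `rank` and `gdp` are hypothetical cleaned column names
gdp_clean |>
  ggplot(aes(x = rank, y = gdp)) +
  geom_point() +
  scale_y_log10() +
  labs(x = "Rank", y = "GDP (log scale)")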

14.7 In-Class Challenge – Multiple Data Sources

  1. Import an Excel dataset, a JSON dataset, and scrape a table from the web.
  2. Clean and join at least two sources.
  3. Create one visualization combining information.

14.8 Homework Preview

For the next homework:

  • Choose two different data sources (Excel, database, parquet, JSON, web)
  • Import and tidy them
  • Join or compare across sources
  • Render a short report with one plot and one table
  • Submit the rendered PDF

14.9 Conclusion

This session completes the course by showing how to access data from multiple modern sources, preparing you to work with real-world messy data beyond flat CSV files.