14 Accessing Data: Spreadsheets, Databases, Arrow, JSON, and Web Scraping
14.1 Learning Objectives
By the end of this lecture, you should be able to:
- Import and work with data from Excel and Google Sheets
- Connect to and query relational databases from R
- Use Arrow to work efficiently with parquet files and large datasets
- Access and tidy hierarchical JSON data
- Perform basic web scraping to extract data from web pages
14.2 Spreadsheets (R4DS Chapter 20)
R can read Excel files with the readxl package and Google Sheets with googlesheets4.
14.2.1 Importing Excel
library(readxl)
excel_df <- read_excel("data/example.xlsx", sheet = "Sheet1")
head(excel_df)
14.2.2 Importing Google Sheets
library(googlesheets4)
sheet_url <- "https://docs.google.com/spreadsheets/d/your-sheet-id/edit#gid=0"
gs_df <- read_sheet(sheet_url)
14.2.3 In-Class Exercise 1 – Spreadsheets
- Read an Excel file from the course data folder.
- Load a Google Sheet you create (optional, requires authentication).
- Summarize one numeric column.
14.3 Databases (R4DS Chapter 21)
Use DBI and RSQLite to interact with relational databases. You can also use dplyr verbs to query tables.
14.3.1 Example: Connecting to SQLite
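A minimal sketch of a connection and query. The database path `data/flights.db` and the table and column names (`flights`, `dep_delay`) are assumptions; substitute the course database provided to you.

```r
library(DBI)
library(RSQLite)
library(dplyr)

# Connect to an on-disk SQLite database (path is a placeholder)
con <- dbConnect(RSQLite::SQLite(), "data/flights.db")

# See which tables are available
dbListTables(con)

# Run raw SQL and pull the result into a data frame
delayed <- dbGetQuery(con, "SELECT * FROM flights WHERE dep_delay > 120")

# Or use dplyr verbs: the query runs lazily in the database
# and only materializes in R when you call collect()
tbl(con, "flights") |>
  filter(dep_delay > 120) |>
  collect()

dbDisconnect(con)
```

The dplyr route is often preferable in this course because the same verbs work on local data frames, database tables, and Arrow datasets.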
14.3.2 In-Class Exercise 2 – Databases
- Connect to the provided SQLite database.
- List tables with dbListTables().
- Query the flights table for flights delayed more than 2 hours.
14.4 Arrow (R4DS Chapter 22)
Arrow allows you to read parquet files efficiently without loading everything into memory.
14.4.1 Example: Reading Parquet
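A minimal sketch, assuming a parquet file at `data/flights.parquet` with the usual flights columns (both are placeholders):

```r
library(arrow)
library(dplyr)

# Open the dataset lazily: no data is loaded into memory yet
ds <- open_dataset("data/flights.parquet")

# filter() and select() are pushed down to Arrow;
# only the (small) result is materialized by collect()
ds |>
  filter(dep_delay > 120) |>
  select(year, month, day, dep_delay) |>
  collect()
```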
14.4.2 In-Class Exercise 3 – Arrow
- Open a parquet dataset using arrow::open_dataset().
- Run a filter and select query.
- Compare performance to reading the equivalent CSV.
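One way to sketch the timing comparison in the last bullet, using base R's system.time() (both file paths are placeholders for files you create):

```r
library(arrow)
library(dplyr)

# Reading the whole CSV into memory
system.time(read.csv("data/flights.csv"))

# Arrow: open lazily, then materialize
system.time(open_dataset("data/flights.parquet") |> collect())
```

For a fair comparison, also try a filtered query against each: Arrow's advantage grows when it can skip rows and columns without reading them.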
14.5 Hierarchical Data (R4DS Chapter 23)
Hierarchical data (JSON) often contains nested lists. Use jsonlite to load JSON and tidyr::unnest_wider() to flatten it.
14.5.1 Example: Reading JSON
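A minimal sketch using jsonlite; here the JSON is an inline string for reproducibility, but fromJSON() also accepts a file path or URL:

```r
library(jsonlite)

# An array of objects parses to a data frame by default
json_txt <- '[{"id": 1, "city": "NYC"}, {"id": 2, "city": "LA"}]'
df <- fromJSON(json_txt)
df
```

When the JSON is deeply nested, fromJSON() returns list-columns or nested lists, which is where the rectangling tools below come in.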
14.5.2 Flattening Nested Data
library(tidyr)
nested_df <- tibble(
id = 1,
details = list(tibble(city = "NYC", temp = 75))
)
nested_df |> unnest_wider(details)
# A tibble: 1 × 3
id city temp
<dbl> <chr> <dbl>
1 1 NYC 75
14.5.3 In-Class Exercise 4 – JSON Rectangling
- Load a nested JSON file.
- Use unnest_wider() or unnest_longer() to flatten it.
- Create a tidy table with one row per observation.
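The section above showed unnest_wider(); a small illustration of its counterpart, unnest_longer(), which produces one row per element of a list-column:

```r
library(tidyr)
library(tibble)

# A list-column holding a vector per row
df <- tibble(id = 1, tags = list(c("a", "b", "c")))

# One row per element: ids are repeated as needed
df |> unnest_longer(tags)
```

Rule of thumb: unnest_wider() when each list element has named fields (make columns); unnest_longer() when each element is a sequence of observations (make rows).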
14.6 Web Scraping (R4DS Chapter 24)
Web scraping extracts data from websites. Use rvest to read HTML and extract tables or nodes.
14.6.1 Example: Scraping a Table
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
page <- read_html(url)
gdp_table <- page |> html_element("table") |> html_table()
head(gdp_table)
# A tibble: 2 × 1
X1
<chr>
1 ""
2 "Largest economies in the world by GDP (nominal) in 2025according to Internat…
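As the output suggests, the first table html_element() finds may be a styling wrapper rather than the data table. One way to locate the right one is to grab every table with html_elements() (plural) and inspect; the index 2 below is a guess, so check the list yourself:

```r
library(rvest)

# All <table> elements on the page, each parsed to a tibble
tables <- read_html(url) |>
  html_elements("table") |>
  html_table()

length(tables)            # how many tables the page contains
gdp_table <- tables[[2]]  # index is an assumption; inspect to pick the data table
```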
14.6.2 In-Class Exercise 5 – Web Scraping
- Use rvest to scrape a simple table from Wikipedia.
- Convert it to a tibble and clean column names.
- Create a plot of GDP vs. rank.
14.7 In-Class Challenge – Multiple Data Sources
- Import an Excel dataset, a JSON dataset, and scrape a table from the web.
- Clean and join at least two sources.
- Create one visualization combining information.
14.8 Homework Preview
For the next homework:
- Choose two different data sources (Excel, database, parquet, JSON, web)
- Import and tidy them
- Join or compare across sources
- Render a short report with one plot and one table
- Submit the rendered PDF
14.9 Conclusion
This session completes the course by showing how to access data from multiple modern sources, preparing you to work with real-world messy data beyond flat CSV files.