9 Data Import with readr

9.1 Learning Objectives

By the end of this chapter, you should be able to:

Import CSV and TSV files with readr
Understand column types
Diagnose and fix import problems

9.2 Reading CSV and TSV Files

The readr package (part of the tidyverse) provides fast and friendly functions for reading text data.

9.2.1 Example: Reading a CSV file

To read a CSV file, pass its path to read_csv(). For instance, if you have downloaded the file to your data folder:

### students <- read_csv("data/students.csv")

Here, we will read the file directly from its URL:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

students <- read_csv("https://pos.it/r4ds-students-csv")

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

students

# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6

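Practical advice: look at your data before you transform it. Notice that favourite.food contains the string “N/A”, which readr has read as ordinary text rather than as a missing value. Use the na argument to tell readr which strings mean missing: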

students <- read_csv("https://pos.it/r4ds-students-csv", na = c("N/A", ""))

Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

students

# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6

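Type students at your R prompt. Notice that Student ID and Full Name are surrounded by backticks. These names contain spaces, so they are not valid R variable names on their own and must be quoted with backticks whenever you refer to them. You can rename them: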

students |> 
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )

# A tibble: 6 × 5
  student_id full_name        favourite.food     mealPlan            AGE  
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2          2 Barclay Lynn     French fries       Lunch only          5    
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6          6 Güvenç Attila    Ice cream          Lunch only          6
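
The AGE column is still a character column because one entry is the word five. Below is a minimal sketch of one way to repair it. It assumes a small hand-typed stand-in for the students data (students_small is a hypothetical name, not part of the course files), so it runs without a network connection.

```r
library(tibble)
library(dplyr)
library(readr)

# Hypothetical stand-in for two columns of the students data.
students_small <- tibble(
  `Student ID` = 1:6,
  AGE = c("4", "5", "7", NA, "five", "6")
)

students_fixed <- students_small |>
  rename(student_id = `Student ID`, age = AGE) |>
  # Replace the word "five" with the digit, then parse the text as numbers.
  # Missing ages stay missing.
  mutate(age = parse_number(if_else(age == "five", "5", age)))

students_fixed
```

After this step, age is a double column, so numeric summaries such as mean(students_fixed$age, na.rm = TRUE) work as expected.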

9.2.2 Other arguments

read_csv(
  "a,b,c
  1,2,3
  4,5,6"
)

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

You can read a CSV and skip the first few lines if they contain metadata.

read_csv(
  "The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3",
  skip = 2
)

Rows: 1 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): x, y, z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 1 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3

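If the data has no column names, set col_names = FALSE so that the first row is treated as data and the columns are labelled X1, X2, and so on: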

read_csv(
  "1,2,3
  4,5,6",
  col_names = FALSE
)

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): X1, X2, X3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Or supply the column names yourself as a character vector:

read_csv(
  "1,2,3
  4,5,6",
  col_names = c("x", "y", "z")
)

Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): x, y, z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 2 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6
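
The message printed after each read_csv() call mentions two things: show_col_types = FALSE and spec(). Here is a small sketch of both, assuming readr 2.0 or later, where I() marks a string as literal data rather than a file path. The object name df_quiet is made up for illustration.

```r
library(readr)

# show_col_types = FALSE suppresses the column specification message.
df_quiet <- read_csv(
  I("x,y,z\n1,2,3\n4,5,6"),
  show_col_types = FALSE
)

# spec() retrieves the column specification that readr guessed.
spec(df_quiet)
```

Quieting the message is convenient in rendered documents, while spec() lets you check the guessed types whenever you need them.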

9.2.3 Also: Reading a TSV file

### df_tsv <- read_tsv("data/example.tsv")
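
If you do not have a TSV file at hand, the following sketch reads tab-separated text typed directly into the script. The data are invented for illustration, and I() (readr 2.0 or later) tells readr that the string is literal data rather than a file path.

```r
library(readr)

# "\t" is a tab and "\n" is a line break.
tsv_text <- I("name\tscore\nAda\t90\nGrace\t95\n")

df_tsv <- read_tsv(tsv_text, show_col_types = FALSE)
df_tsv
```

read_tsv() accepts the same arguments as read_csv(), such as na, skip, and col_names. Only the delimiter differs.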

9.2.4 In-Class Exercise 1 – CSV

Download the students data (e.g., from the course repository).
Read it into R using read_csv().
Inspect its structure with glimpse().
What data types were automatically detected?
Notice the backticks around some column names.
Use janitor::clean_names() to turn all column names to snake case.

9.2.5 In-Class Challenge

Change the data type stored in the meal_plan column. Hint: you can use factor() with mutate().

9.3 Column Types and Parsing

readr automatically guesses column types, but you can override them.

read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")

Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): string
dbl  (1): numeric
lgl  (1): logical
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 3 × 4
  logical numeric date       string
  <lgl>     <dbl> <date>     <chr> 
1 TRUE        1   2021-01-15 abc   
2 FALSE       4.5 2021-02-15 def   
3 TRUE      Inf   2021-02-16 ghi

9.3.1 But what if the data is not clean?

simple_csv <- "
  x
  10
  .
  20
  30"

Read it with read_csv() and notice that x is imported as a character column:

read_csv(simple_csv)

Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 4 × 1
  x    
  <chr>
1 10   
2 .    
3 20   
4 30

The . stands for a missing value, but readr does not know that. Because . cannot be parsed as a number, readr turns what should be a double (numeric) column into a character column.

Tell readr that it is a numeric column with col_types.

The col_types argument takes a named list where the names match the column names in the CSV file.

df <- read_csv(
  simple_csv, 
  col_types = list(x = col_double())
)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Read the warning. Use problems() to find out what went wrong:

problems(df)

# A tibble: 1 × 5
    row   col expected actual file                                              
  <int> <int> <chr>    <chr>  <chr>                                             
1     3     1 a double .      /private/var/folders/2n/pg5xtmj97p51tsr0ch_1jkqw6…

We can make . indicate missing data using na = ".":

read_csv(simple_csv, na = ".")

Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 4 × 1
      x
  <dbl>
1    10
2    NA
3    20
4    30

Other column type functions in readr are col_logical(), col_integer(), col_character(), col_factor(), col_date(), col_datetime(), and col_number().
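
As an illustration of several of these functions used together, the sketch below reads a small invented data set and sets each column's type explicitly with cols(). The column names and values are assumptions made for this example.

```r
library(readr)

# Invented data: an integer ID, an ISO date, and a categorical meal plan.
typed_csv <- I("id,joined,plan\n1,2021-01-15,Lunch only\n2,2021-02-15,Breakfast and lunch\n")

df_typed <- read_csv(
  typed_csv,
  col_types = cols(
    id = col_integer(),
    joined = col_date(),
    plan = col_factor(levels = c("Lunch only", "Breakfast and lunch"))
  )
)

df_typed
```

Because every column type is specified, readr does not need to guess and prints no column specification message.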

You can parse numbers with parse_number(), dates with parse_date(), and times with parse_time().
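
A short sketch of these parsers applied to single strings. The input values are made up for illustration.

```r
library(readr)

# parse_number() ignores non-numeric characters around the number
# and treats "," as a grouping mark by default.
price <- parse_number("$1,234.50")
share <- parse_number("45%")

# parse_date() expects ISO 8601 dates (year-month-day) by default.
day <- parse_date("2021-01-15")

# parse_time() returns an hms object (a time of day).
start <- parse_time("14:30")

price
share
day
start
```

These functions are handy inside mutate() when a column was imported as character and needs to be converted after the fact.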

9.3.2 In-Class Exercise 2

Load the simple_csv string with read_csv().
Use the col_types argument to change the data type of a column.
Use problems() to diagnose a warning.
Use na = to properly code missing data.

9.3.3 In-Class Challenge

Read a TSV file from PanTHERIA: “https://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR93_Aug2008.txt”
Use na = to change all -999.00s to missing data.

9.4 Writing to a file

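To save a data frame to disk, use write_csv() or write_tsv(). The first argument is the data frame and the second is the path of the file to write.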

write_csv(students, "students.csv")

9.4.1 In-Class Exercise 3

Write a data frame (e.g., students) to a file named students-2.csv.
Read that file in again with read_csv().
Notice that any column types you changed (for example, a factor column) are not preserved, because a CSV file stores only plain text.
Use write_rds() and read_rds() to preserve those types in R's binary format.

write_rds(students, "students.rds")
read_rds("students.rds")

# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6
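
The difference between the two formats is easiest to see with a column whose type was changed. The sketch below uses a small invented data frame with a factor column and writes it to temporary files, so it leaves nothing behind in your project folder. The object names are hypothetical.

```r
library(readr)
library(tibble)

# Invented data with a factor column.
meals <- tibble(
  student_id = c(1, 2, 3),
  meal_plan = factor(c("Lunch only", "Lunch only", "Breakfast and lunch"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(meals, csv_path)
write_rds(meals, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$meal_plan)  # the factor came back as plain text
class(from_rds$meal_plan)  # the factor survived the round trip
```

The RDS file can only be read by R, so CSV remains the better choice when the data must be shared with other tools.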

9.5 Data Entry

You can do basic data entry in your R script.

By column with tibble():

tibble(
  x = c(1, 2, 5), 
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

# A tibble: 3 × 3
      x y         z
  <dbl> <chr> <dbl>
1     1 h      0.08
2     2 m      0.83
3     5 g      0.6

By row with tribble() (short for transposed tibble):

tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)

# A tibble: 3 × 3
      x y         z
  <dbl> <chr> <dbl>
1     1 h      0.08
2     2 m      0.83
3     5 g      0.6

9.6 Homework Preview

For homework, you will:

Import at least one CSV dataset
Fix any parsing issues (e.g., column types, dates)
Clean at least one column with mutate()
Provide a short summary (using group_by() and summarize())
Render to PDF and submit

9.7 Next Steps

Next week, we will learn to work with text data and regular expressions using the stringr package.