9  Data Import with readr

9.1 Learning Objectives

By the end of this chapter, you should be able to:

  • Import CSV and TSV files with readr
  • Understand column types
  • Diagnose and fix import problems

9.2 Reading CSV and TSV Files

The readr package (part of the tidyverse) provides fast and friendly functions for reading text data.

9.2.1 Example: Reading a CSV file

Pass read_csv() the path to the file. For instance, if you have downloaded the file to your data folder:

### students <- read_csv("data/students.csv")

Here, we will read the file directly from its URL instead. First load the tidyverse, which includes readr:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
students <- read_csv("https://pos.it/r4ds-students-csv")
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6    

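Practical advice: always look at your data before you transform it. Notice that favourite.food contains the string “N/A”, which readr reads as ordinary text. Use the na argument to list the strings that should be treated as missing: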

students <- read_csv("https://pos.it/r4ds-students-csv", na = c("N/A", ""))
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6    

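Print students at your R prompt. Notice that Student ID and Full Name are surrounded by backticks: they contain spaces, which makes them non-syntactic names in R. You can rename them with rename():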

students |> 
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
# A tibble: 6 × 5
  student_id full_name        favourite.food     mealPlan            AGE  
       <dbl> <chr>            <chr>              <chr>               <chr>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2          2 Barclay Lynn     French fries       Lunch only          5    
3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6          6 Güvenç Attila    Ice cream          Lunch only          6    
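
The AGE column is still character because one value is spelled out as "five". Below is a minimal sketch of one way to repair it, assuming that "five" is the only non-numeric entry. The inline data is a trimmed, hypothetical copy of the students file so that the example runs offline, and the names students_small and students_fixed are made up for illustration.

```r
library(dplyr)
library(readr)

# Trimmed, hypothetical copy of the students data.
# I() marks the string as literal CSV text rather than a file path.
students_csv <- I("Student ID,Full Name,AGE
1,Sunil Huffmann,4
4,Leon Rossini,
5,Chidiegwu Dunkel,five
2,Barclay Lynn,5
")

students_small <- read_csv(
  students_csv,
  na = c("N/A", ""),
  show_col_types = FALSE
)

students_fixed <- students_small |>
  rename(student_id = `Student ID`, full_name = `Full Name`, age = AGE) |>
  # Replace the spelled-out value, then convert the whole column to numeric.
  mutate(age = parse_number(if_else(age == "five", "5", age)))

students_fixed
```

If a real file contained several spelled-out numbers, a lookup table or case_when() would probably scale better than a single if_else().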

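9.2.2 Other arguments

read_csv() also accepts a literal string of CSV text in place of a file path, which is convenient for experimenting with its other arguments: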

read_csv(
  "a,b,c
  1,2,3
  4,5,6"
)
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Use skip = n to skip the first n lines, for example when they contain metadata:

read_csv(
  "The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3",
  skip = 2
)
Rows: 1 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): x, y, z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3

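If the data has no header row, set col_names = FALSE so that the first row is treated as data. readr then labels the columns X1, X2, and so on: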

read_csv(
  "1,2,3
  4,5,6",
  col_names = FALSE
)
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): X1, X2, X3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
     X1    X2    X3
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Alternatively, supply the column names yourself as a character vector:

read_csv(
  "1,2,3
  4,5,6",
  col_names = c("x", "y", "z")
)
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): x, y, z

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6
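
Two further read_csv() arguments are often useful alongside skip and col_names: comment, which drops everything after a given marker, and n_max, which limits how many data rows are read. The sketch below uses made-up inline data and hypothetical object names.

```r
library(readr)

# Lines starting with the comment marker are ignored.
commented_csv <- I("# A comment line to ignore
x,y,z
1,2,3
")
with_comment <- read_csv(commented_csv, comment = "#", show_col_types = FALSE)

# n_max caps the number of data rows that are read (the header is not counted).
long_csv <- I("x,y
1,2
3,4
5,6
")
first_two <- read_csv(long_csv, n_max = 2, show_col_types = FALSE)

with_comment
first_two
```

I() tells readr explicitly that the string is data rather than a path.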

9.2.3 Reading a TSV file

Tab-separated files work the same way with read_tsv():

### df_tsv <- read_tsv("data/example.tsv")
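
A small runnable sketch, assuming made-up inline data instead of a file on disk. It also shows read_delim(), the general readr function for which you name the delimiter yourself.

```r
library(readr)

# "\t" is a tab character; I() marks the string as literal data, not a path.
tsv_text <- I("name\tscore\nana\t90\nben\t85\n")
df_tsv <- read_tsv(tsv_text, show_col_types = FALSE)

# read_delim() is the general form: state the delimiter explicitly.
pipe_text <- I("name|score\nana|90\nben|85\n")
df_pipe <- read_delim(pipe_text, delim = "|", show_col_types = FALSE)

df_tsv
df_pipe
```

Both calls should produce the same two columns, name and score.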

9.2.4 In-Class Exercise 1 – CSV

  1. Download the students data (e.g., from the course repository).
  2. Read it into R using read_csv().
  3. Inspect its structure with glimpse().
  4. What data types were automatically detected?
  5. Notice the backticks around some column names.
  6. Use janitor::clean_names() to convert all column names to snake case.

9.2.5 In-Class Challenge

Change the data type stored in the meal_plan column. Hint: you can use factor() with mutate().

9.3 Column Types and Parsing

readr automatically guesses column types, but you can override them.

read_csv("
  logical,numeric,date,string
  TRUE,1,2021-01-15,abc
  false,4.5,2021-02-15,def
  T,Inf,2021-02-16,ghi
")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): string
dbl  (1): numeric
lgl  (1): logical
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
  logical numeric date       string
  <lgl>     <dbl> <date>     <chr> 
1 TRUE        1   2021-01-15 abc   
2 FALSE       4.5 2021-02-15 def   
3 TRUE      Inf   2021-02-16 ghi   
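
The messages above mention spec() and show_col_types. As a short sketch of both, using a shortened copy of the same inline data and a hypothetical object name typed:

```r
library(readr)

typed <- read_csv(
  I("logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
"),
  show_col_types = FALSE  # suppress the column specification message
)

# spec() shows the column specification readr settled on after guessing.
spec(typed)
```

The printed specification can be copied into a col_types argument when you want the import to be explicit rather than guessed.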

9.3.1 But what if the data is not clean?

simple_csv <- "
  x
  10
  .
  20
  30"

Read it with read_csv() and notice that x is parsed as a character column:

read_csv(simple_csv)
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
  x    
  <chr>
1 10   
2 .    
3 20   
4 30   

The . marks a missing value, but readr does not know that. Because one entry is not a number, readr guesses that the whole column is character rather than double.

You can tell readr that x is numeric with the col_types argument.

col_types takes a named list (or a cols() specification) in which the names match the column names in the CSV file.

df <- read_csv(
  simple_csv, 
  col_types = list(x = col_double())
)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Read the warning: readr could not parse at least one value. Call problems() on the result to find out which one:

problems(df)
# A tibble: 1 × 5
    row   col expected actual file                                              
  <int> <int> <chr>    <chr>  <chr>                                             
1     3     1 a double .      /private/var/folders/2n/pg5xtmj97p51tsr0ch_1jkqw6…

readr expected a double but found a . instead. Setting na = "." tells readr to treat . as missing data:

read_csv(simple_csv, na = ".")
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
      x
  <dbl>
1    10
2    NA
3    20
4    30

Other column type functions in readr are col_logical(), col_integer(), col_character(), col_factor(), col_date(), col_datetime(), and col_number().
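
A brief sketch of how these column functions combine inside cols(), using made-up inline data. The .default entry and cols_only() are readr features. The object names and the day/month/year date format are assumptions for illustration.

```r
library(readr)

typed_csv <- I("id,plan,day
1,Lunch only,15/01/2021
2,Breakfast and lunch,16/02/2021
")

# Give each column an explicit type.
typed <- read_csv(
  typed_csv,
  col_types = cols(
    id   = col_integer(),
    plan = col_factor(),                  # levels taken from the data
    day  = col_date(format = "%d/%m/%Y")  # assumed day/month/year layout
  )
)

# Read every column as character unless told otherwise.
all_chr <- read_csv(typed_csv, col_types = cols(.default = col_character()))

# Read only the named column; the others are dropped.
only_id <- read_csv(typed_csv, col_types = cols_only(id = col_integer()))
```

Reading everything as character first can be a useful fallback when guessing goes wrong, since it lets you inspect the raw values before converting them.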

You can parse numbers with parse_number(), dates with parse_date(), and times with parse_time().
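
The parse functions work on plain character vectors, so you can try them outside of read_csv(). A minimal sketch with made-up inputs (the object names are hypothetical):

```r
library(readr)

# parse_number() keeps the first number it finds and drops surrounding text.
amounts <- parse_number(c("$1,234.50", "45%", "about 7 kg"))

# parse_date() expects ISO 8601 (YYYY-MM-DD) unless you supply a format.
day <- parse_date("15/01/2021", format = "%d/%m/%Y")

# parse_time() returns a time of day.
start <- parse_time("14:30:00")

amounts
day
start
```

These are handy inside mutate() when a column was imported as character and needs cleaning afterwards.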

9.3.2 In-Class Exercise 2

  1. Read the simple_csv string with read_csv().
  2. Use the col_types argument to change the data type of a column.
  3. Use problems() to diagnose a warning.
  4. Use na = to properly code missing data.

9.3.3 In-Class Challenge

  1. Read a TSV file from PanTHERIA: “https://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR93_Aug2008.txt”
  2. Use na = to change all -999.00s to missing data.

9.4 Writing to a file

You can write a data frame to disk with write_csv() or write_tsv(). The first argument is the data frame and the second is the file path.

write_csv(students, "students.csv")

9.4.1 In-Class Exercise 3

  1. Write a data frame (e.g., students) to the file students-2.csv.
  2. Read that file back in with read_csv().
  3. Notice that column types you changed in R (e.g., a factor from the earlier challenge) are not preserved in the CSV.
  4. Use write_rds() and read_rds() to preserve them in R's binary RDS format:

write_rds(students, "students.rds")
read_rds("students.rds")
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6    
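
To see the difference between the two formats side by side, here is a self-contained sketch of a round trip. It assumes a small made-up tibble with a factor column and writes to temporary files, so it leaves nothing behind in your project. The object names are hypothetical.

```r
library(tibble)
library(readr)

plans <- tibble(
  id = 1:3,
  meal_plan = factor(c("Lunch only", "Breakfast and lunch", "Lunch only"))
)

# Temporary paths so the example does not touch your working directory.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(plans, csv_path)
write_rds(plans, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$meal_plan)  # the factor comes back as character
class(from_rds$meal_plan)  # the factor is preserved
```

CSV stores only text, so readr has to guess the types again on import. RDS stores the R object itself, at the cost of being readable only from R.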

9.5 Data Entry

You can do basic data entry in your R script.

By column with tibble():

tibble(
  x = c(1, 2, 5), 
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)
# A tibble: 3 × 3
      x y         z
  <dbl> <chr> <dbl>
1     1 h      0.08
2     2 m      0.83
3     5 g      0.6 

By row with tribble() (short for transposed tibble):

tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)
# A tibble: 3 × 3
      x y         z
  <dbl> <chr> <dbl>
1     1 h      0.08
2     2 m      0.83
3     5 g      0.6 

9.6 Homework Preview

For homework, you will:

  • Import at least one CSV dataset
  • Fix any parsing issues (e.g., column types, dates)
  • Clean at least one column with mutate()
  • Provide a short summary (using group_by() and summarize())
  • Render to PDF and submit

9.7 Next Steps

Next week, we will learn to work with text data and regular expressions using the stringr package.