9 Data Import with readr
9.1 Learning Objectives
By the end of this chapter, you should be able to:
- Import CSV and TSV files with readr
- Understand column types
- Diagnose and fix import problems
9.2 Reading CSV and TSV Files
The readr package (part of the tidyverse) provides fast and friendly functions for reading text data.
9.2.1 Example: Reading a CSV file
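The first argument to read_csv() is the path to the file. If you have downloaded the file to your data folder, for instance, the call is read_csv("data/students.csv"). read_csv() also accepts a URL, which is what we will use here:
library(tidyverse)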
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
students <- read_csv("https://pos.it/r4ds-students-csv")
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne N/A Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
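Practical advice: look at your data before you transform it. In the favourite.food column, the third row contains the string N/A, which should be a real missing value. The na argument tells read_csv() which strings to treat as NA:
students <- read_csv("https://pos.it/r4ds-students-csv", na = c("N/A", ""))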
Rows: 6 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Full Name, favourite.food, mealPlan, AGE
dbl (1): Student ID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
# A tibble: 6 × 5
`Student ID` `Full Name` favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
Type students at your R prompt. Notice that Student ID and Full Name are surrounded by backticks: they contain spaces, which makes them non-syntactic names in R. You can rename them:
students |>
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
# A tibble: 6 × 5
student_id full_name favourite.food mealPlan AGE
<dbl> <chr> <chr> <chr> <chr>
1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
2 2 Barclay Lynn French fries Lunch only 5
3 3 Jayendra Lyne <NA> Breakfast and lunch 7
4 4 Leon Rossini Anchovies Lunch only <NA>
5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
6 6 Güvenç Attila Ice cream Lunch only 6
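The AGE column above is stored as character because one entry is the word five. One possible cleanup, sketched here on a two-row inline copy of the data so that it runs without a download, is to replace that entry and then parse the column as a number. The object name students_small and the reduced set of columns are assumptions made for illustration.

```r
library(readr)
library(dplyr)

# Two rows copied from the students data; I() marks the string as literal data.
csv_text <- I("Student ID,Full Name,AGE
1,Sunil Huffmann,4
5,Chidiegwu Dunkel,five")

students_small <- read_csv(csv_text, show_col_types = FALSE) |>
  # Backticks are needed because the original names contain spaces.
  rename(student_id = `Student ID`, full_name = `Full Name`, age = AGE) |>
  # Swap the word for a digit, then convert the whole column to numeric.
  mutate(age = parse_number(if_else(age == "five", "5", age)))

students_small
```

Replacing a single known value by hand works for a small table. For larger data you would more likely inspect the distinct non-numeric values first and decide how to handle each.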
9.2.2 Other arguments
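read_csv() can also read a text string that is formatted like a CSV file, which is handy for experimenting with its arguments:
read_csv(
"a,b,c
1,2,3
4,5,6"
)
Rows: 2 Columns: 3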
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): a, b, c
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
You can read a CSV and skip the first few lines if they contain metadata.
read_csv(
"The first line of metadata
The second line of metadata
x,y,z
1,2,3",
  skip = 2
)
Rows: 1 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): x, y, z
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
If the data has no column names, set col_names = FALSE and readr labels the columns X1, X2, and so on:
read_csv(
"1,2,3
4,5,6",
  col_names = FALSE
)
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): X1, X2, X3
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
X1 X2 X3
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
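Or supply your own column names by passing a character vector to col_names:
read_csv(
"1,2,3
4,5,6",
  col_names = c("x", "y", "z")
)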
Rows: 2 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): x, y, z
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 3
x y z
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
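Metadata lines are sometimes marked with a comment character instead of appearing in a fixed number of leading rows. As a hedged sketch on made-up inline data, the comment argument drops such lines, and show_col_types = FALSE quiets the column specification message. The object name df_comment is an assumption for illustration.

```r
library(readr)

# Everything after "#" on a line is dropped because comment = "#".
# show_col_types = FALSE suppresses the column specification message.
df_comment <- read_csv(
  I("# A comment I want to skip
x,y,z
1,2,3"),
  comment = "#",
  show_col_types = FALSE
)

df_comment
```

Compared with skip, comment does not require knowing in advance how many metadata lines there are, but it only helps when those lines share a common marker.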
9.2.3 Also: Reading a TSV file
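read_tsv() behaves like read_csv() but splits fields on tab characters. The sketch below uses a small made-up inline string in place of a file, and also shows read_delim(), the general form where the delimiter is given explicitly. The object names and the data are assumptions for illustration.

```r
library(readr)

# "\t" is a tab character; the inline data stands in for a file on disk.
tsv_text <- I("name\tscore\nana\t90\nben\t85\n")

df_tsv_demo <- read_tsv(tsv_text, show_col_types = FALSE)

# read_delim() reads the same data when told the delimiter is a tab.
df_delim_demo <- read_delim(tsv_text, delim = "\t", show_col_types = FALSE)

df_tsv_demo
```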
For a file on disk, the call has the same shape as read_csv():
# df_tsv <- read_tsv("data/example.tsv")
9.2.4 In-Class Exercise 1 – CSV
- Download the students data (e.g., from the course repository).
- Read it into R using read_csv().
- Inspect its structure with glimpse().
- What data types were automatically detected?
- Notice the backticks around some column names.
- Use janitor::clean_names() to turn all column names to snake case.
9.2.5 In-Class Challenge
Change the data type stored in the meal_plan column. Hint: you can use factor() with mutate().
9.3 Column Types and Parsing
readr automatically guesses column types, but you can override them.
read_csv("
logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi
")
Rows: 3 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): string
dbl (1): numeric
lgl (1): logical
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
logical numeric date string
<lgl> <dbl> <date> <chr>
1 TRUE 1 2021-01-15 abc
2 FALSE 4.5 2021-02-15 def
3 TRUE Inf 2021-02-16 ghi
9.3.1 But what if the data is not clean?
simple_csv <- "
x
10
.
20
30"
Read it with read_csv() and notice that x comes back as a character column:
read_csv(simple_csv)
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
x
<chr>
1 10
2 .
3 20
4 30
One value is missing and has been recorded as a period. Because . is not a number, readr reads what should be a double (numeric) column as a character column.
You can tell readr that it is a numeric column with the col_types argument.
col_types takes a named list in which the names match the column names in the CSV file.
df <- read_csv(
  simple_csv,
  col_types = list(x = col_double())
)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Read the warning. Use problems() to find out which values failed to parse:
problems(df)
# A tibble: 1 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 3 1 a double . /private/var/folders/2n/pg5xtmj97p51tsr0ch_1jkqw6…
The report shows that in row 3, column 1, readr expected a double but found a period. We can make . indicate missing data using na = ".":
read_csv(simple_csv, na = ".")
Rows: 4 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): x
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 4 × 1
x
<dbl>
1 10
2 NA
3 20
4 30
Other column type functions in readr are col_logical(), col_integer(), col_character(), col_factor(), col_date(), col_datetime(), and col_number().
You can parse numbers with parse_number(), dates with parse_date(), and times with parse_time().
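To make the column type and parsing functions concrete, here is a small sketch on made-up values. The input strings, the column names id, score and joined, and the object name df_types are assumptions for illustration. It also uses cols(), which readr provides as an alternative to a plain list for col_types.

```r
library(readr)

# parse_number() ignores non-numeric characters around the number.
parse_number("$1,234.50")

# parse_date() expects ISO 8601 (YYYY-MM-DD) unless a format is supplied.
parse_date("2021-01-15")
parse_date("01/15/2021", format = "%m/%d/%Y")

# Declare every column type up front and treat "." as missing.
df_types <- read_csv(
  I("id,score,joined\n1,.,2021-01-15\n2,4.5,2021-02-15\n"),
  col_types = cols(
    id = col_integer(),
    score = col_double(),
    joined = col_date()
  ),
  na = "."
)

df_types
```

Declaring the types explicitly means a malformed value produces a parsing problem you can inspect, instead of silently changing the guessed type of the whole column.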
9.3.2 In-Class Exercise 2
- Load the simple_csv string with read_csv().
- Use the col_types argument to change the data type of a column.
- Use problems() to diagnose a warning.
- Use na = to properly code missing data.
9.3.3 In-Class Challenge
- Read a TSV file from PanTHERIA: "https://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR93_Aug2008.txt"
- Use na = to change all -999.00s to missing data.
9.4 Writing to a file
You can use write_csv() or write_tsv().
write_csv(students, "students.csv")
9.4.1 In-Class Exercise 3
- Write a data frame (e.g., students) to the file students-2.csv.
- Read that file in again with read_csv().
- Notice that the data types of the columns didn't pass through.
- Use write_rds() and read_rds() to maintain those changes in R binary format.
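The difference between the two formats can be seen in a short round trip. The following is a sketch on a made-up two-row table, not the exercise solution itself. The object names and the use of tempfile() to avoid writing into the working directory are assumptions for illustration.

```r
library(readr)
library(tibble)

# Hypothetical two-row table with a factor column.
meals <- tibble(
  student_id = c(1, 2),
  meal_plan = factor(c("Lunch only", "Breakfast and lunch"))
)

# Temporary files keep the sketch from touching the working directory.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(meals, csv_path)
write_rds(meals, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$meal_plan) # plain text cannot record that the column was a factor
class(from_rds$meal_plan) # RDS stores the R object itself, so the factor survives
```

A CSV file is the better choice for sharing data with other tools, while RDS is convenient for caching intermediate results between R sessions.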
9.5 Data Entry
You can do basic data entry in your R script.
By column with tibble():
tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)
# A tibble: 3 × 3
x y z
<dbl> <chr> <dbl>
1 1 h 0.08
2 2 m 0.83
3 5 g 0.6
By row with tribble() (short for transposed tibble):
tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)
# A tibble: 3 × 3
x y z
<dbl> <chr> <dbl>
1 1 h 0.08
2 2 m 0.83
3 5 g 0.6
9.6 Homework Preview
For homework, you will:
- Import at least one CSV dataset
- Fix any parsing issues (e.g., column types, dates)
- Clean at least one column with mutate()
- Provide a short summary (using group_by() and summarize())
- Render to PDF and submit
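As a rough idea of what the summary step could look like, here is a sketch on made-up data. The table scores, its columns, and the chosen statistics are assumptions for illustration, and your own dataset will call for different groupings.

```r
library(dplyr)
library(tibble)

# Hypothetical data: one row per student.
scores <- tribble(
  ~meal_plan,            ~age,
  "Lunch only",          4,
  "Lunch only",          6,
  "Breakfast and lunch", 7
)

# One row per group, with a count and a mean.
summary_tbl <- scores |>
  group_by(meal_plan) |>
  summarize(n = n(), mean_age = mean(age))

summary_tbl
```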
9.7 Next Steps
Next week, we will learn to work with text data and regular expressions using the stringr package.