11 Strings and Regular Expressions with stringr
11.1 Learning Objectives
By the end of this chapter, you should be able to:
- Manipulate strings using the stringr package
- Detect patterns with regular expressions (regex)
- Extract, replace, and split text
- Clean messy text data for analysis
11.2 Introduction to stringr
The stringr package provides consistent, simple functions for string operations.
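Load the library with library(stringr). It is also attached by library(tidyverse), since stringr is one of the core tidyverse packages.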
11.3 Creating and Inspecting Strings
11.3.1 Simple string creation:
string1 <- "Here is a string"
string2 <- 'Single quotes work also'
string3 <- 'If I want to put a "quote" in a string, I use single quotes.'
11.4 Creating strings from data
11.4.1 str_c()
str_c() takes any number of vectors as arguments and returns a character vector.
Here is a simple character vector:
x <- c("apple", "banana", "pear")
x
[1] "apple" "banana" "pear"
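We can combine strings with str_c(). The last example uses fruit, a character vector of fruit names that ships with stringr, and recycles " is tasty" across every element:
str_c("apple", "banana", "pear")
[1] "applebananapear"
str_c("apple ", "banana ", "pear ")
[1] "apple banana pear "
str_c(fruit, " is tasty")
[1] "apple is tasty" "apricot is tasty"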
[3] "avocado is tasty" "banana is tasty"
[5] "bell pepper is tasty" "bilberry is tasty"
[7] "blackberry is tasty" "blackcurrant is tasty"
[9] "blood orange is tasty" "blueberry is tasty"
[11] "boysenberry is tasty" "breadfruit is tasty"
[13] "canary melon is tasty" "cantaloupe is tasty"
[15] "cherimoya is tasty" "cherry is tasty"
[17] "chili pepper is tasty" "clementine is tasty"
[19] "cloudberry is tasty" "coconut is tasty"
[21] "cranberry is tasty" "cucumber is tasty"
[23] "currant is tasty" "damson is tasty"
[25] "date is tasty" "dragonfruit is tasty"
[27] "durian is tasty" "eggplant is tasty"
[29] "elderberry is tasty" "feijoa is tasty"
[31] "fig is tasty" "goji berry is tasty"
[33] "gooseberry is tasty" "grape is tasty"
[35] "grapefruit is tasty" "guava is tasty"
[37] "honeydew is tasty" "huckleberry is tasty"
[39] "jackfruit is tasty" "jambul is tasty"
[41] "jujube is tasty" "kiwi fruit is tasty"
[43] "kumquat is tasty" "lemon is tasty"
[45] "lime is tasty" "loquat is tasty"
[47] "lychee is tasty" "mandarine is tasty"
[49] "mango is tasty" "mulberry is tasty"
[51] "nectarine is tasty" "nut is tasty"
[53] "olive is tasty" "orange is tasty"
[55] "pamelo is tasty" "papaya is tasty"
[57] "passionfruit is tasty" "peach is tasty"
[59] "pear is tasty" "persimmon is tasty"
[61] "physalis is tasty" "pineapple is tasty"
[63] "plum is tasty" "pomegranate is tasty"
[65] "pomelo is tasty" "purple mangosteen is tasty"
[67] "quince is tasty" "raisin is tasty"
[69] "rambutan is tasty" "raspberry is tasty"
[71] "redcurrant is tasty" "rock melon is tasty"
[73] "salal berry is tasty" "satsuma is tasty"
[75] "star fruit is tasty" "strawberry is tasty"
[77] "tamarillo is tasty" "tangerine is tasty"
[79] "ugli fruit is tasty" "watermelon is tasty"
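The examples above join strings with nothing between them. As a minimal sketch, the sep and collapse arguments of str_c() control what goes between the pieces; the vector name fruits3 and the variable names are made up for illustration:

```r
library(stringr)

fruits3 <- c("apple", "banana", "pear")

# sep is placed between the arguments, element by element
with_sep <- str_c("apple", "banana", "pear", sep = ", ")

# collapse joins the elements of one vector into a single string
collapsed <- str_c(fruits3, collapse = ", ")

# a missing value in any argument makes that result missing
with_na <- str_c("fruit: ", c("apple", NA))

with_sep
collapsed
with_na
```

Both with_sep and collapsed come out as the single string "apple, banana, pear", while with_na keeps the NA instead of turning it into the text "NA".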
11.4.2 Letters
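str_length() gives you the number of characters in a string, counting spaces and punctuation as well as letters. You can provide a vector with multiple strings, and a missing value has a missing length:
str_length(c("x", "CS506: Data Wrangling and Management", NA))
[1] 1 36 NA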
11.4.3 In-Class Exercise 1 – Basic String Operations
- Create a vector of at least 5 words.
- Measure their lengths with str_length().
- Concatenate them with the phrase " is cool".
lore <- c("Pokemon", "Harry Potter", "LOTR", "Star Wars", "Rocky")
str_length(lore)
[1] 7 12 4 9 5
str_c(lore, " is cool")
[1] "Pokemon is cool" "Harry Potter is cool" "LOTR is cool"
[4] "Star Wars is cool" "Rocky is cool"
11.5 In-Class Challenge
- Install the babynames package and load the library. Invoke glimpse() to understand the column names.
- Use filter() to look at the longest names (those with exactly 15 letters).
- Use count() to find the distribution of lengths of names (Hint: use the wt argument).
library(babynames)
babynames |>
  filter(str_length(name) == 15) |>
  count(name, wt = n) |>
  arrange(desc(n))
# A tibble: 34 × 2
name n
<chr> <int>
1 Franciscojavier 123
2 Christopherjohn 118
3 Johnchristopher 118
4 Christopherjame 108
5 Christophermich 52
6 Ryanchristopher 45
7 Mariadelosangel 28
8 Jonathanmichael 25
9 Christianjoseph 22
10 Christopherjose 22
# ℹ 24 more rows
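The last step of the challenge, the distribution of name lengths, could be sketched as follows. To keep the example self-contained it uses a small made-up data frame, toy_names, with the same name and n columns as babynames; the values are hypothetical:

```r
library(dplyr)
library(stringr)

# Hypothetical toy data with the same name and n columns as babynames
toy_names <- data.frame(
  name = c("Al", "Bo", "Cy", "Ann", "Max"),
  n = c(10L, 5L, 1L, 7L, 3L)
)

# Group by name length and add up the n column instead of counting rows
length_dist <- toy_names |>
  count(length = str_length(name), wt = n)

length_dist
```

With the real data the same idea would be babynames |> count(length = str_length(name), wt = n). Without wt, count() would tally rows rather than babies.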
11.6 Detecting Patterns with Regular Expressions
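Regular expressions (“regex”) are a powerful way to detect and describe patterns in strings.
str_view()
str_view() takes a character vector as its first argument and a regular expression as its second. It shows only the elements that match, with each match wrapped in <>.
Here it is with a pattern made of literal characters:
str_view(fruit, "berry")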
[6] │ bil<berry>
[7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>
Punctuation characters such as ., +, *, [, ], and ? have special meanings in a regex. These are called metacharacters.
. matches any single character, so "a." matches any “a” that is followed by another character.
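A small sketch of the "a." pattern, using a made-up vector x. It uses str_detect() and str_extract() so the results are ordinary vectors rather than str_view() output:

```r
library(stringr)

x <- c("a", "ab", "ae", "bd", "ea", "eab")

# TRUE only where an "a" has at least one character after it
has_a_then_any <- str_detect(x, "a.")

# The matched text itself: the "a" plus the character that follows
first_match <- str_extract(x, "a.")

has_a_then_any
first_match
```

Note that "a" and "ea" do not match: in both, the “a” is the last character, so there is nothing left for . to match.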
We can also look for an “a”, followed by any three characters, followed by an “e”:
str_view(fruit, "a...e")
[1] │ <apple>
[7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry
str_detect()
str_detect() returns TRUE if a pattern is found.
animals <- c("dog", "cat", "parrot", "cow")
str_detect(animals, "o")
[1] TRUE FALSE TRUE TRUE
You can use regular expressions for more complex patterns.
Examples:
- ^a – starts with “a”
- ing$ – ends with “ing”
- [0-9]+ – one or more digits
animals <- c("ant", "bat", "cat", "dog")
str_detect(animals, "^a")
[1] TRUE FALSE FALSE FALSE
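The other two patterns from the list can be tried the same way. This is a minimal sketch on a made-up vector, words_demo:

```r
library(stringr)

words_demo <- c("sing", "singer", "running", "room 101")

# "ing$" requires the string to end in "ing", so "singer" fails
ends_in_ing <- str_detect(words_demo, "ing$")

# "[0-9]+" matches a run of one or more digits anywhere in the string
has_digits <- str_detect(words_demo, "[0-9]+")
digits <- str_extract(words_demo, "[0-9]+")

ends_in_ing
has_digits
digits
```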
11.6.1 In-Class Exercise 2 – Pattern Detection
- Create a vector of email-like strings.
- Use str_detect() to check which contain "@".
- Write a regex to detect strings ending in .com.
emails <- c("marc@nau.edu", "velma@gmail.com", "jack@yahoo.com", "sam@apple.com")
str_detect(emails, "@")
[1] TRUE TRUE TRUE TRUE
The dot is escaped so that it matches a literal “.” rather than any character:
str_detect(emails, "\\.com$")
[1] FALSE TRUE TRUE TRUE
str_detect returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE if it doesn’t.
str_detect(c("a", "b", "c"), "[aeiou]")
[1] TRUE FALSE FALSE
This pairs quite well with filter():
babynames |>
  filter(str_detect(name, "x")) |>
  count(name, wt = n, sort = TRUE)
# A tibble: 974 × 2
name n
<chr> <int>
1 Alexander 665492
2 Alexis 399551
3 Alex 278705
4 Alexandra 232223
5 Max 148787
6 Alexa 123032
7 Maxine 112261
8 Alexandria 97679
9 Maxwell 90486
10 Jaxon 71234
# ℹ 964 more rows
11.6.2 str_count()
Rather than returning TRUE or FALSE, str_count() returns the number of matches in each string.
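A short sketch of str_count() on a made-up vector. The second call illustrates that matches are counted without overlapping:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Number of "p" characters in each string
p_counts <- str_count(x, "p")

# Matches never overlap: "aba" is found twice in "abababa", not three times
overlap <- str_count("abababa", "aba")

p_counts
overlap
```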
11.7 Extracting and Replacing Text
11.7.1 str_extract()
Extracts the first match:
str_extract(c("abc123", "xyz789"), "[0-9]+")
[1] "123" "789"
11.7.2 str_replace()
Replaces the first match of a pattern in each string:
str_replace("apple pie", "apple", "peach")
[1] "peach pie"
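Both str_extract() and str_replace() stop at the first match. As a sketch on made-up strings, their _all variants, str_replace_all() and str_extract_all(), work on every match:

```r
library(stringr)

x <- c("apple pie", "apple and apple")

# str_replace() changes only the first match in each string
first_only <- str_replace(x, "apple", "peach")

# str_replace_all() changes every match
all_matches <- str_replace_all(x, "apple", "peach")

# str_extract_all() returns a list with one character vector per input string
all_digits <- str_extract_all("abc123def45", "[0-9]+")[[1]]

first_only
all_matches
all_digits
```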
11.7.3 In-Class Exercise 3 – Extraction and Replacement
- Extract digits from a vector of alphanumeric strings.
- Replace the word "dog" with "puppy" in a text vector.
v <- c('1','fg3','5gh','j34', '0978')
str_extract(v, "[0-9]+")
[1] "1" "3" "5" "34" "0978"
str_replace(c("Big dog", "Bird dog", "Silly dog"), "dog", "puppy")
[1] "Big puppy" "Bird puppy" "Silly puppy"
11.8 Pattern details
These regular expressions allow us to search our data with more detail.
11.8.1 Escaping
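To match a metacharacter literally, escape it with a backslash. Because the backslash is also the escape character in R strings, the regex \. has to be written as the string "\\.". You can use str_view() to view a string without the string-level escapes.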
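A minimal sketch of escaping, on a made-up vector. It also shows fixed(), a stringr helper that treats the whole pattern as literal text, as an alternative to escaping by hand:

```r
library(stringr)

x <- c("abc", "a.c", "bef")

# Unescaped, the dot matches any character, so "abc" matches too
unescaped <- str_detect(x, "a.c")

# "a\\.c" is the R string for the regex a\.c, which needs a literal dot
escaped <- str_detect(x, "a\\.c")

# The string holds a single backslash: a, \, ., c
pattern_length <- str_length("a\\.c")

# fixed() matches the pattern as plain text, so no escaping is needed
literal <- str_detect(x, fixed("a.c"))

unescaped
escaped
pattern_length
literal
```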
11.8.2 Anchors
Regular expressions match any part of a string by default. Sometimes you just want the start or end, so you need to anchor your regex:
^ matches the start of the string and $ matches the end.
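A small sketch of both anchors on made-up vectors. It uses str_subset(), which returns the elements that match:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# ^ anchors the match to the start of the string
starts_with_a <- str_subset(x, "^a")

# $ anchors the match to the end of the string
ends_with_a <- str_subset(x, "a$")

# Using both forces the pattern to match the whole string
y <- c("apple pie", "apple", "apple cake")
exactly_apple <- str_subset(y, "^apple$")

starts_with_a
ends_with_a
exactly_apple
```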
11.8.3 Character classes
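Character classes let you match any one of a set of characters: [abc] matches “a”, “b”, or “c”. Let’s use the string vector stringr::words to find every “x” surrounded by vowels:
str_view(words, "[aeiou]x[aeiou]")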
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
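Or, if you want anything except the characters in your class, start the class with ^, as in [^aeiou]. Two other characters have special meanings inside []: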
- A hyphen (-) defines a range, such as [a-z] or [0-9].
- A backslash (\) escapes a special character, so [\^\-\]] matches ^, -, or ].
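The three character-class features above (negation, ranges, and escapes) can be sketched on made-up strings as follows:

```r
library(stringr)

# [^aeiou] matches any character that is NOT a vowel,
# so this finds a "y" with a non-vowel on each side
negated <- str_subset(c("by", "yes", "sync", "eye"), "[^aeiou]y[^aeiou]")

# [0-9] is a range covering every digit
ranged <- str_extract("room 42b", "[0-9]+")

# Inside [], a backslash makes ^, - and ] literal.
# Each backslash is doubled because the pattern is an R string.
specials <- str_extract_all("a^b-c]d", "[\\^\\-\\]]")[[1]]

negated
ranged
specials
```

Only "sync" matches the first pattern: in "by" and "yes" the “y” sits at the edge of the string, and in "eye" it is surrounded by vowels.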
11.8.4 Quantifiers
These control how many times a pattern can match.
- ? makes a pattern optional (i.e. it matches 0 or 1 times)
- + lets a pattern repeat (i.e. it matches at least once)
- * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
x <- c("a", "ab", "abb")
str_view(x, "ab?")
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
str_view(x, "ab+")
[2] │ <ab>
[3] │ <abb>
str_view(x, "ab*")
[1] │ <a>
[2] │ <ab>
[3] │ <abb>
You can specify the number of matches precisely with {}:
- {n} matches exactly n times.
- {n,} matches at least n times.
- {n,m} matches between n and m times.
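A minimal sketch of the {} quantifiers on a made-up vector. str_extract() shows how much of each string is matched:

```r
library(stringr)

x <- c("a", "aa", "aaaa")

# Exactly two: the first two "a"s of any longer run
exactly_two <- str_extract(x, "a{2}")

# At least two: takes the whole run
at_least_two <- str_extract(x, "a{2,}")

# Between two and three: takes as many as allowed, up to three
two_to_three <- str_extract(x, "a{2,3}")

exactly_two
at_least_two
two_to_three
```

The single "a" gives NA in all three cases because it never reaches the minimum of two. Quantifiers are greedy by default, which is why the longest permitted match is returned.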
11.9 In-Class Challenge
Using stringr::words, create regular expressions that find all words that:
- Start with “y”.
- Don’t start with “y”.
- End with “x”.
- Are exactly three letters long. (Don’t cheat by using str_length()!)
- Have seven letters or more.
- Contain a vowel-consonant pair.
- Contain at least two vowel-consonant pairs in a row.
- Only consist of repeated vowel-consonant pairs.
————————————————————————
11.10 Next Steps
Next, we will learn to work with factors and categorical data using the forcats package.