11 Strings and Regular Expressions with stringr

11.1 Learning Objectives

By the end of this chapter, you should be able to:

Manipulate strings using the stringr package
Detect patterns with regular expressions (regex)
Extract, replace, and split text
Clean messy text data for analysis

11.2 Introduction to `stringr`

The stringr package provides consistent, simple functions for string operations.

Load the library:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stringr)

11.3 Creating and Inspecting Strings

11.3.1 Simple string creation:

string1 <- "Here is a string"
string2 <- 'Single quotes work also'
string3 <- 'If I want to put a "quote" in a string, I use single quotes.'

11.4 Creating strings from data

11.4.1 `str_c()`

str_c() takes any number of vectors as arguments and returns a character vector.

Here is a simple character vector:

x <- c("apple", "banana", "pear")
x

[1] "apple"  "banana" "pear"

We can combine similar strings with str_c:

str_c("apple", "banana", "pear")

[1] "applebananapear"

str_c("apple ", "banana ", "pear ")

[1] "apple banana pear "

str_c(fruit, " is tasty")

 [1] "apple is tasty"             "apricot is tasty"          
 [3] "avocado is tasty"           "banana is tasty"           
 [5] "bell pepper is tasty"       "bilberry is tasty"         
 [7] "blackberry is tasty"        "blackcurrant is tasty"     
 [9] "blood orange is tasty"      "blueberry is tasty"        
[11] "boysenberry is tasty"       "breadfruit is tasty"       
[13] "canary melon is tasty"      "cantaloupe is tasty"       
[15] "cherimoya is tasty"         "cherry is tasty"           
[17] "chili pepper is tasty"      "clementine is tasty"       
[19] "cloudberry is tasty"        "coconut is tasty"          
[21] "cranberry is tasty"         "cucumber is tasty"         
[23] "currant is tasty"           "damson is tasty"           
[25] "date is tasty"              "dragonfruit is tasty"      
[27] "durian is tasty"            "eggplant is tasty"         
[29] "elderberry is tasty"        "feijoa is tasty"           
[31] "fig is tasty"               "goji berry is tasty"       
[33] "gooseberry is tasty"        "grape is tasty"            
[35] "grapefruit is tasty"        "guava is tasty"            
[37] "honeydew is tasty"          "huckleberry is tasty"      
[39] "jackfruit is tasty"         "jambul is tasty"           
[41] "jujube is tasty"            "kiwi fruit is tasty"       
[43] "kumquat is tasty"           "lemon is tasty"            
[45] "lime is tasty"              "loquat is tasty"           
[47] "lychee is tasty"            "mandarine is tasty"        
[49] "mango is tasty"             "mulberry is tasty"         
[51] "nectarine is tasty"         "nut is tasty"              
[53] "olive is tasty"             "orange is tasty"           
[55] "pamelo is tasty"            "papaya is tasty"           
[57] "passionfruit is tasty"      "peach is tasty"            
[59] "pear is tasty"              "persimmon is tasty"        
[61] "physalis is tasty"          "pineapple is tasty"        
[63] "plum is tasty"              "pomegranate is tasty"      
[65] "pomelo is tasty"            "purple mangosteen is tasty"
[67] "quince is tasty"            "raisin is tasty"           
[69] "rambutan is tasty"          "raspberry is tasty"        
[71] "redcurrant is tasty"        "rock melon is tasty"       
[73] "salal berry is tasty"       "satsuma is tasty"          
[75] "star fruit is tasty"        "strawberry is tasty"       
[77] "tamarillo is tasty"         "tangerine is tasty"        
[79] "ugli fruit is tasty"        "watermelon is tasty"
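The examples above glue arguments together with nothing in between. A minimal sketch of the `sep` and `collapse` arguments of str_c(), which control what goes between the pieces; the example strings are made up for illustration:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# sep is placed between the arguments that are joined element-wise
with_sep <- str_c("apple", "banana", "pear", sep = ", ")
with_sep
# [1] "apple, banana, pear"

# collapse joins the elements of a single vector into one string
collapsed <- str_c(x, collapse = ", ")
collapsed
# [1] "apple, banana, pear"

# Length-1 arguments are recycled, and a missing value stays missing
recycled <- str_c("I like ", c("apple", NA), "s")
recycled
# [1] "I like apples" NA
```

Use `sep` when combining several arguments and `collapse` when reducing one vector to a single string.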

11.4.2 Letters

str_length() gives you the number of characters in a string, counting spaces and punctuation as well as letters. You can provide a vector with multiple strings:

str_length(c("x", "CS506: Data Wrangling and Management", NA))

[1]  1 36 NA
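A small sketch, using made-up strings, showing that spaces, punctuation, and digits all count toward the length, and that an empty string has length 0:

```r
library(stringr)

# Every character counts: letters, spaces, punctuation, digits
lens <- str_length(c("a b", "a-b", "", "R4DS!"))
lens
# [1] 3 3 0 5
```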

11.4.3 In-Class Exercise 1 – Basic String Operations

Create a vector of at least 5 words.
Measure their lengths with str_length().
Concatenate them with the phrase " is cool".

lore <- c("Pokemon", "Harry Potter", "LOTR", "Star Wars", "Rocky")
str_length(lore)

[1]  7 12  4  9  5

str_c(lore, " is cool")

[1] "Pokemon is cool"      "Harry Potter is cool" "LOTR is cool"        
[4] "Star Wars is cool"    "Rocky is cool"

11.5 In-Class Challenge

Install the babynames package and load the library. Invoke glimpse() to understand the column names.
Use filter() to look at the longest names (those with exactly 15 letters).
Use count() to find the distribution of lengths of names (Hint: use the wt argument).

library(babynames)
babynames |>
  filter(str_length(name) == 15) |>
  count(name, wt = n) |>
  arrange(desc(n))

# A tibble: 34 × 2
   name                n
   <chr>           <int>
 1 Franciscojavier   123
 2 Christopherjohn   118
 3 Johnchristopher   118
 4 Christopherjame   108
 5 Christophermich    52
 6 Ryanchristopher    45
 7 Mariadelosangel    28
 8 Jonathanmichael    25
 9 Christianjoseph    22
10 Christopherjose    22
# ℹ 24 more rows
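For the last step of the challenge, one possible approach is to count on a computed length column and weight by `n`. The sketch below uses a small hypothetical table, `toy_names`, as a stand-in for babynames so that the result is easy to check by hand; the names and counts are invented:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for babynames: one row per name with a count n
toy_names <- tibble(
  name = c("Ann", "Bob", "Maria", "Jose", "Liam"),
  n    = c(10, 5, 7, 3, 2)
)

# Group by name length and sum n within each length (wt = n)
length_dist <- toy_names |>
  count(length = str_length(name), wt = n)

length_dist
```

Without `wt = n`, count() would tally rows rather than babies, so each name would contribute 1 regardless of how many children received it.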

11.6 Detecting Patterns with Regular Expressions

Regular expressions (“regex”) are a powerful way to detect and describe patterns in strings.

`str_view()`

str_view() can take a regular expression as its second argument.

Here it is with a literal character:

str_view(stringr::fruit, "berry")

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

Punctuation characters such as ., +, *, [, ], and ? have special meanings in a regular expression. These are called metacharacters.

. will match any character. "a." will match anything that is “a” followed by another character:

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")

[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>

We can also look for anything with an “a”, followed by any three characters, followed by an “e” (note that a space counts as a character):

str_view(stringr::fruit, "a...e")

 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry

`str_detect()`

str_detect() returns TRUE for each string in which the pattern is found.

animals <- c("dog", "cat", "parrot", "cow")
str_detect(animals, "o")

[1]  TRUE FALSE  TRUE  TRUE

You can use regular expressions for more complex patterns.

Examples:

^a – starts with “a”
ing$ – ends with “ing”
[0-9]+ – one or more digits

animals <- c("ant", "bat", "cat", "dog")
str_detect(animals, "^a")

[1]  TRUE FALSE FALSE FALSE
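The other two patterns in the list can be tried the same way. A short sketch with made-up test strings:

```r
library(stringr)

x <- c("running", "ring3", "sing", "42")

# ing$ : the string must end with "ing"
ends_ing <- str_detect(x, "ing$")
ends_ing
# [1]  TRUE FALSE  TRUE FALSE

# [0-9]+ : one or more digits anywhere in the string
has_digits <- str_detect(x, "[0-9]+")
has_digits
# [1] FALSE  TRUE FALSE  TRUE
```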

11.6.1 In-Class Exercise 2 – Pattern Detection

Create a vector of email-like strings.
Use str_detect() to check which contain "@".
Write a regex to detect strings ending in .com.

emails <- c("marc@nau.edu", "velma@gmail.com", "jack@yahoo.com", "sam@apple.com")
str_detect(emails, "@")

[1] TRUE TRUE TRUE TRUE

# Escape the dot so it matches a literal "." rather than any character
str_detect(emails, "\\.com$")

[1] FALSE  TRUE  TRUE  TRUE
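One subtlety with patterns like this: an unescaped . is a metacharacter that matches any character, so .com$ also matches strings that merely end in "com" preceded by something. A sketch of the difference, using made-up test strings:

```r
library(stringr)

candidates <- c("sam@apple.com", "telecom", "marc@nau.edu")

# Unescaped dot: "ecom" at the end of "telecom" also matches
loose <- str_detect(candidates, ".com$")
loose
# [1]  TRUE  TRUE FALSE

# Escaped dot: only a literal ".com" at the end matches
strict <- str_detect(candidates, "\\.com$")
strict
# [1]  TRUE FALSE FALSE
```

The backslash is doubled because the string itself needs one level of escaping before the regex engine sees `\.`.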

str_detect returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE if it doesn’t.

str_detect(c("a", "b", "c"), "[aeiou]")

[1]  TRUE FALSE FALSE

This pairs quite well with filter():

babynames |>
  filter(str_detect(name, "x")) |>
  count(name, wt = n, sort = TRUE)

# A tibble: 974 × 2
   name            n
   <chr>       <int>
 1 Alexander  665492
 2 Alexis     399551
 3 Alex       278705
 4 Alexandra  232223
 5 Max        148787
 6 Alexa      123032
 7 Maxine     112261
 8 Alexandria  97679
 9 Maxwell     90486
10 Jaxon       71234
# ℹ 964 more rows
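Because str_detect() returns a logical vector, it can also be summarised directly: sum() gives the number of matching strings and mean() gives the proportion. A minimal sketch with a made-up vector:

```r
library(stringr)

x <- c("apple", "banana", "pear", "kiwi")

has_a <- str_detect(x, "a")

# TRUE counts as 1 and FALSE as 0
n_with_a <- sum(has_a)
n_with_a
# [1] 3

prop_with_a <- mean(has_a)
prop_with_a
# [1] 0.75
```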

11.6.2 `str_count()`

Rather than a true or false, str_count() counts how many matches are in a string.

x <- c("apple", "banana", "pear")
str_count(x, "p")

[1] 2 0 1
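Two details are worth checking by hand: matches never overlap, and matching is case sensitive. A short sketch, with strings chosen only for illustration:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Count vowels with a character class
vowels <- str_count(x, "[aeiou]")
vowels
# [1] 2 3 2

# Matches do not overlap: "aba" is found twice in "abababa", not three times
overlap <- str_count("abababa", "aba")
overlap
# [1] 2

# Matching is case sensitive, so lower-case first if case should not matter
upper_a <- str_count("Apple", "a")
lower_a <- str_count(str_to_lower("Apple"), "a")
c(upper_a, lower_a)
# [1] 0 1
```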

11.7 Extracting and Replacing Text

11.7.1 `str_extract()`

Extracts the first match:

str_extract(c("abc123", "xyz789"), "[0-9]+")

[1] "123" "789"
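When a string has no match, str_extract() returns NA, and when it has several, only the first is returned. str_extract_all() returns every match as a list with one element per input string. A sketch with made-up inputs:

```r
library(stringr)

ids <- c("abc123def45", "none")

# First match only; NA where nothing matches
first_match <- str_extract(ids, "[0-9]+")
first_match
# [1] "123" NA

# All matches: a list with one character vector per input string
all_matches <- str_extract_all(ids, "[0-9]+")
all_matches[[1]]
# [1] "123" "45"
```

The second list element is an empty character vector because "none" contains no digits.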

11.7.2 `str_replace()`

Replaces the first match in each string (use str_replace_all() to replace every match):

str_replace("apple pie", "apple", "peach")

[1] "peach pie"
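The difference between replacing the first match and replacing every match shows up as soon as a string contains the pattern more than once. A minimal sketch:

```r
library(stringr)

# Only the first "a" is replaced
first_only <- str_replace("banana", "a", "o")
first_only
# [1] "bonana"

# Every "a" is replaced
every <- str_replace_all("banana", "a", "o")
every
# [1] "bonono"

# str_remove_all() is shorthand for replacing every match with ""
removed <- str_remove_all("banana", "a")
removed
# [1] "bnn"
```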

11.7.3 In-Class Exercise 3 – Extraction and Replacement

Extract digits from a vector of alphanumeric strings.
Replace the word "dog" with "puppy" in a text vector.

v <- c('1','fg3','5gh','j34', '0978')
str_extract(v, "[0-9]+")

[1] "1"    "3"    "5"    "34"   "0978"

str_replace(c("Big dog", "Bird dog", "Silly dog"), "dog", "puppy")

[1] "Big puppy"   "Bird puppy"  "Silly puppy"
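The learning objectives also mention splitting and cleaning text. As a rough sketch of how those steps might look, the following uses str_squish(), str_to_lower(), and str_split() on a made-up messy vector; which cleaning steps are appropriate depends on the data:

```r
library(stringr)

messy <- c("  Hello   World ", "R  is   FUN")

# Trim the ends, collapse repeated spaces, then lower-case
clean <- str_to_lower(str_squish(messy))
clean
# [1] "hello world" "r is fun"

# Split each cleaned string into words; the result is a list
pieces <- str_split(clean, " ")
pieces[[2]]
# [1] "r"   "is"  "fun"
```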

11.8 Pattern details

The pattern features in this section let us write more precise searches.

11.8.1 Escaping

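To include a literal quote in a string, escape it with a backslash. Printing a string shows the escapes, while str_view() shows the actual contents of the string without them: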

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
x <- c(double_quote, single_quote)
str_view(x)

[1] │ "
[2] │ '
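The same rule applies to the backslash itself, which is why regex escapes such as `\.` are typed with two backslashes. The sketch below checks the stored lengths; the raw-string syntax `r"(...)"` assumes R 4.0.0 or later:

```r
library(stringr)

# Typed as two characters, stored as one
backslash <- "\\"
backslash_len <- str_length(backslash)
backslash_len
# [1] 1

quote_len <- str_length("\"")
quote_len
# [1] 1

# A raw string avoids the doubling: both spell the two characters \.
same_pattern <- identical(r"(\.)", "\\.")
same_pattern
# [1] TRUE
```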

11.8.2 Anchors

Regular expressions match any part of a string by default. Sometimes you just want the start or end, so you need to anchor your regex:

^ matches the start:

str_view(stringr::fruit, "^a")

[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado

$ matches the end:

str_view(stringr::fruit, "a$")

 [4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>
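Using both anchors together forces the pattern to match the whole string rather than a part of it. A short sketch with made-up strings:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Unanchored: "apple" is found in every string
anywhere <- str_detect(x, "apple")
anywhere
# [1] TRUE TRUE TRUE

# Anchored at both ends: only the string that is exactly "apple" matches
whole <- str_detect(x, "^apple$")
whole
# [1] FALSE  TRUE FALSE
```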

11.8.3 Character classes

Character classes let you match any one of a set of characters. Let’s use the string vector stringr::words:

str_view(stringr::words, "[aeiou]x[aeiou]")

[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st

Or, if you want anything except your character class:

str_view(stringr::words, "[^aeiou]y[^aeiou]")

[836] │ <sys>tem
[901] │ <typ>e

Two other characters have special meaning inside []:

- defines a range (e.g. [a-z] or [0-9]).
\ escapes a special character, so [\^\-\]] matches ^, -, or ].
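A brief sketch of a range and of escaped characters inside a class. The test strings are made up, and each backslash is doubled because the pattern is written as an R string:

```r
library(stringr)

# [a-z] matches any single lower-case letter
has_lower <- str_detect(c("abc", "ABC", "123"), "[a-z]")
has_lower
# [1]  TRUE FALSE FALSE

# [\^\-\]] matches a literal ^, -, or ]
specials <- str_extract_all("a1-b2^c]", "[\\^\\-\\]]")[[1]]
specials
# [1] "-" "^" "]"
```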

11.8.4 Quantifiers

These control how many times a pattern can match.
- ? makes a pattern optional (i.e. it matches 0 or 1 times)
- + lets a pattern repeat (i.e. it matches at least once)
- * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).

# ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")

[1] │ <a>
[2] │ <ab>
[3] │ <ab>b

# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")

[2] │ <ab>
[3] │ <abb>

# ab* matches an "a", followed by any number of "b"s.
str_view(c("a", "ab", "abb"), "ab*")

[1] │ <a>
[2] │ <ab>
[3] │ <abb>

You can specify the number of matches with {}:

{n} - matches exactly n times.
{n,} - matches at least n times.
{n,m} - matches between n and m times.
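A minimal sketch of the three forms, anchored with ^ and $ so that the quantifier has to account for the whole string:

```r
library(stringr)

x <- c("a", "aa", "aaa", "aaaa")

# Exactly two "a"s
exactly_2 <- str_detect(x, "^a{2}$")
exactly_2
# [1] FALSE  TRUE FALSE FALSE

# Two or more "a"s
at_least_2 <- str_detect(x, "^a{2,}$")
at_least_2
# [1] FALSE  TRUE  TRUE  TRUE

# Between two and three "a"s
two_to_three <- str_detect(x, "^a{2,3}$")
two_to_three
# [1] FALSE  TRUE  TRUE FALSE
```

Without the anchors, a{2} would also be found inside "aaa" and "aaaa", since a regex matches any part of a string by default.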

11.9 In-Class Challenge

Using stringr::words, create regular expressions that find all words that:
- Start with “y”.
- Don’t start with “y”.
- End with “x”.
- Are exactly three letters long. (Don’t cheat by using str_length()!)
- Have seven letters or more.
- Contain a vowel-consonant pair.
- Contain at least two vowel-consonant pairs in a row.
- Only consist of repeated vowel-consonant pairs.
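As a starting point, here is one possible approach to three of these tasks. It uses str_subset(), which returns the strings that match a pattern; other regexes would work equally well, and the remaining tasks are left as practice:

```r
library(stringr)

# Start with "y"
starts_y <- str_subset(stringr::words, "^y")

# End with "x"
ends_x <- str_subset(stringr::words, "x$")

# Exactly three characters long: three "any character" dots between anchors
three_long <- str_subset(stringr::words, "^...$")

head(three_long)
```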

11.10 Next Steps

Next, we will learn to work with factors and categorical data using the forcats package.