Cleaning Messy Strings Using stringr and Regexps
Motivation
I recently completed several projects that involved cleaning lots of messy data, in particular, free text strings that I needed to organize into tidy variables and text-mine for insights. To do this efficiently, I learned how to use regular expressions (regexps), as suggested by Hadley Wickham and Garrett Grolemund’s in R for Data Science.
As the authors suggest, it takes a little while to learn the rules of regexps and get comfortable using them, but once you do, you’ll see they’re a powerful tool to have in your data cleaning toolbox. The best resources I found for learning and practicing regexps are the RStudio stringr cheatsheet (found here) and this handy interactive program.
Below is a bit of practice with stringr and regexps using Wikipedia’s list of tallest people on record. I realize this is not exactly a health topic, but it’s a simple motivating example that can be leveraged to any messy dataset containing strings. We’ll be using the rvest package to scrape an html table from the Wikipedia page, as outlined here and here.
Import
The original table looks like this:

The steps below scrape the data using rvest.
# wikipedia source url
url <- 'https://en.wikipedia.org/wiki/List_of_tallest_people'
# specify table
tall_table <-
url %>%
read_html() %>%
html_nodes(xpath = ('//*[@id="mw-content-text"]/div[1]/table[2]')) %>%
html_table()
tall_table <-
tall_table[[1]] %>%
janitor::clean_names()
tall_table %>%
select(c(metric, lifespan_age_at_death, imperial, note))
## # A tibble: 143 × 4
## metric lifespan_age_at_death imperial note
## <chr> <chr> <chr> <chr>
## 1 272 cm 1918–1940 (22) 8 ft 11.1 in Tallest human in recorded histor…
## 2 269 cm 1989–2015 (26) 8 ft 10 in Was not measured officially by G…
## 3 268 cm 1986–2007 (21) 8 ft 9.5 in Was not measured officially by G…
## 4 267 cm 1868–1905 (37) 8 ft 9 in Second tallest male in recorded …
## 5 263.5 cm 1932–1969 (37) 8 ft 7.75 in 8 ft 0 in (244 cm) standing heig…
## 6 262 cm 1924–1943 (18) 8 ft 7 in Billed at 8 ft 7 in (262 cm).[16]
## 7 262 cm 1906–1952 (46) 8 ft 7 in Jacob Rheuben Ehrlich. American …
## 8 258 cm 1860–1887 (27) 8 ft 5.5 in Tallest human in European histor…
## 9 258 cm 1970–2014 (44) 8 ft 5.5 in Not officially recognized by Gui…
## 10 257 cm 1984–2019 (34) 8 ft 5.25 in After he died he was measured po…
## # … with 133 more rows
Clearly this requires some data cleaning. Specifically, we need to address the following problems:
- The variables metric and imperial mix numbers and text to express height in their respective units
- The lifespan_age_at_death variable combines three numeric variables into one column
- The note columns often contains reference numbers in brackets
Tidy
1. The metric Variable
First we’ll clean up metric by removing the “cm” on the end of every observation, converting it to a numeric variable, and updating the variable name.
# remove "cm"
tall_table_1 <-
tall_table %>%
mutate(metric = str_replace(metric, "cm", ""))
head(tall_table_1$metric)
## [1] "272 " "269 " "268 " "267 " "263.5 " "262 "
Here we run into our first problem: trailing blanks. When we removed “cm” from the strings, we created phantom blanks, as evidenced by the space between the last digit and the end quote. These will prevent us from cleanly coercing the characters to numbers:
tall_table_test <-
tall_table_1 %>%
mutate(metric = as.numeric(metric))
head(tall_table_test$metric)
## [1] NA NA NA NA NA NA
This is very easy to fix with str_trim:
# trim trailing blanks
tall_table_2 <-
tall_table_1 %>%
mutate(metric = str_trim(metric, side = "both"),
metric = as.numeric(metric)) %>%
rename(ht_cm = metric)
tall_table_2$ht_cm
## [1] 272.00 269.00 268.00 267.00 263.50 262.00 262.00 258.00 258.00 257.00
## [11] 256.00 255.00 251.40 251.00 250.00 248.90 248.90 246.00 246.00 246.00
## [21] 245.70 245.00 245.00 244.00 244.00 244.00 244.00 244.00 244.00 244.00
## [31] 242.00 242.00 242.00 242.00 241.00 241.00 240.00 240.00 240.00 239.00
## [41] 238.00 238.00 237.00 236.00 236.00 236.00 236.00 236.00 236.00 236.00
## [51] 236.00 232.00 236.00 236.17 236.10 232.00 235.90 233.50 235.00 235.00
## [61] 235.00 235.00 235.00 235.00 234.00 234.00 234.00 234.00 234.00 234.00
## [71] 234.00 234.00 234.00 234.00 234.00 233.00 233.00 233.00 233.00 233.00
## [81] 231.00 233.00 231.00 231.00 231.00 231.00 231.00 231.00 231.00 231.00
## [91] 231.00 230.00 230.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00
## [101] 229.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00
## [111] 229.00 228.00 228.00 228.00 228.00 228.00 227.00 226.00 226.00 226.00
## [121] 226.00 226.00 226.00 225.00 225.00 225.00 225.00 224.00 224.00 223.00
## [131] 222.00 222.00 222.00 221.00 221.00 221.00 221.00 221.00 218.00 218.00
## [141] 218.00 216.00 210.00
Now we move on to the imperial variable, which takes on the following (un-tidy) values:
tall_table_2$imperial
## [1] "8 ft 11.1 in"
## [2] "8 ft 10 in"
## [3] "8 ft 9.5 in"
## [4] "8 ft 9 in"
## [5] "8 ft 7.75 in"
## [6] "8 ft 7 in"
## [7] "8 ft 7 in"
## [8] "8 ft 5.5 in"
## [9] "8 ft 5.5 in"
## [10] "8 ft 5.25 in"
## [11] "8 ft 4.8 in"
## [12] "8 ft 4.5 in"
## [13] "8 ft 3 in"
## [14] "8 ft 2.8 in[25]"
## [15] "8 ft 2.4 in"
## [16] "8 ft 2 in"
## [17] "8 ft 2 in"
## [18] "8 ft 1 in[29]"
## [19] "8 ft 1 in"
## [20] "8 ft 1 in"
## [21] "8 ft 0.75 in"
## [22] "8 ft 0.5 in"
## [23] "8 ft 0.4 in"
## [24] "8 ft 0 in-8 ft 1 in"
## [25] "8 ft 0 in"
## [26] "8 ft 0 in"
## [27] "8 ft 0 in"
## [28] "8 ft 0 in"
## [29] "8 ft 0 in"
## [30] "8 ft 0 in"
## [31] "7 ft 11.5 in"
## [32] "7 ft 11.5 in"
## [33] "7 ft 11.5 in"
## [34] "7 ft 11.5 in"
## [35] "7 ft 11 in"
## [36] "7 ft 11 in"
## [37] "7 ft 10.5 in"
## [38] "7 ft 10.5 in"
## [39] "7 ft 10.5 in"
## [40] "7 ft 10 in"
## [41] "7 ft 10 in"
## [42] "7 ft 10 in"
## [43] "7 ft 9.3 in"
## [44] "7 ft 9 in"
## [45] "7 ft 9 in"
## [46] "7 ft 9 in"
## [47] "7 ft 9 in"
## [48] "7 ft 9 in"
## [49] "7 ft 9 in"
## [50] "7 ft 9 in"
## [51] "7 ft 9 in"
## [52] "7 ft 7.3 in"
## [53] "7 ft 9 in"
## [54] "7 ft 8.98 in"
## [55] "7 ft 8.95 in"
## [56] "7 ft 7¼ in"
## [57] "7 ft 8.87 in"
## [58] "7 ft .mw-parser-output .frac{white-space:nowrap}.mw-parser-output .frac .num,.mw-parser-output .frac .den{font-size:80%;line-height:0;vertical-align:super}.mw-parser-output .frac .den{vertical-align:sub}.mw-parser-output .sr-only{border:0;clip:rect(0,0,0,0);height:1px;margin:-1px;overflow:hidden;padding:0;position:absolute;width:1px}8+1⁄3 in"
## [59] "7 ft 81⁄2 in"
## [60] "7 ft 81⁄2 in"
## [61] "7 ft 8+1⁄2 in"
## [62] "7 ft 8+1⁄2 in"
## [63] "7 ft 8 in"
## [64] "7 ft 8 in"
## [65] "7 ft 8 in"
## [66] "7 ft 8 in"
## [67] "7 ft 8 in"
## [68] "7 ft 8 in"
## [69] "7 ft 8 in"
## [70] "7 ft 8 in"
## [71] "7 ft 8 in"
## [72] "7 ft 8 in"
## [73] "7 ft 8 in"
## [74] "7 ft 8 in"
## [75] "7 ft 8 in"
## [76] "7 ft 7.7 in"
## [77] "7 ft 7.7 in"
## [78] "7 ft 7+1⁄2 in"
## [79] "7 ft 7.4 in"
## [80] "7 ft 7.4 in"
## [81] "7 ft 7.26 in"
## [82] "7 ft 7 in"
## [83] "7 ft 7 in"
## [84] "7 ft 7 in"
## [85] "7 ft 7 in"
## [86] "7 ft 7 in"
## [87] "7 ft 7 in"
## [88] "7 ft 7 in"
## [89] "7 ft 7 in"
## [90] "7 ft 7 in"
## [91] "7 ft 7 in"
## [92] "7 ft 7 in"
## [93] "7 ft 6.75 in"
## [94] "7 ft 6.25 in"
## [95] "7 ft 6.25 in"
## [96] "7 ft 6+1⁄4 in"
## [97] "7 ft 6+1⁄5 in"
## [98] "7 ft 6 in"
## [99] "7 ft 6 in"
## [100] "7 ft 6 in"
## [101] "7 ft 6 in"
## [102] "7 ft 6 in"
## [103] "7 ft 6 in"
## [104] "7 ft 6 in"
## [105] "7 ft 6 in"
## [106] "7 ft 6 in"
## [107] "7 ft 6 in"
## [108] "7 ft 6 in"
## [109] "7 ft 6 in"
## [110] "7 ft 6 in"
## [111] "7 ft 5.8 in"
## [112] "7 ft 5+2⁄3 in"
## [113] "7 ft 5+2⁄3 in"
## [114] "7 ft 5+2⁄3 in"
## [115] "7 ft 5+2⁄3 in"
## [116] "7 ft 5+2⁄3 in"
## [117] "7 ft 5+1⁄3"
## [118] "7 ft 5 in"
## [119] "7 ft 5 in"
## [120] "7 ft 5 in"
## [121] "7 ft 5 in"
## [122] "7 ft 5 in"
## [123] "7 ft 5 in"
## [124] "7 ft 4.5 in"
## [125] "7 ft 4+1⁄2 in"
## [126] "7 ft 4+1⁄2 in"
## [127] "7 ft 4.5 in"
## [128] "7 ft 4 in"
## [129] "7 ft 4 in"
## [130] "7 ft 4 in"
## [131] "7 ft 3.4 in"
## [132] "7 ft 3.5 in"
## [133] "7 ft 3.5 in"
## [134] "7 ft 3 in"
## [135] "7 ft 3 in"
## [136] "7 ft 3 in"
## [137] "7 ft 3 in"
## [138] "7 ft 3 in"
## [139] "7 ft 2 in"
## [140] "7 ft 2 in"
## [141] "7 ft 2 in"
## [142] "7 ft 1 in"
## [143] "7 ft 0 in"
We’ve got numbers with brackets (links to footnotes in the original Wikipedia page), we’ve got some inches measures that look like “1.5 inches”, some that look like “1+1/2 in”, we have some that drop the plus (“81/2 in”, which should probably read “8 1/2 in”), and then we have one observation that seems to have carried over some style code.
First we’ll fix the style code observation by manually overwriting the value using the original table. We’ll also manually fix the observation that lists a height range (“8 ft 0 in-8 ft 1 in”) since for our purposes, there’s not much difference between an 8-foot person and an 8-foot-one person - they’re both very, very tall.
# manual fixes
tall_table_3 <-
tall_table_2 %>%
mutate(imperial =
case_when(name == "Rafael França do Nascimento" ~ "7 ft 8+1⁄3 in",
name == "James Toller" ~ "8 ft 0 in",
TRUE ~ imperial) ,
imperial = str_trim(imperial, side = "both")
)
tall_table_3$imperial
## [1] "8 ft 11.1 in" "8 ft 10 in" "8 ft 9.5 in" "8 ft 9 in"
## [5] "8 ft 7.75 in" "8 ft 7 in" "8 ft 7 in" "8 ft 5.5 in"
## [9] "8 ft 5.5 in" "8 ft 5.25 in" "8 ft 4.8 in" "8 ft 4.5 in"
## [13] "8 ft 3 in" "8 ft 2.8 in[25]" "8 ft 2.4 in" "8 ft 2 in"
## [17] "8 ft 2 in" "8 ft 1 in[29]" "8 ft 1 in" "8 ft 1 in"
## [21] "8 ft 0.75 in" "8 ft 0.5 in" "8 ft 0.4 in" "8 ft 0 in"
## [25] "8 ft 0 in" "8 ft 0 in" "8 ft 0 in" "8 ft 0 in"
## [29] "8 ft 0 in" "8 ft 0 in" "7 ft 11.5 in" "7 ft 11.5 in"
## [33] "7 ft 11.5 in" "7 ft 11.5 in" "7 ft 11 in" "7 ft 11 in"
## [37] "7 ft 10.5 in" "7 ft 10.5 in" "7 ft 10.5 in" "7 ft 10 in"
## [41] "7 ft 10 in" "7 ft 10 in" "7 ft 9.3 in" "7 ft 9 in"
## [45] "7 ft 9 in" "7 ft 9 in" "7 ft 9 in" "7 ft 9 in"
## [49] "7 ft 9 in" "7 ft 9 in" "7 ft 9 in" "7 ft 7.3 in"
## [53] "7 ft 9 in" "7 ft 8.98 in" "7 ft 8.95 in" "7 ft 7¼ in"
## [57] "7 ft 8.87 in" "7 ft 8+1⁄3 in" "7 ft 81⁄2 in" "7 ft 81⁄2 in"
## [61] "7 ft 8+1⁄2 in" "7 ft 8+1⁄2 in" "7 ft 8 in" "7 ft 8 in"
## [65] "7 ft 8 in" "7 ft 8 in" "7 ft 8 in" "7 ft 8 in"
## [69] "7 ft 8 in" "7 ft 8 in" "7 ft 8 in" "7 ft 8 in"
## [73] "7 ft 8 in" "7 ft 8 in" "7 ft 8 in" "7 ft 7.7 in"
## [77] "7 ft 7.7 in" "7 ft 7+1⁄2 in" "7 ft 7.4 in" "7 ft 7.4 in"
## [81] "7 ft 7.26 in" "7 ft 7 in" "7 ft 7 in" "7 ft 7 in"
## [85] "7 ft 7 in" "7 ft 7 in" "7 ft 7 in" "7 ft 7 in"
## [89] "7 ft 7 in" "7 ft 7 in" "7 ft 7 in" "7 ft 7 in"
## [93] "7 ft 6.75 in" "7 ft 6.25 in" "7 ft 6.25 in" "7 ft 6+1⁄4 in"
## [97] "7 ft 6+1⁄5 in" "7 ft 6 in" "7 ft 6 in" "7 ft 6 in"
## [101] "7 ft 6 in" "7 ft 6 in" "7 ft 6 in" "7 ft 6 in"
## [105] "7 ft 6 in" "7 ft 6 in" "7 ft 6 in" "7 ft 6 in"
## [109] "7 ft 6 in" "7 ft 6 in" "7 ft 5.8 in" "7 ft 5+2⁄3 in"
## [113] "7 ft 5+2⁄3 in" "7 ft 5+2⁄3 in" "7 ft 5+2⁄3 in" "7 ft 5+2⁄3 in"
## [117] "7 ft 5+1⁄3" "7 ft 5 in" "7 ft 5 in" "7 ft 5 in"
## [121] "7 ft 5 in" "7 ft 5 in" "7 ft 5 in" "7 ft 4.5 in"
## [125] "7 ft 4+1⁄2 in" "7 ft 4+1⁄2 in" "7 ft 4.5 in" "7 ft 4 in"
## [129] "7 ft 4 in" "7 ft 4 in" "7 ft 3.4 in" "7 ft 3.5 in"
## [133] "7 ft 3.5 in" "7 ft 3 in" "7 ft 3 in" "7 ft 3 in"
## [137] "7 ft 3 in" "7 ft 3 in" "7 ft 2 in" "7 ft 2 in"
## [141] "7 ft 2 in" "7 ft 1 in" "7 ft 0 in"
Next, we’ll remove the un-needed reference numbers in brackets. We can do that manually, too, but it’s not exactly a big-data approach, so let’s use regular expressions (regexps) instead.
The challenge here is that brackets are a special character in regex, so we’ll need to “escape” them (meaning, get R to understand that we mean a literal bracket). The stringr cheat sheet serves as a great summary. You’ll notice that to escape a bracket, you need a double backslash in front of it.
Here’s a demo of how it works:
# demonstrate regexps
tall_table_3 %>%
mutate(test = str_extract(imperial, "\\[\\d*\\]")) %>%
filter(str_detect(imperial, "\\[\\d*\\]")) %>%
select(imperial, test)
## # A tibble: 2 × 2
## imperial test
## <chr> <chr>
## 1 8 ft 2.8 in[25] [25]
## 2 8 ft 1 in[29] [29]
tall_table_4 <-
tall_table_3 %>%
mutate(imperial = str_remove(imperial, "\\[\\d*\\]"))
Next, we need to separate the feet and inches values so we can convert to a value like XX.XX inches, allowing us to do numerical manipulation on the value. The pattern in imperial is now “X ft XX in”, where the inches are expressed in different ways (1/2, 0.5). Let’s do inches first. We want to extract out the number that lies between “ft” and “in”. For this, we can use “look around” regexps, that find a pattern preceding or following another pattern. Roughly speaking:
- “a(?=c)” means “a” followed by “c”
- “(?<=b)a” means “a” preceded by “b”
- Note that “[:graph:]” means any letter, number, or punctuation, and the asterisk after it means repeated zero ore more times.
# demo
tall_table_4 %>%
mutate(inches = str_extract(imperial, "(?<=ft)\\s[:graph:]*\\s(?=in)")) %>%
select(imperial, inches)
## # A tibble: 143 × 2
## imperial inches
## <chr> <chr>
## 1 8 ft 11.1 in " 11.1 "
## 2 8 ft 10 in " 10 "
## 3 8 ft 9.5 in " 9.5 "
## 4 8 ft 9 in " 9 "
## 5 8 ft 7.75 in " 7.75 "
## 6 8 ft 7 in " 7 "
## 7 8 ft 7 in " 7 "
## 8 8 ft 5.5 in " 5.5 "
## 9 8 ft 5.5 in " 5.5 "
## 10 8 ft 5.25 in " 5.25 "
## # … with 133 more rows
# extract inches value and remove everything from "ft" onward
tall_table_5 <-
tall_table_4 %>%
mutate(inches = str_extract(imperial, "(?<=ft)\\s[:graph:]*\\s(?=in)"),
imperial = str_remove(imperial, "\\s(ft)\\s[:graph:]*\\s[:graph:]*")) %>%
rename(ht_ft = imperial)
# trim white-space
tall_table_6 <-
tall_table_5 %>%
mutate(inches = str_trim(inches, side = c("both")),
ht_ft = str_trim(ht_ft, side = c("both")))
Now, we get to the most challenging part: transforming inches into a numerical variable. This means converting numbers written as fractions into decimals, while not transforming the already-clean decimal values. The simplest way to do this is probably to find a forward-slash (/), and find the numbers immediately before and after it. After that, we’ll separate out the numbers and divide them to arrive at a final, numerical inches_decimal.
# clean inches variable
tall_table_7 <-
tall_table_6 %>%
mutate(inches_fraction = str_extract(inches, "\\d\\⁄\\d"),
inches = str_remove(inches, "\\d\\⁄\\d"),
inches = str_remove(inches, "\\+"),
inches = str_trim(inches, side = c("both"))) %>%
mutate(inches_1 = as.numeric(str_sub(inches_fraction, 1, 1)),
inches_2 = as.numeric(str_sub(inches_fraction, -1, -1)),
inches_decimal = round((inches_1/inches_2), digits = 2))
# demo
tall_table_7 %>%
filter(!is.na(inches_decimal)) %>%
select(inches, inches_fraction, inches_1, inches_2, inches_decimal)
## # A tibble: 15 × 5
## inches inches_fraction inches_1 inches_2 inches_decimal
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 8 1⁄3 1 3 0.33
## 2 8 1⁄2 1 2 0.5
## 3 8 1⁄2 1 2 0.5
## 4 8 1⁄2 1 2 0.5
## 5 8 1⁄2 1 2 0.5
## 6 7 1⁄2 1 2 0.5
## 7 6 1⁄4 1 4 0.25
## 8 6 1⁄5 1 5 0.2
## 9 5 2⁄3 2 3 0.67
## 10 5 2⁄3 2 3 0.67
## 11 5 2⁄3 2 3 0.67
## 12 5 2⁄3 2 3 0.67
## 13 5 2⁄3 2 3 0.67
## 14 4 1⁄2 1 2 0.5
## 15 4 1⁄2 1 2 0.5
Finally, we will add inches to inches_decimal to arrive at the final portion of height in inches. We’ll also address the outlier “7¼” inches. Then, we’ll convert the now-clean inches to feet, and add it to ht_ft.
tall_table_8 <-
tall_table_7 %>%
mutate(inches = if_else(inches == "7¼", "7.25", inches),
inches = as.numeric(inches),
inches = if_else(is.na(inches), 0, inches),
in_to_feet = round((inches/12), digits = 2),
ht_ft = if_else(name == "Jeison Rodríguez", "7.44", ht_ft),
ht_ft = round(as.numeric(ht_ft), digits = 2),
ht_ft = ht_ft + in_to_feet
) %>%
select(-c(inches_fraction, inches, inches_1, inches_2, inches_decimal, in_to_feet))
Now we can do a fun exercise - we can check our work by directly converting ht_cm to height in feet (one foot is 30.48 cm), and seeing if all the wrangling we underwent with inches and feet gave us correct numbers.
# test whether calculated value matches string extract
test <-
tall_table_8 %>%
select(ht_ft, ht_cm) %>%
mutate(check_ht = round(ht_ft*30.48, digits = 2))
test
## # A tibble: 143 × 3
## ht_ft ht_cm check_ht
## <dbl> <dbl> <dbl>
## 1 8.92 272 272.
## 2 8.83 269 269.
## 3 8.79 268 268.
## 4 8.75 267 267.
## 5 8.65 264. 264.
## 6 8.58 262 262.
## 7 8.58 262 262.
## 8 8.46 258 258.
## 9 8.46 258 258.
## 10 8.44 257 257.
## # … with 133 more rows
We’re not dead on, but we got pretty close. The reason for the error has a lot more to do with rounding in the original table than our text-parsing. And after all that work, our data-frame is tidy-ish.
2. The lifespan_age_at_death Variable
This is what the lifespan_age_at_death column looks like now:
tall_table_8 %>%
select(lifespan_age_at_death)
## # A tibble: 143 × 1
## lifespan_age_at_death
## <chr>
## 1 1918–1940 (22)
## 2 1989–2015 (26)
## 3 1986–2007 (21)
## 4 1868–1905 (37)
## 5 1932–1969 (37)
## 6 1924–1943 (18)
## 7 1906–1952 (46)
## 8 1860–1887 (27)
## 9 1970–2014 (44)
## 10 1984–2019 (34)
## # … with 133 more rows
We want the year of birth, year of death, and age at death to be their own variables. Scanning the values, there are several different scenarios:
1. Birth year - death year (YYYY-YYYY) followed by (age or age range)
2. Range for birth year and death year (YYYY/YY - YYYY/YY)
3. “born” followed by a year of birth (YYYY) and no death year
4. Some text followed by “?” followed by a year
5. Year of birth
6. “born” followed by a range and some text
7. Some entries contain a reference number in brackets
First we’ll remove brackets and anything between them, addressing scenario 7. Then we’ll populate an age_at_death variable by extracting anything in parentheses.
# define regular expression for any string between parentheses, not including a parenthesis
regex_bet_paren <- "\\([^)]+\\)"
tall_table_9 <-
tall_table_8 %>%
mutate(lifespan_age_at_death = str_remove_all(lifespan_age_at_death, "\\[\\d*\\]"),
age_at_death = str_extract(lifespan_age_at_death, regex_bet_paren),
lifespan_age_at_death = str_remove_all(lifespan_age_at_death, regex_bet_paren),
age_at_death = str_remove_all(age_at_death, "\\(|\\)"),
lifespan_age_at_death = str_trim(lifespan_age_at_death, side = "both"),
age_at_death = str_trim(age_at_death, side = "both"),
age_at_death = if_else(age_at_death == "aged 24 as of March 2022", "", age_at_death)
)
Next we separate birth year and death year by the hyphen delimiter. Note the regexp meanings:
- \\d* means any quantity of consecutive digits
- (?=–) means something followed by a hyphen
- (?<=–) means something following a hyphen
tall_table_10 <-
tall_table_9 %>%
mutate(birth_year = str_extract(lifespan_age_at_death, "\\d*(?=–)"),
death_year = str_extract(lifespan_age_at_death, "(?<=–)\\d*"))
tall_table_10
## # A tibble: 143 × 9
## country ht_cm ht_ft name note lifes…¹ age_a…² birth…³ death…⁴
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 United States 272 8.92 Robert Wadlow Tall… 1918–1… 22 1918 1940
## 2 Thailand 269 8.83 Pornchai Sao… Was … 1989–2… 26 1989 2015
## 3 India 268 8.79 Vikas Uppal Was … 1986–2… 21 1986 2007
## 4 United States 267 8.75 John Rogan Seco… 1868–1… 37 1868 1905
## 5 United States 264. 8.65 John F. Carr… 8 ft… 1932–1… 37 1932 1969
## 6 United States 262 8.58 Willie Camper Bill… 1924–1… 18 1924 1943
## 7 United States 262 8.58 Jack Earle Jaco… 1906–1… 46 1906 1952
## 8 Austria 258 8.46 Franz Winkel… Tall… 1860–1… 27 1860 1887
## 9 Ukraine 258 8.46 Leonid Stadn… Not … 1970–2… 44 1970 2014
## 10 Vietnam 257 8.44 Hồ Văn Trung Afte… 1984–2… 34 1984 2019
## # … with 133 more rows, and abbreviated variable names ¹lifespan_age_at_death,
## # ²age_at_death, ³birth_year, ⁴death_year
Next we’ll address scenario 3: “born” followed by a year of birth (YYYY) and no death year. We’ll also make some manual fixes for a few observtions that don’t translate using our rules.
# populate birth year using a number preceded by "born"
tall_table_11 <-
tall_table_10 %>%
mutate(birth_year = if_else(is.na(birth_year), str_extract(lifespan_age_at_death, "(?<=born )\\d*"), birth_year))
# manual fixes
tall_table_12 <-
tall_table_11 %>%
mutate(birth_year = case_when(lifespan_age_at_death == "1958-1992" ~ "1958",
lifespan_age_at_death == "c. born 1995" ~ "1995",
lifespan_age_at_death == "1844-1899" ~"1844",
lifespan_age_at_death == "1946-1993" ~"1946",
lifespan_age_at_death == "1880/85 – 1925/30" ~ "1880",
TRUE ~ birth_year),
death_year = case_when(lifespan_age_at_death == "1958-1992" ~ "1992",
lifespan_age_at_death == "fl.?-March 2019" ~ "2019",
lifespan_age_at_death == "1844-1899" ~ "1899",
lifespan_age_at_death == "1946-1993" ~ "1993",
lifespan_age_at_death == "1880/85 – 1925/30" ~ "1925", TRUE ~ death_year)) %>%
mutate(birth_year = as.numeric(birth_year),
death_year = as.numeric(death_year)) %>%
filter(!lifespan_age_at_death == "173–238")
# check
tall_table_12 %>%
select(lifespan_age_at_death, birth_year, death_year) %>%
filter(is.na(birth_year) | is.na(death_year))
## # A tibble: 65 × 3
## lifespan_age_at_death birth_year death_year
## <chr> <dbl> <dbl>
## 1 born 1982 1982 NA
## 2 born 1982 1982 NA
## 3 born 1987 1987 NA
## 4 born 1983 1983 NA
## 5 1835–? 1835 NA
## 6 born 1987 1987 NA
## 7 born 1994 1994 NA
## 8 born 1999 1999 NA
## 9 fl. ?–1596 NA 1596
## 10 born 1966 1966 NA
## # … with 55 more rows
3. The note Variable
Finally, we’ll clean up note by removing anything between brackets.
# clean note variable
tall_table_13 <-
tall_table_12 %>%
mutate(note = str_remove_all(note, "\\[\\d*\\]"))
tall_table_13
## # A tibble: 142 × 9
## country ht_cm ht_ft name note lifes…¹ age_a…² birth…³ death…⁴
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 United States 272 8.92 Robert Wadlow Tall… 1918–1… 22 1918 1940
## 2 Thailand 269 8.83 Pornchai Sao… Was … 1989–2… 26 1989 2015
## 3 India 268 8.79 Vikas Uppal Was … 1986–2… 21 1986 2007
## 4 United States 267 8.75 John Rogan Seco… 1868–1… 37 1868 1905
## 5 United States 264. 8.65 John F. Carr… 8 ft… 1932–1… 37 1932 1969
## 6 United States 262 8.58 Willie Camper Bill… 1924–1… 18 1924 1943
## 7 United States 262 8.58 Jack Earle Jaco… 1906–1… 46 1906 1952
## 8 Austria 258 8.46 Franz Winkel… Tall… 1860–1… 27 1860 1887
## 9 Ukraine 258 8.46 Leonid Stadn… Not … 1970–2… 44 1970 2014
## 10 Vietnam 257 8.44 Hồ Văn Trung Afte… 1984–2… 34 1984 2019
## # … with 132 more rows, and abbreviated variable names ¹lifespan_age_at_death,
## # ²age_at_death, ³birth_year, ⁴death_year
Conclusion
That’s more or less all there is to it. If you’re working with big data, it’ll be a lot more challenging to account for all the patterns. But this example gives us an entry point to regular expressions and using them within the context of stringr.
You can also learn a lot more about working with strings here: Handling Strings with R.