Cleaning Messy Strings Using stringr and Regexps

Alice Tivarovsky 2022-09-18 21 minute read

Motivation

I recently completed several projects that involved cleaning lots of messy data, in particular, free text strings that I needed to organize into tidy variables and text-mine for insights. To do this efficiently, I learned how to use regular expressions (regexps), as suggested by Hadley Wickham and Garrett Grolemund’s in R for Data Science.

As the authors suggest, it takes a little while to learn the rules of regexps and get comfortable using them, but once you do, you’ll see they’re a powerful tool to have in your data cleaning toolbox. The best resources I found for learning and practicing regexps are the RStudio stringr cheatsheet (found here) and this handy interactive program.

Below is a bit of practice with stringr and regexps using Wikipedia’s list of tallest people on record. I realize this is not exactly a health topic, but it’s a simple motivating example that can be leveraged to any messy dataset containing strings. We’ll be using the rvest package to scrape an html table from the Wikipedia page, as outlined here and here.

Import

The original table looks like this:

The steps below scrape the data using rvest.

# wikipedia source url
url <- 'https://en.wikipedia.org/wiki/List_of_tallest_people'

# specify table
tall_table <- 
  url %>% 
  read_html() %>% 
  html_nodes(xpath = ('//*[@id="mw-content-text"]/div[1]/table[2]')) %>% 
  html_table()
               
tall_table <- 
  tall_table[[1]] %>% 
  janitor::clean_names()

tall_table %>% 
  select(c(metric, lifespan_age_at_death, imperial, note))

## # A tibble: 143 × 4
##    metric   lifespan_age_at_death imperial     note                             
##    <chr>    <chr>                 <chr>        <chr>                            
##  1 272 cm   1918–1940 (22)        8 ft 11.1 in Tallest human in recorded histor…
##  2 269 cm   1989–2015 (26)        8 ft 10 in   Was not measured officially by G…
##  3 268 cm   1986–2007 (21)        8 ft 9.5 in  Was not measured officially by G…
##  4 267 cm   1868–1905 (37)        8 ft 9 in    Second tallest male in recorded …
##  5 263.5 cm 1932–1969 (37)        8 ft 7.75 in 8 ft 0 in (244 cm) standing heig…
##  6 262 cm   1924–1943 (18)        8 ft 7 in    Billed at 8 ft 7 in (262 cm).[16]
##  7 262 cm   1906–1952 (46)        8 ft 7 in    Jacob Rheuben Ehrlich. American …
##  8 258 cm   1860–1887 (27)        8 ft 5.5 in  Tallest human in European histor…
##  9 258 cm   1970–2014 (44)        8 ft 5.5 in  Not officially recognized by Gui…
## 10 257 cm   1984–2019 (34)        8 ft 5.25 in After he died he was measured po…
## # … with 133 more rows

Clearly this requires some data cleaning. Specifically, we need to address the following problems:
- The variables metric and imperial mix numbers and text to express height in their respective units
- The lifespan_age_at_death variable combines three numeric variables into one column
- The note columns often contains reference numbers in brackets

Tidy

1. The `metric` Variable

First we’ll clean up metric by removing the “cm” on the end of every observation, converting it to a numeric variable, and updating the variable name.

# remove "cm"
tall_table_1 <- 
  tall_table %>% 
  mutate(metric = str_replace(metric, "cm", ""))

head(tall_table_1$metric)

## [1] "272 "   "269 "   "268 "   "267 "   "263.5 " "262 "

Here we run into our first problem: trailing blanks. When we removed “cm” from the strings, we created phantom blanks, as evidenced by the space between the last digit and the end quote. These will prevent us from cleanly coercing the characters to numbers:

tall_table_test <- 
  tall_table_1 %>% 
  mutate(metric = as.numeric(metric))

head(tall_table_test$metric)

## [1] NA NA NA NA NA NA

This is very easy to fix with str_trim:

# trim trailing blanks
tall_table_2 <- 
  tall_table_1 %>% 
  mutate(metric = str_trim(metric, side = "both"), 
         metric = as.numeric(metric)) %>% 
  rename(ht_cm = metric)
  
tall_table_2$ht_cm

##   [1] 272.00 269.00 268.00 267.00 263.50 262.00 262.00 258.00 258.00 257.00
##  [11] 256.00 255.00 251.40 251.00 250.00 248.90 248.90 246.00 246.00 246.00
##  [21] 245.70 245.00 245.00 244.00 244.00 244.00 244.00 244.00 244.00 244.00
##  [31] 242.00 242.00 242.00 242.00 241.00 241.00 240.00 240.00 240.00 239.00
##  [41] 238.00 238.00 237.00 236.00 236.00 236.00 236.00 236.00 236.00 236.00
##  [51] 236.00 232.00 236.00 236.17 236.10 232.00 235.90 233.50 235.00 235.00
##  [61] 235.00 235.00 235.00 235.00 234.00 234.00 234.00 234.00 234.00 234.00
##  [71] 234.00 234.00 234.00 234.00 234.00 233.00 233.00 233.00 233.00 233.00
##  [81] 231.00 233.00 231.00 231.00 231.00 231.00 231.00 231.00 231.00 231.00
##  [91] 231.00 230.00 230.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00
## [101] 229.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00 229.00
## [111] 229.00 228.00 228.00 228.00 228.00 228.00 227.00 226.00 226.00 226.00
## [121] 226.00 226.00 226.00 225.00 225.00 225.00 225.00 224.00 224.00 223.00
## [131] 222.00 222.00 222.00 221.00 221.00 221.00 221.00 221.00 218.00 218.00
## [141] 218.00 216.00 210.00

Now we move on to the imperial variable, which takes on the following (un-tidy) values:

tall_table_2$imperial

##   [1] "8 ft 11.1 in"                                                                                                                                                                                                                                                                                                                                           
##   [2] "8 ft 10 in"                                                                                                                                                                                                                                                                                                                                             
##   [3] "8 ft 9.5 in"                                                                                                                                                                                                                                                                                                                                            
##   [4] "8 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##   [5] "8 ft 7.75 in"                                                                                                                                                                                                                                                                                                                                           
##   [6] "8 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##   [7] "8 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##   [8] "8 ft 5.5 in"                                                                                                                                                                                                                                                                                                                                            
##   [9] "8 ft 5.5 in"                                                                                                                                                                                                                                                                                                                                            
##  [10] "8 ft 5.25 in"                                                                                                                                                                                                                                                                                                                                           
##  [11] "8 ft 4.8 in"                                                                                                                                                                                                                                                                                                                                            
##  [12] "8 ft 4.5 in"                                                                                                                                                                                                                                                                                                                                            
##  [13] "8 ft 3 in"                                                                                                                                                                                                                                                                                                                                              
##  [14] "8 ft 2.8 in[25]"                                                                                                                                                                                                                                                                                                                                        
##  [15] "8 ft 2.4 in"                                                                                                                                                                                                                                                                                                                                            
##  [16] "8 ft 2 in"                                                                                                                                                                                                                                                                                                                                              
##  [17] "8 ft 2 in"                                                                                                                                                                                                                                                                                                                                              
##  [18] "8 ft 1 in[29]"                                                                                                                                                                                                                                                                                                                                          
##  [19] "8 ft 1 in"                                                                                                                                                                                                                                                                                                                                              
##  [20] "8 ft 1 in"                                                                                                                                                                                                                                                                                                                                              
##  [21] "8 ft 0.75 in"                                                                                                                                                                                                                                                                                                                                           
##  [22] "8 ft 0.5 in"                                                                                                                                                                                                                                                                                                                                            
##  [23] "8 ft 0.4 in"                                                                                                                                                                                                                                                                                                                                            
##  [24] "8 ft 0 in-8 ft 1 in"                                                                                                                                                                                                                                                                                                                                    
##  [25] "8 ft 0 in"                                                                                                                                                                                                                                                                                                                                              
##  [26] "8 ft 0 in"                                                                                                                                                                                                                                                                                                                                              
##  [27] "8 ft 0 in"                                                                                                                                                                                                                                                                                                                                              
##  [28] "8 ft 0 in"                                                                                                                                                                                                                                                                                                                                              
##  [29] "8 ft 0 in"                                                                                                                                                                                                                                                                                                                                              
##  [30] "8 ft 0 in"                                                                                                                                                                                                                                                                                                                                              
##  [31] "7 ft 11.5 in"                                                                                                                                                                                                                                                                                                                                           
##  [32] "7 ft 11.5 in"                                                                                                                                                                                                                                                                                                                                           
##  [33] "7 ft 11.5 in"                                                                                                                                                                                                                                                                                                                                           
##  [34] "7 ft 11.5 in"                                                                                                                                                                                                                                                                                                                                           
##  [35] "7 ft 11 in"                                                                                                                                                                                                                                                                                                                                             
##  [36] "7 ft 11 in"                                                                                                                                                                                                                                                                                                                                             
##  [37] "7 ft 10.5 in"                                                                                                                                                                                                                                                                                                                                           
##  [38] "7 ft 10.5 in"                                                                                                                                                                                                                                                                                                                                           
##  [39] "7 ft 10.5 in"                                                                                                                                                                                                                                                                                                                                           
##  [40] "7 ft 10 in"                                                                                                                                                                                                                                                                                                                                             
##  [41] "7 ft 10 in"                                                                                                                                                                                                                                                                                                                                             
##  [42] "7 ft 10 in"                                                                                                                                                                                                                                                                                                                                             
##  [43] "7 ft 9.3 in"                                                                                                                                                                                                                                                                                                                                            
##  [44] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [45] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [46] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [47] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [48] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [49] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [50] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [51] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [52] "7 ft 7.3 in"                                                                                                                                                                                                                                                                                                                                            
##  [53] "7 ft 9 in"                                                                                                                                                                                                                                                                                                                                              
##  [54] "7 ft 8.98 in"                                                                                                                                                                                                                                                                                                                                           
##  [55] "7 ft 8.95 in"                                                                                                                                                                                                                                                                                                                                           
##  [56] "7 ft 7¼ in"                                                                                                                                                                                                                                                                                                                                             
##  [57] "7 ft 8.87 in"                                                                                                                                                                                                                                                                                                                                           
##  [58] "7 ft .mw-parser-output .frac{white-space:nowrap}.mw-parser-output .frac .num,.mw-parser-output .frac .den{font-size:80%;line-height:0;vertical-align:super}.mw-parser-output .frac .den{vertical-align:sub}.mw-parser-output .sr-only{border:0;clip:rect(0,0,0,0);height:1px;margin:-1px;overflow:hidden;padding:0;position:absolute;width:1px}8+1⁄3 in"
##  [59] "7 ft 81⁄2 in"                                                                                                                                                                                                                                                                                                                                           
##  [60] "7 ft 81⁄2 in"                                                                                                                                                                                                                                                                                                                                           
##  [61] "7 ft 8+1⁄2 in"                                                                                                                                                                                                                                                                                                                                          
##  [62] "7 ft 8+1⁄2 in"                                                                                                                                                                                                                                                                                                                                          
##  [63] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [64] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [65] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [66] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [67] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [68] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [69] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [70] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [71] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [72] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [73] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [74] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [75] "7 ft 8 in"                                                                                                                                                                                                                                                                                                                                              
##  [76] "7 ft 7.7 in"                                                                                                                                                                                                                                                                                                                                            
##  [77] "7 ft 7.7 in"                                                                                                                                                                                                                                                                                                                                            
##  [78] "7 ft 7+1⁄2 in"                                                                                                                                                                                                                                                                                                                                          
##  [79] "7 ft 7.4 in"                                                                                                                                                                                                                                                                                                                                            
##  [80] "7 ft 7.4 in"                                                                                                                                                                                                                                                                                                                                            
##  [81] "7 ft 7.26 in"                                                                                                                                                                                                                                                                                                                                           
##  [82] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [83] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [84] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [85] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [86] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [87] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [88] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [89] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [90] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [91] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [92] "7 ft 7 in"                                                                                                                                                                                                                                                                                                                                              
##  [93] "7 ft 6.75 in"                                                                                                                                                                                                                                                                                                                                           
##  [94] "7 ft 6.25 in"                                                                                                                                                                                                                                                                                                                                           
##  [95] "7 ft 6.25 in"                                                                                                                                                                                                                                                                                                                                           
##  [96] "7 ft 6+1⁄4 in"                                                                                                                                                                                                                                                                                                                                          
##  [97] "7 ft 6+1⁄5 in"                                                                                                                                                                                                                                                                                                                                          
##  [98] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
##  [99] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [100] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [101] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [102] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [103] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [104] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [105] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [106] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [107] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [108] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [109] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [110] "7 ft 6 in"                                                                                                                                                                                                                                                                                                                                              
## [111] "7 ft 5.8 in"                                                                                                                                                                                                                                                                                                                                            
## [112] "7 ft 5+2⁄3 in"                                                                                                                                                                                                                                                                                                                                          
## [113] "7 ft 5+2⁄3 in"                                                                                                                                                                                                                                                                                                                                          
## [114] "7 ft 5+2⁄3 in"                                                                                                                                                                                                                                                                                                                                          
## [115] "7 ft 5+2⁄3 in"                                                                                                                                                                                                                                                                                                                                          
## [116] "7 ft 5+2⁄3 in"                                                                                                                                                                                                                                                                                                                                          
## [117] "7 ft 5+1⁄3"                                                                                                                                                                                                                                                                                                                                             
## [118] "7 ft 5 in"                                                                                                                                                                                                                                                                                                                                              
## [119] "7 ft 5 in"                                                                                                                                                                                                                                                                                                                                              
## [120] "7 ft 5 in"                                                                                                                                                                                                                                                                                                                                              
## [121] "7 ft 5 in"                                                                                                                                                                                                                                                                                                                                              
## [122] "7 ft 5 in"                                                                                                                                                                                                                                                                                                                                              
## [123] "7 ft 5 in"                                                                                                                                                                                                                                                                                                                                              
## [124] "7 ft 4.5 in"                                                                                                                                                                                                                                                                                                                                            
## [125] "7 ft 4+1⁄2 in"                                                                                                                                                                                                                                                                                                                                          
## [126] "7 ft 4+1⁄2 in"                                                                                                                                                                                                                                                                                                                                          
## [127] "7 ft 4.5 in"                                                                                                                                                                                                                                                                                                                                            
## [128] "7 ft 4 in"                                                                                                                                                                                                                                                                                                                                              
## [129] "7 ft 4 in"                                                                                                                                                                                                                                                                                                                                              
## [130] "7 ft 4 in"                                                                                                                                                                                                                                                                                                                                              
## [131] "7 ft 3.4 in"                                                                                                                                                                                                                                                                                                                                            
## [132] "7 ft 3.5 in"                                                                                                                                                                                                                                                                                                                                            
## [133] "7 ft 3.5 in"                                                                                                                                                                                                                                                                                                                                            
## [134] "7 ft 3 in"                                                                                                                                                                                                                                                                                                                                              
## [135] "7 ft 3 in"                                                                                                                                                                                                                                                                                                                                              
## [136] "7 ft 3 in"                                                                                                                                                                                                                                                                                                                                              
## [137] "7 ft 3 in"                                                                                                                                                                                                                                                                                                                                              
## [138] "7 ft 3 in"                                                                                                                                                                                                                                                                                                                                              
## [139] "7 ft 2 in"                                                                                                                                                                                                                                                                                                                                              
## [140] "7 ft 2 in"                                                                                                                                                                                                                                                                                                                                              
## [141] "7 ft 2 in"                                                                                                                                                                                                                                                                                                                                              
## [142] "7 ft 1 in"                                                                                                                                                                                                                                                                                                                                              
## [143] "7 ft 0 in"

We’ve got numbers with brackets (links to footnotes in the original Wikipedia page), we’ve got some inches measures that look like “1.5 inches”, some that look like “1+1/2 in”, we have some that drop the plus (“81/2 in”, which should probably read “8 1/2 in”), and then we have one observation that seems to have carried over some style code.

First we’ll fix the style code observation by manually overwriting the value using the original table. We’ll also manually fix the observation that lists a height range (“8 ft 0 in-8 ft 1 in”) since for our purposes, there’s not much difference between an 8-foot person and an 8-foot-one person - they’re both very, very tall.

# manual fixes
tall_table_3 <- 
  tall_table_2 %>% 
  mutate(imperial = 
           case_when(name == "Rafael França do Nascimento" ~ "7 ft 8+1⁄3 in", 
                     name == "James Toller" ~ "8 ft 0 in", 
                     TRUE ~ imperial) , 
         imperial = str_trim(imperial, side = "both")
  )

tall_table_3$imperial

##   [1] "8 ft 11.1 in"    "8 ft 10 in"      "8 ft 9.5 in"     "8 ft 9 in"      
##   [5] "8 ft 7.75 in"    "8 ft 7 in"       "8 ft 7 in"       "8 ft 5.5 in"    
##   [9] "8 ft 5.5 in"     "8 ft 5.25 in"    "8 ft 4.8 in"     "8 ft 4.5 in"    
##  [13] "8 ft 3 in"       "8 ft 2.8 in[25]" "8 ft 2.4 in"     "8 ft 2 in"      
##  [17] "8 ft 2 in"       "8 ft 1 in[29]"   "8 ft 1 in"       "8 ft 1 in"      
##  [21] "8 ft 0.75 in"    "8 ft 0.5 in"     "8 ft 0.4 in"     "8 ft 0 in"      
##  [25] "8 ft 0 in"       "8 ft 0 in"       "8 ft 0 in"       "8 ft 0 in"      
##  [29] "8 ft 0 in"       "8 ft 0 in"       "7 ft 11.5 in"    "7 ft 11.5 in"   
##  [33] "7 ft 11.5 in"    "7 ft 11.5 in"    "7 ft 11 in"      "7 ft 11 in"     
##  [37] "7 ft 10.5 in"    "7 ft 10.5 in"    "7 ft 10.5 in"    "7 ft 10 in"     
##  [41] "7 ft 10 in"      "7 ft 10 in"      "7 ft 9.3 in"     "7 ft 9 in"      
##  [45] "7 ft 9 in"       "7 ft 9 in"       "7 ft 9 in"       "7 ft 9 in"      
##  [49] "7 ft 9 in"       "7 ft 9 in"       "7 ft 9 in"       "7 ft 7.3 in"    
##  [53] "7 ft 9 in"       "7 ft 8.98 in"    "7 ft 8.95 in"    "7 ft 7¼ in"     
##  [57] "7 ft 8.87 in"    "7 ft 8+1⁄3 in"   "7 ft 81⁄2 in"    "7 ft 81⁄2 in"   
##  [61] "7 ft 8+1⁄2 in"   "7 ft 8+1⁄2 in"   "7 ft 8 in"       "7 ft 8 in"      
##  [65] "7 ft 8 in"       "7 ft 8 in"       "7 ft 8 in"       "7 ft 8 in"      
##  [69] "7 ft 8 in"       "7 ft 8 in"       "7 ft 8 in"       "7 ft 8 in"      
##  [73] "7 ft 8 in"       "7 ft 8 in"       "7 ft 8 in"       "7 ft 7.7 in"    
##  [77] "7 ft 7.7 in"     "7 ft 7+1⁄2 in"   "7 ft 7.4 in"     "7 ft 7.4 in"    
##  [81] "7 ft 7.26 in"    "7 ft 7 in"       "7 ft 7 in"       "7 ft 7 in"      
##  [85] "7 ft 7 in"       "7 ft 7 in"       "7 ft 7 in"       "7 ft 7 in"      
##  [89] "7 ft 7 in"       "7 ft 7 in"       "7 ft 7 in"       "7 ft 7 in"      
##  [93] "7 ft 6.75 in"    "7 ft 6.25 in"    "7 ft 6.25 in"    "7 ft 6+1⁄4 in"  
##  [97] "7 ft 6+1⁄5 in"   "7 ft 6 in"       "7 ft 6 in"       "7 ft 6 in"      
## [101] "7 ft 6 in"       "7 ft 6 in"       "7 ft 6 in"       "7 ft 6 in"      
## [105] "7 ft 6 in"       "7 ft 6 in"       "7 ft 6 in"       "7 ft 6 in"      
## [109] "7 ft 6 in"       "7 ft 6 in"       "7 ft 5.8 in"     "7 ft 5+2⁄3 in"  
## [113] "7 ft 5+2⁄3 in"   "7 ft 5+2⁄3 in"   "7 ft 5+2⁄3 in"   "7 ft 5+2⁄3 in"  
## [117] "7 ft 5+1⁄3"      "7 ft 5 in"       "7 ft 5 in"       "7 ft 5 in"      
## [121] "7 ft 5 in"       "7 ft 5 in"       "7 ft 5 in"       "7 ft 4.5 in"    
## [125] "7 ft 4+1⁄2 in"   "7 ft 4+1⁄2 in"   "7 ft 4.5 in"     "7 ft 4 in"      
## [129] "7 ft 4 in"       "7 ft 4 in"       "7 ft 3.4 in"     "7 ft 3.5 in"    
## [133] "7 ft 3.5 in"     "7 ft 3 in"       "7 ft 3 in"       "7 ft 3 in"      
## [137] "7 ft 3 in"       "7 ft 3 in"       "7 ft 2 in"       "7 ft 2 in"      
## [141] "7 ft 2 in"       "7 ft 1 in"       "7 ft 0 in"

Next, we’ll remove the un-needed reference numbers in brackets. We can do that manually, too, but it’s not exactly a big-data approach, so let’s use regular expressions (regexps) instead.

The challenge here is that brackets are a special character in regex, so we’ll need to “escape” them (meaning, get R to understand that we mean a literal bracket). The stringr cheat sheet serves as a great summary. You’ll notice that to escape a bracket, you need a double backslash in front of it.

Here’s a demo of how it works:

# demonstrate regexps
tall_table_3 %>% 
  mutate(test = str_extract(imperial, "\\[\\d*\\]")) %>% 
  filter(str_detect(imperial, "\\[\\d*\\]")) %>% 
  select(imperial, test)

## # A tibble: 2 × 2
##   imperial        test 
##   <chr>           <chr>
## 1 8 ft 2.8 in[25] [25] 
## 2 8 ft 1 in[29]   [29]

tall_table_4 <- 
  tall_table_3 %>% 
  mutate(imperial = str_remove(imperial, "\\[\\d*\\]"))

Next, we need to separate the feet and inches values so we can convert to a value like XX.XX inches, allowing us to do numerical manipulation on the value. The pattern in imperial is now “X ft XX in”, where the inches are expressed in different ways (1/2, 0.5). Let’s do inches first. We want to extract out the number that lies between “ft” and “in”. For this, we can use “look around” regexps, that find a pattern preceding or following another pattern. Roughly speaking:
- “a(?=c)” means “a” followed by “c”
- “(?<=b)a” means “a” preceded by “b”
- Note that “[:graph:]” means any letter, number, or punctuation, and the asterisk after it means repeated zero ore more times.

# demo
tall_table_4 %>% 
  mutate(inches = str_extract(imperial, "(?<=ft)\\s[:graph:]*\\s(?=in)")) %>% 
  select(imperial, inches)

## # A tibble: 143 × 2
##    imperial     inches  
##    <chr>        <chr>   
##  1 8 ft 11.1 in " 11.1 "
##  2 8 ft 10 in   " 10 "  
##  3 8 ft 9.5 in  " 9.5 " 
##  4 8 ft 9 in    " 9 "   
##  5 8 ft 7.75 in " 7.75 "
##  6 8 ft 7 in    " 7 "   
##  7 8 ft 7 in    " 7 "   
##  8 8 ft 5.5 in  " 5.5 " 
##  9 8 ft 5.5 in  " 5.5 " 
## 10 8 ft 5.25 in " 5.25 "
## # … with 133 more rows

# extract inches value and remove everything from "ft" onward  
tall_table_5 <- 
  tall_table_4 %>% 
  mutate(inches = str_extract(imperial, "(?<=ft)\\s[:graph:]*\\s(?=in)"),
         imperial = str_remove(imperial, "\\s(ft)\\s[:graph:]*\\s[:graph:]*")) %>% 
  rename(ht_ft = imperial)

# trim white-space
tall_table_6 <- 
  tall_table_5 %>% 
  mutate(inches = str_trim(inches, side = c("both")), 
         ht_ft = str_trim(ht_ft, side = c("both")))

Now, we get to the most challenging part: transforming inches into a numerical variable. This means converting numbers written as fractions into decimals, while not transforming the already-clean decimal values. The simplest way to do this is probably to find a forward-slash (/), and find the numbers immediately before and after it. After that, we’ll separate out the numbers and divide them to arrive at a final, numerical inches_decimal.

# clean inches variable
tall_table_7 <- 
  tall_table_6 %>% 
  mutate(inches_fraction = str_extract(inches, "\\d\\⁄\\d"), 
         inches = str_remove(inches, "\\d\\⁄\\d"), 
         inches = str_remove(inches, "\\+"), 
         inches = str_trim(inches, side = c("both"))) %>% 
  mutate(inches_1 = as.numeric(str_sub(inches_fraction, 1, 1)), 
         inches_2 = as.numeric(str_sub(inches_fraction, -1, -1)), 
         inches_decimal = round((inches_1/inches_2), digits = 2))

# demo
tall_table_7 %>% 
  filter(!is.na(inches_decimal)) %>% 
  select(inches, inches_fraction, inches_1, inches_2, inches_decimal)

## # A tibble: 15 × 5
##    inches inches_fraction inches_1 inches_2 inches_decimal
##    <chr>  <chr>              <dbl>    <dbl>          <dbl>
##  1 8      1⁄3                    1        3           0.33
##  2 8      1⁄2                    1        2           0.5 
##  3 8      1⁄2                    1        2           0.5 
##  4 8      1⁄2                    1        2           0.5 
##  5 8      1⁄2                    1        2           0.5 
##  6 7      1⁄2                    1        2           0.5 
##  7 6      1⁄4                    1        4           0.25
##  8 6      1⁄5                    1        5           0.2 
##  9 5      2⁄3                    2        3           0.67
## 10 5      2⁄3                    2        3           0.67
## 11 5      2⁄3                    2        3           0.67
## 12 5      2⁄3                    2        3           0.67
## 13 5      2⁄3                    2        3           0.67
## 14 4      1⁄2                    1        2           0.5 
## 15 4      1⁄2                    1        2           0.5

Finally, we will add inches to inches_decimal to arrive at the final portion of height in inches. We’ll also address the outlier “7¼” inches. Then, we’ll convert the now-clean inches to feet, and add it to ht_ft.

tall_table_8 <- 
  tall_table_7 %>% 
  mutate(inches = if_else(inches == "7¼", "7.25", inches), 
         inches = as.numeric(inches), 
         inches = if_else(is.na(inches), 0, inches), 
         in_to_feet = round((inches/12), digits = 2), 
         ht_ft = if_else(name ==  "Jeison Rodríguez", "7.44", ht_ft),
         ht_ft = round(as.numeric(ht_ft), digits = 2), 
         ht_ft = ht_ft + in_to_feet
  ) %>% 
  select(-c(inches_fraction, inches, inches_1, inches_2, inches_decimal, in_to_feet))

Now we can do a fun exercise - we can check our work by directly converting ht_cm to height in feet (one foot is 30.48 cm), and seeing if all the wrangling we underwent with inches and feet gave us correct numbers.

# test whether calculated value matches string extract
test <-
  tall_table_8 %>% 
  select(ht_ft, ht_cm) %>% 
  mutate(check_ht = round(ht_ft*30.48, digits = 2))

test

## # A tibble: 143 × 3
##    ht_ft ht_cm check_ht
##    <dbl> <dbl>    <dbl>
##  1  8.92  272      272.
##  2  8.83  269      269.
##  3  8.79  268      268.
##  4  8.75  267      267.
##  5  8.65  264.     264.
##  6  8.58  262      262.
##  7  8.58  262      262.
##  8  8.46  258      258.
##  9  8.46  258      258.
## 10  8.44  257      257.
## # … with 133 more rows

We’re not dead on, but we got pretty close. The reason for the error has a lot more to do with rounding in the original table than our text-parsing. And after all that work, our data-frame is tidy-ish.

2. The `lifespan_age_at_death` Variable

This is what the lifespan_age_at_death column looks like now:

tall_table_8 %>% 
  select(lifespan_age_at_death)

## # A tibble: 143 × 1
##    lifespan_age_at_death
##    <chr>                
##  1 1918–1940 (22)       
##  2 1989–2015 (26)       
##  3 1986–2007 (21)       
##  4 1868–1905 (37)       
##  5 1932–1969 (37)       
##  6 1924–1943 (18)       
##  7 1906–1952 (46)       
##  8 1860–1887 (27)       
##  9 1970–2014 (44)       
## 10 1984–2019 (34)       
## # … with 133 more rows

We want the year of birth, year of death, and age at death to be their own variables. Scanning the values, there are several different scenarios:
1. Birth year - death year (YYYY-YYYY) followed by (age or age range)
2. Range for birth year and death year (YYYY/YY - YYYY/YY)
3. “born” followed by a year of birth (YYYY) and no death year
4. Some text followed by “?” followed by a year
5. Year of birth
6. “born” followed by a range and some text
7. Some entries contain a reference number in brackets

First we’ll remove brackets and anything between them, addressing scenario 7. Then we’ll populate an age_at_death variable by extracting anything in parentheses.

# define regular expression for any string between parentheses, not including a parenthesis
regex_bet_paren <- "\\([^)]+\\)"

tall_table_9 <- 
  tall_table_8 %>% 
  mutate(lifespan_age_at_death = str_remove_all(lifespan_age_at_death, "\\[\\d*\\]"), 
         age_at_death = str_extract(lifespan_age_at_death, regex_bet_paren), 
         lifespan_age_at_death = str_remove_all(lifespan_age_at_death, regex_bet_paren),
         age_at_death = str_remove_all(age_at_death, "\\(|\\)"),  
         lifespan_age_at_death = str_trim(lifespan_age_at_death, side = "both"),
         age_at_death = str_trim(age_at_death, side = "both"), 
         age_at_death = if_else(age_at_death == "aged 24 as of March 2022", "", age_at_death)
  )

Next we separate birth year and death year by the hyphen delimiter. Note the regexp meanings:
- \\d* means any quantity of consecutive digits
- (?=–) means something followed by a hyphen
- (?<=–) means something following a hyphen

tall_table_10 <- 
  tall_table_9 %>% 
  mutate(birth_year = str_extract(lifespan_age_at_death, "\\d*(?=–)"), 
         death_year = str_extract(lifespan_age_at_death, "(?<=–)\\d*"))

tall_table_10

## # A tibble: 143 × 9
##    country       ht_cm ht_ft name          note  lifes…¹ age_a…² birth…³ death…⁴
##    <chr>         <dbl> <dbl> <chr>         <chr> <chr>   <chr>   <chr>   <chr>  
##  1 United States  272   8.92 Robert Wadlow Tall… 1918–1… 22      1918    1940   
##  2 Thailand       269   8.83 Pornchai Sao… Was … 1989–2… 26      1989    2015   
##  3 India          268   8.79 Vikas Uppal   Was … 1986–2… 21      1986    2007   
##  4 United States  267   8.75 John Rogan    Seco… 1868–1… 37      1868    1905   
##  5 United States  264.  8.65 John F. Carr… 8 ft… 1932–1… 37      1932    1969   
##  6 United States  262   8.58 Willie Camper Bill… 1924–1… 18      1924    1943   
##  7 United States  262   8.58 Jack Earle    Jaco… 1906–1… 46      1906    1952   
##  8 Austria        258   8.46 Franz Winkel… Tall… 1860–1… 27      1860    1887   
##  9 Ukraine        258   8.46 Leonid Stadn… Not … 1970–2… 44      1970    2014   
## 10 Vietnam        257   8.44 Hồ Văn Trung  Afte… 1984–2… 34      1984    2019   
## # … with 133 more rows, and abbreviated variable names ¹lifespan_age_at_death,
## #   ²age_at_death, ³birth_year, ⁴death_year

Next we’ll address scenario 3: “born” followed by a year of birth (YYYY) and no death year. We’ll also make some manual fixes for a few observtions that don’t translate using our rules.

# populate birth year using a number preceded by "born"
tall_table_11 <- 
  tall_table_10 %>% 
  mutate(birth_year = if_else(is.na(birth_year), str_extract(lifespan_age_at_death, "(?<=born )\\d*"), birth_year))

# manual fixes
tall_table_12 <- 
  tall_table_11 %>% 
  mutate(birth_year = case_when(lifespan_age_at_death == "1958-1992" ~ "1958",
                                lifespan_age_at_death == "c. born 1995" ~ "1995",
                                lifespan_age_at_death == "1844-1899" ~"1844", 
                                lifespan_age_at_death == "1946-1993" ~"1946",
                                lifespan_age_at_death == "1880/85 – 1925/30" ~ "1880",
                                TRUE ~ birth_year), 
         death_year = case_when(lifespan_age_at_death == "1958-1992" ~ "1992", 
                                lifespan_age_at_death == "fl.?-March 2019" ~ "2019", 
                                lifespan_age_at_death == "1844-1899" ~ "1899",
                                lifespan_age_at_death == "1946-1993" ~ "1993", 
                                lifespan_age_at_death == "1880/85 – 1925/30" ~ "1925", TRUE ~ death_year)) %>% 
  mutate(birth_year = as.numeric(birth_year), 
         death_year = as.numeric(death_year)) %>% 
  filter(!lifespan_age_at_death == "173–238")


# check
tall_table_12 %>% 
  select(lifespan_age_at_death, birth_year, death_year) %>% 
  filter(is.na(birth_year) | is.na(death_year))

## # A tibble: 65 × 3
##    lifespan_age_at_death birth_year death_year
##    <chr>                      <dbl>      <dbl>
##  1 born 1982                   1982         NA
##  2 born 1982                   1982         NA
##  3 born 1987                   1987         NA
##  4 born 1983                   1983         NA
##  5 1835–?                      1835         NA
##  6 born 1987                   1987         NA
##  7 born 1994                   1994         NA
##  8 born 1999                   1999         NA
##  9 fl. ?–1596                    NA       1596
## 10 born 1966                   1966         NA
## # … with 55 more rows

3. The `note` Variable

Finally, we’ll clean up note by removing anything between brackets.

# clean note variable
tall_table_13 <- 
  tall_table_12 %>% 
  mutate(note = str_remove_all(note, "\\[\\d*\\]"))

tall_table_13

## # A tibble: 142 × 9
##    country       ht_cm ht_ft name          note  lifes…¹ age_a…² birth…³ death…⁴
##    <chr>         <dbl> <dbl> <chr>         <chr> <chr>   <chr>     <dbl>   <dbl>
##  1 United States  272   8.92 Robert Wadlow Tall… 1918–1… 22         1918    1940
##  2 Thailand       269   8.83 Pornchai Sao… Was … 1989–2… 26         1989    2015
##  3 India          268   8.79 Vikas Uppal   Was … 1986–2… 21         1986    2007
##  4 United States  267   8.75 John Rogan    Seco… 1868–1… 37         1868    1905
##  5 United States  264.  8.65 John F. Carr… 8 ft… 1932–1… 37         1932    1969
##  6 United States  262   8.58 Willie Camper Bill… 1924–1… 18         1924    1943
##  7 United States  262   8.58 Jack Earle    Jaco… 1906–1… 46         1906    1952
##  8 Austria        258   8.46 Franz Winkel… Tall… 1860–1… 27         1860    1887
##  9 Ukraine        258   8.46 Leonid Stadn… Not … 1970–2… 44         1970    2014
## 10 Vietnam        257   8.44 Hồ Văn Trung  Afte… 1984–2… 34         1984    2019
## # … with 132 more rows, and abbreviated variable names ¹lifespan_age_at_death,
## #   ²age_at_death, ³birth_year, ⁴death_year

Conclusion

That’s more or less all there is to it. If you’re working with big data, it’ll be a lot more challenging to account for all the patterns. But this example gives us an entry point to regular expressions and using them within the context of stringr.

You can also learn a lot more about working with strings here: Handling Strings with R.

Alice Tivarovsky