Welcome to the collect data tutorial of the Practice R book (Treischl 2023). Practice R is a textbook for the social sciences that provides several tutorials to help students learn R. Feel free to inspect the tutorials even if you are not familiar with the book, but keep in mind that they are meant to complement the Practice R book.
In Chapter 11, we extracted data from a PDF, I outlined the basics of web scraping, and we got in touch with APIs. As outlined, collecting data offers unique opportunities for applied empirical research, but it can be very tricky; web scraping in particular quickly becomes complicated.
Regardless of the approach to collecting data, I introduced the stringr package and its main functions before we extracted information from a PDF file and worked with unstructured data from HTML files (Wickham 2022c). To give you a compact overview of the many str_* functions, this tutorial is dedicated to the stringr package: We recap the functions introduced in the chapter and explore further ways to handle strings. The next console shows data and fictitious email addresses of people you may know from the Netflix series Stranger Things. Never mind if you are not familiar with the series; we will use the character variables, such as the email addresses, to work with stringr.
# Libs for Tutorial 11
library(purrr)
library(stringi)
library(stringr)
# The stranger things example data
head(sf_data)
#> # A tibble: 6 × 5
#> character firstname lastname year email
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 Eleven Millie Bobby Brown 2004 eleven@HawkinsLab.com
#> 2 Dustin Henderson Gaten Matarazzo 2002 Dustin.Henderson@gmx.com
#> 3 Will Byers Noah Schnapp 2004 byers-castle@gmx.com
#> 4 Erica Sinclair Priah Ferguson 2006 Erica-Sinclair1@aol.com
#> 5 Martin Brenner Matthew Modine 1959 MBrenner@HawkinsLab.com2
#> 6 Jim Hopper David Harbour 1975 jim.hopper@hawkinspd.com
The stringr package increases your string powers tremendously, but we need to keep up with many str_* functions and names. All you have to do is pick the “right” function in this tutorial. For the compact overview, we focus on the sections of the package cheat sheet: (1) We detect matches; (2) we mutate strings; (3) we subset strings; (4) we join and split strings; and (5) we order strings and manage their length.
9.1 Detect matches
Suppose we want to run an online survey, which is why we scraped the email addresses of our participants, such as the fictitious addresses in the Stranger Things data. Unfortunately, the strings contain some minor mistakes that need to be fixed.
Notice that some email addresses start (or end) with a number instead of a letter. These digits are not part of the email address but refer to footnotes on the web page from which we scraped the data. Suppose we do not know how widespread this problem is. Can you detect which addresses do not start (str_starts) or end (str_ends) with a letter?
# Does the string start with ...?
str_starts(emails, "[:alpha:]")
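The same check can be run for the end of a string with str_ends(). The following is a minimal sketch, assuming a small stand-in vector: emails_demo is made up for illustration from rows of the table above, plus one address with an assumed leading digit, and is not the tutorial's emails vector.

```r
library(stringr)

# Hypothetical stand-in for the tutorial's emails vector
emails_demo <- c(
  "eleven@HawkinsLab.com",
  "MBrenner@HawkinsLab.com2",  # trailing footnote digit
  "1jim.hopper@hawkinspd.com"  # leading footnote digit (assumed for illustration)
)

# TRUE if the string starts / ends with a letter
starts_ok <- str_starts(emails_demo, "[:alpha:]")
ends_ok <- str_ends(emails_demo, "[:alpha:]")

# negate = TRUE flips the question: which strings do NOT end with a letter?
bad_end <- str_ends(emails_demo, "[:alpha:]", negate = TRUE)

# Positions of all addresses that need cleaning
which(!starts_ok | !ends_ok)
#> [1] 2 3
```

Combining both logical vectors with which() gives the positions of the affected addresses without inspecting each result by eye.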
Some of the email addresses are private, while others belong to a company (e.g., HawkinsLab.com). If you need to know how many, use the str_count() function and sum the result. How many email addresses are from HawkinsLab.com?
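One way to approach this task is sketched next. It assumes a hypothetical vector emails_demo built from three rows of the table above; fixed() is used here so that the dot in the provider name is matched literally, which is a choice of this sketch and not necessarily how the book solves the task.

```r
library(stringr)

# Hypothetical stand-in for the tutorial's emails vector
emails_demo <- c(
  "eleven@HawkinsLab.com",
  "Dustin.Henderson@gmx.com",
  "MBrenner@HawkinsLab.com2"
)

# str_count() returns the number of matches per string (here 0 or 1);
# fixed() matches the text literally, so "." is not a regex wildcard
n_per_string <- str_count(emails_demo, fixed("@HawkinsLab.com"))

# Summing the counts gives the number of company addresses
n_hawkins <- sum(n_per_string)
n_hawkins
#> [1] 2
```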
The str_which() function is also handy: it returns the positions at which the search pattern occurs.
# And at which position?
str_which(emails, "@HawkinsLab.com")
#> [1] 1 5
Suppose we need to extract the user names because we want to include them in the email invitation for the survey. To extract the names, we first need to locate a position in the string. Use str_locate() to find where the @ sign appears, because it splits the string into the user name and the provider name.
# Locate a start and an end point (here @)
str_locate(emails, "@")
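To make the return value of str_locate() easier to picture, here is a minimal sketch with a hypothetical two-element vector (emails_demo is an assumption for illustration). The function returns an integer matrix with one row per string and the columns start and end of the first match.

```r
library(stringr)

# Hypothetical stand-in for the tutorial's emails vector
emails_demo <- c("eleven@HawkinsLab.com", "Dustin.Henderson@gmx.com")

# One row per string; "@" is a single character, so start equals end
at_pos <- str_locate(emails_demo, "@")
at_pos
#>      start end
#> [1,]     7   7
#> [2,]    17  17

# Columns can be accessed by name or by index
at_pos[, "start"]
```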
In the next step we will use the position of the @ sign to mutate the strings and to extract their user names.
9.2 Mutate strings
Let us first clean the email addresses. Remove the digits from strings that start or end with a number instead of a letter, which is clearly an error. Very similar to the str_replace() function, str_remove() searches the string, but it removes a match instead of replacing it. Can you still remember how to remove the digits from the beginning (^) and the end ($) of a string? Overwrite the emails vector and check if it worked.
# Remove strings
emails <- str_remove(emails, "^[:digit:]")
emails <- str_remove(emails, "[:digit:]$")
# Did it work?
emails
We could use the str_extract() function and our regex knowledge to extract the user names, but a regex is hard to build even for supposedly simple strings. The email addresses make this point clear: Each user name consists of one or several words; some have a separator between the first and the last name, some contain digits, and the user name ends before the @ sign. There is a much simpler solution to extract the user names, but nevertheless keep the str_view_all() function in mind when you build a regex (since stringr 1.5.0, str_view() takes over this job), because it displays the strings in the viewer pane and highlights matched characters.
Instead of building a regex, we can use the str_sub() function to create a vector with the user names only. The function needs the strings, a start point, and an end point to create the subset. For this purpose, we already located the positions of the @ sign with the str_locate() function. Thus, all user names run from the first position to the character before the @ sign. I copied the code to locate the @ sign and saved the results as x. Subset x to get a vector with the position of the @ sign, then subset the emails up to one character before it.
# Get and set substrings using their positions
x <- str_locate(emails, "@")
end <- x[, 1]
names <- str_sub(emails, 1, end - 1)
names
Further steps to manipulate the strings might be easier to apply if all users had used the same style for their user names. Use the str_replace() function and replace the dashes with points.
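A possible solution could look like the following sketch. The vector names_demo is hypothetical and only mimics the user names from the table above; fixed() is used so that the dash is treated as a literal character.

```r
library(stringr)

# Hypothetical user names in mixed styles
names_demo <- c("Dustin.Henderson", "byers-castle", "Erica-Sinclair1")

# str_replace() swaps the first match per string;
# str_replace_all() would swap every match
names_dotted <- str_replace(names_demo, fixed("-"), ".")
names_dotted
#> [1] "Dustin.Henderson" "byers.castle"     "Erica.Sinclair1"
```

Since each of these user names contains at most one dash, str_replace() is sufficient here.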
Depending on the purpose, it might also be useful to create a uniform formatting of the strings. Use one of the str_to_*() functions to make them lower, upper, or title case.
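The three case functions can be compared side by side. This is a small sketch with hypothetical, inconsistently formatted names (chars_demo is an assumption for illustration):

```r
library(stringr)

# Hypothetical names with inconsistent casing
chars_demo <- c("jim HOPPER", "Joyce byers")

lower <- str_to_lower(chars_demo)
upper <- str_to_upper(chars_demo)
title <- str_to_title(chars_demo)

lower
#> [1] "jim hopper"  "joyce byers"
upper
#> [1] "JIM HOPPER"  "JOYCE BYERS"
title
#> [1] "Jim Hopper"  "Joyce Byers"
```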
9.3 Subset strings
We used str_sub() to split strings by their position, but the str_subset() function lets us create a subset based on a search pattern. For example, consider all participants with a specific email account (e.g., gmx).
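Such a subset could be sketched as follows, assuming a hypothetical vector emails_demo taken from three rows of the table above:

```r
library(stringr)

# Hypothetical stand-in for the tutorial's emails vector
emails_demo <- c(
  "Dustin.Henderson@gmx.com",
  "byers-castle@gmx.com",
  "Erica-Sinclair1@aol.com"
)

# Keep only the strings that contain the literal text "@gmx."
gmx_emails <- str_subset(emails_demo, fixed("@gmx."))
gmx_emails
#> [1] "Dustin.Henderson@gmx.com" "byers-castle@gmx.com"
```

The result is the same as indexing the vector with str_detect(), that is, emails_demo[str_detect(emails_demo, fixed("@gmx."))].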
Furthermore, most of the time we use the str_detect() function to detect a pattern. The function shows us which inputs contain a specific pattern, so we can, for example, detect whether a string has no @ sign at all.
strings <- c("Dustin Henderson", "hop@gmx.com jim.hopper@hawkinspd.com", "Erica-Sinclair@aol.com", "nancy-wheeler92@gmx.com")
is_email <- "@"
# Detect a pattern
str_detect(strings, is_email)
#> [1] FALSE TRUE TRUE TRUE
We used this function to illustrate the first few things about regular expressions. However, we do not need to detect the email addresses and filter the data first if we want to extract this information. Consider how the str_extract() and str_extract_all() functions work. Both need strings and a pattern (such as is_email). The str_extract() function returns the first match for each string, or NA if the string does not include the pattern; str_extract_all() returns all matches.
# Extract the complete match
str_extract(strings, is_email)
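With a pattern that consists of the @ sign only, str_extract() can return nothing but the @ sign itself. To pull out whole addresses, a fuller pattern is needed. The following is a minimal sketch: email_pattern is a deliberately simple pattern made up for illustration (it is not a complete email validator and not necessarily the pattern used in the book), and strings_demo is a hypothetical vector.

```r
library(stringr)

# Hypothetical input strings
strings_demo <- c(
  "Dustin Henderson",
  "hop@gmx.com jim.hopper@hawkinspd.com",
  "Erica-Sinclair@aol.com"
)

# Assumed, simplified pattern: allowed characters, "@", allowed characters
email_pattern <- "[A-Za-z0-9._\\-]+@[A-Za-z0-9.\\-]+"

# First match per string, NA if there is none
first_match <- str_extract(strings_demo, email_pattern)
first_match

# All matches per string, returned as a list with one element per string
all_matches <- str_extract_all(strings_demo, email_pattern)
all_matches[[2]]
#> [1] "hop@gmx.com"              "jim.hopper@hawkinspd.com"
```

The first string contains no address, so str_extract() returns NA for it and str_extract_all() returns an empty character vector.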
9.4 Join and split
The stringr package has join and split functions. Suppose we scraped the first and the last name of a person separately, but for the survey invitation we need to combine them. Use str_c() for this job: combine firstname with lastname from sf_data, use a blank space as the separator (sep), and assign the result as names.
# Use str_c to combine strings
names <- str_c(sf_data$firstname, sf_data$lastname, sep = " ")
names
Use str_split_fixed() in the opposite scenario. Split the names vector from the last task: use the blank space as the pattern and set n to 2, because each name consists of two text chunks that we want to split.
# Split strings
str_split_fixed(names, pattern = " ", n = 2)
We used the str_sub() function to extract the user names, but we could also use the str_split() function to split the strings before and after the @ sign. Say we want to extract unique provider names this time. The str_split() function returns a list, as the next console shows. Use the pipe and the map_chr() function from purrr to get the first or second element of each list element (Henry and Wickham 2022). Furthermore, apply the stri_unique() function from stringi to keep unique provider names only (Gagolewski et al. 2022).
# Split email, get provider names, but only unique ones
str_split(emails, pattern = "@") |>
  purrr::map_chr(2) |>
  stringi::stri_unique()
The glue package offers some useful features to work with strings, especially if we create texts and documents. Suppose we want to create a sentence that describes how old a person like Jim Hopper is. I already calculated his age (hopper_age); use the paste() function to create a sentence that describes how old he is.
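A base R version of this sentence could be sketched as follows. The value of hopper_age is an assumption set by hand for illustration; in the tutorial it is already calculated.

```r
# Assumed value for illustration; the tutorial provides hopper_age
hopper_age <- 49

# paste() joins its arguments with a blank space by default
sentence <- paste("Hop is", hopper_age, "years.")
sentence
#> [1] "Hop is 49 years."

# paste0() uses no separator, so the spaces go inside the quotation marks
sentence0 <- paste0("Hop is ", hopper_age, " years.")
```

Both calls produce the same sentence; they differ only in where the blank spaces have to be typed.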
Did you notice that we need a lot of quotation marks and that we need to be careful not to introduce any errors? The str_glue() function tries to improve this situation: we can refer to objects with curly braces without further ado.
# Glue strings
str_glue("Hop is {hopper_age} years.")
#> Hop is 49 years.
The str_glue_data() function goes one step further. It returns a string for each observation of a data set. For example, build a sentence that states the first name, last name, and birth year of the Stranger Things actors.
# Glue strings from data
str_glue_data(sf_data, "- {firstname} {lastname} is born in {year}.")
#> - Millie Bobby Brown is born in 2004.
#> - Gaten Matarazzo is born in 2002.
#> - Noah Schnapp is born in 2004.
#> - Priah Ferguson is born in 2006.
#> - Matthew Modine is born in 1959.
#> - David Harbour is born in 1975.
#> - Winona Ryder is born in 1971.
#> - Finn Wolfhard is born in 2002.
#> - Natalia Dyer is born in 1995.
Finally, the package offers functions to order strings and manage their length.
9.5 Length and order
Do not forget that stringr comes with example strings (fruit, sentences) that let you test the functions before you run them in the wild, but of course we can also build our own fruits. So, do you remember how we can determine the length of strings?
# Length of a string (the last fruit is padded with white space)
fruits <- c("banana", "apricot", "apple", "  pear   ")
str_length(fruits)
#> [1] 6 7 5 9
Unfortunately, the fruits vector includes a mistake. There is a lot of white space around the last fruit. Do you know how to get rid of such noise?
# Trim your strings
fruits <- str_trim(fruits)
fruits
#> [1] "banana" "apricot" "apple" "pear"
Finally, order (str_order) and sort (str_sort) the fruits.
# Order strings
str_order(fruits)
#> [1] 3 2 1 4
# Sort strings
str_sort(fruits, decreasing = FALSE)
#> [1] "apple" "apricot" "banana" "pear"
9.6 Summary
Keep also the following functions and packages from Chapter 11 in mind:
Gagolewski, Marek, Bartek Tartanus, others; Unicode, Inc., et al. 2022. stringi: Fast and Portable Character String Processing Facilities. https://CRAN.R-project.org/package=stringi.