N1loon

Reputation: 109

How to extract specific string patterns in a case_when statement using regular expressions?

Consider the following reproducible dataset, which I created based on the Donald Trump tweets dataset (which can be found here):

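library(dplyr)    # tibble(), mutate(), case_when()
library(stringr)  # str_detect(), str_extract(), str_match()

df <- tibble(target = c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
                        "jeb-staffer", rep("the-media", 5)),
             tweet_id = seq(1, 10, 1))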

It consists of two columns, the target group of the tweets and the tweet_id:

# A tibble: 10 x 2
   target              tweet_id
   <chr>                  <dbl>
 1 jeb-bush                   1
 2 jeb-bush                   2
 3 jeb-bush-supporters        3
 4 jeb-bush-supporters        4
 5 jeb-staffer                5
 6 the-media                  6
 7 the-media                  7
 8 the-media                  8
 9 the-media                  9
10 the-media                 10

Goal:

Whenever an element in target starts with jeb, I want to extract the string pattern after the -. And whenever there are multiple - in an element which starts with jeb, I want to extract the string pattern after the LAST - (which in this example dataset would only be the case for jeb-bush-supporters). For every element that doesn't start with jeb, I just want to create the string other. In the end, it should look like this:

# A tibble: 10 x 3
   target              tweet_id new_var   
   <chr>                  <dbl> <chr>     
 1 jeb-bush                   1 bush      
 2 jeb-bush                   2 bush      
 3 jeb-bush-supporters        3 supporters
 4 jeb-bush-supporters        4 supporters
 5 jeb-staffer                5 staffer   
 6 the-media                  6 other     
 7 the-media                  7 other     
 8 the-media                  8 other     
 9 the-media                  9 other     
10 the-media                 10 other    

What I have tried:

I have actually managed to create the desired output with the following code:

df %>%
  mutate(new_var = case_when(
    str_detect(target, "^jeb-[a-z]+$") ~
      str_extract(target, "(?<=[a-z]{3}-)[a-z]+"),
    str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
      str_extract(target, "(?<=[a-z]{3}-[a-z]{4}-)[a-z]+"),
    TRUE ~ "other"
  ))

But the problem is this:

In the second str_extract statement, I have to define the exact number of letters in the 'Positive Look Behind' ([a-z]{4}). Otherwise R complains about needing a "bounded maximum length". But what if I don't know the exact pattern length, or if it varies from element to element?

Alternatively, I tried to work with capture groups instead of with "Look Arounds". Therefore, I tried to include str_match to define what I WANT to extract instead of what I DON'T want to extract:

df %>% 
    mutate(new_var = case_when(str_detect(target, "^jeb-[a-z]+$") ~
                             str_match(target, "jeb-([a-z]+)"),
                           str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
                             str_match(target, "jeb-[a-z]+-([a-z]+)"),
                           TRUE ~ "other"))

But then I receive this error message:

Error: Problem with `mutate()` input `new_var`.
x `str_detect(target, "^jeb-[a-z]+$") ~ str_match(target, "jeb-([a-z]+)")`, `str_detect(target, "^jeb-[a-z]+-[a-z]+") ~ str_match(target, 
    "jeb-[a-z]{4}-([a-z]+)")` must be length 10 or one, not 20.
i Input `new_var` is `case_when(...)`.

Question:

Ultimately, I want to know if there is a concise way of extracting specific string patterns in a case_when-statement. How would I work around the problem that I stated here, when I wouldn't be able to use "Look Arounds" (because I can't define a bounded maximum length) nor capture groups (because str_match would return a vector of length 20 and not of the original size 10 or one)?

Upvotes: 2

Views: 588

Answers (1)

akrun

Reputation: 887028

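One option is to use case_when to check whether the target column starts (^) with the substring 'jeb-'. If it does, extract the run of characters that are not a - ([^-]+) at the end ($) of the string; otherwise (TRUE), return 'other'. Because the pattern is anchored at the end of the string, it picks up the part after the last - no matter how many - there are or how long each part is, so no lookbehind is needed.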

library(dplyr)
library(stringr)
df %>%
  mutate(new_var = case_when(
    str_detect(target, '^jeb-') ~ str_extract(target, '[^-]+$'),
    TRUE ~ 'other'
  ))

Output:

# A tibble: 10 x 3
#   target              tweet_id new_var   
#   <chr>                  <dbl> <chr>     
# 1 jeb-bush                   1 bush      
# 2 jeb-bush                   2 bush      
# 3 jeb-bush-supporters        3 supporters
# 4 jeb-bush-supporters        4 supporters
# 5 jeb-staffer                5 staffer   
# 6 the-media                  6 other     
# 7 the-media                  7 other     
# 8 the-media                  8 other     
# 9 the-media                  9 other     
#10 the-media                 10 other    
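To illustrate why the end-anchored pattern does not depend on segment length, here is a minimal sketch on a plain character vector. The vector x and the value "jeb-a-bc-def" are made up for illustration and are not part of the original data; ifelse() is used only to keep the example short.

```r
library(stringr)

# Hypothetical targets with a varying number and length of segments
x <- c("jeb-bush", "jeb-bush-supporters", "jeb-a-bc-def", "the-media")

# "[^-]+$" matches the run of non-hyphen characters at the end of the string,
# so no fixed-width lookbehind is required
last_part <- str_extract(x, "[^-]+$")

# Keep the extracted part only where the string starts with "jeb-"
new_var <- ifelse(str_detect(x, "^jeb-"), last_part, "other")
new_var
```

Note that str_extract() also returns a value for "the-media" here; the str_detect() condition decides whether that value is kept.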

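We can also simplify this with str_match and coalesce. str_match returns a character matrix whose first column is the full match and whose second column is the capture group, so [,2] selects the captured part. Rows that do not match the pattern are NA, and coalesce replaces those with 'other':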

df %>% 
   mutate(new_var = coalesce(str_match(target, '^jeb-.*?([^-]+)$')[,2], 'other')) 
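The "must be length 10 or one, not 20" error in the question most likely comes from passing the whole str_match result to case_when: for 10 inputs and one capture group it is a 10 x 2 matrix, i.e. 20 elements. The sketch below, on a small made-up vector x, shows the shape of the result and how indexing the second column gives a plain vector again.

```r
library(stringr)
library(dplyr)

# Small illustrative vector (not the full dataset)
x <- c("jeb-bush", "jeb-bush-supporters", "the-media")

# One row per input; column 1 is the full match, column 2 the capture group
m <- str_match(x, "^jeb-.*?([^-]+)$")
dim(m)

# Selecting column 2 yields a character vector with NA for non-matches
captured <- m[, 2]

# coalesce() fills the NA entries with the fallback value
filled <- coalesce(captured, "other")
filled
```

The same indexing would also work inside case_when, for example str_match(target, "jeb-([a-z]+)")[, 2], since each right-hand side then has one element per row.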

Upvotes: 3
