Reputation: 109
Consider the following reproducible dataset which I created on the basis of the Donald Trump-Tweets dataset (which can be found here):
library(dplyr)
library(stringr)
df <- tibble(target = c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
                        "jeb-staffer", rep("the-media", 5)),
             tweet_id = seq(1, 10, 1))
It consists of two columns, the target group of the tweets and the tweet_id:
# A tibble: 10 x 2
   target              tweet_id
   <chr>                  <dbl>
 1 jeb-bush                   1
 2 jeb-bush                   2
 3 jeb-bush-supporters        3
 4 jeb-bush-supporters        4
 5 jeb-staffer                5
 6 the-media                  6
 7 the-media                  7
 8 the-media                  8
 9 the-media                  9
10 the-media                 10
Goal:
Whenever an element in target starts with "jeb", I want to extract the string pattern after the "-". And whenever there are multiple "-" in an element which starts with "jeb", I want to extract the string pattern after the LAST "-" (which in this example dataset would only be the case for "jeb-bush-supporters"). For every element that doesn't start with "jeb", I just want to create the string "other".
In the end, it should look like this:
# A tibble: 10 x 3
   target              tweet_id new_var
   <chr>                  <dbl> <chr>
 1 jeb-bush                   1 bush
 2 jeb-bush                   2 bush
 3 jeb-bush-supporters        3 supporters
 4 jeb-bush-supporters        4 supporters
 5 jeb-staffer                5 staffer
 6 the-media                  6 other
 7 the-media                  7 other
 8 the-media                  8 other
 9 the-media                  9 other
10 the-media                 10 other
What I have tried:
I have actually managed to create the desired output with the following code:
df %>%
  mutate(new_var = case_when(str_detect(target, "^jeb-[a-z]+$") ~
                               str_extract(target, "(?<=[a-z]{3}-)[a-z]+"),
                             str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
                               str_extract(target, "(?<=[a-z]{3}-[a-z]{4}-)[a-z]+"),
                             TRUE ~ "other"))
But the problem is this:
In the second str_extract statement, I have to define the exact number of letters in the positive lookbehind ([a-z]{4}). Otherwise R complains about needing a "bounded maximum length". But what if I don't know the exact pattern length, or if it varies from element to element?
Alternatively, I tried to work with capture groups instead of lookarounds. I therefore tried str_match to define what I WANT to extract instead of what I DON'T want to extract:
df %>%
  mutate(new_var = case_when(str_detect(target, "^jeb-[a-z]+$") ~
                               str_match(target, "jeb-([a-z]+)"),
                             str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
                               str_match(target, "jeb-[a-z]+-([a-z]+)"),
                             TRUE ~ "other"))
But then I receive this error message:
Error: Problem with `mutate()` input `new_var`.
x `str_detect(target, "^jeb-[a-z]+$") ~ str_match(target, "jeb-([a-z]+)")`, `str_detect(target, "^jeb-[a-z]+-[a-z]+") ~ str_match(target,
"jeb-[a-z]{4}-([a-z]+)")` must be length 10 or one, not 20.
i Input `new_var` is `case_when(...)`.
Question:
Ultimately, I want to know if there is a concise way of extracting specific string patterns in a case_when statement. How can I work around the problem described here when I can use neither lookarounds (because I can't define a bounded maximum length) nor capture groups (because str_match would return a vector of length 20 and not of the original size 10 or one)?
Upvotes: 2
Views: 588
Reputation: 887028
An option is to check in case_when whether the target column starts with the 'jeb-' substring (^ anchors the match at the beginning of the string), then extract the characters that are not a - ([^-]+) at the end ($) of the string, or else (TRUE) return 'other':
library(dplyr)
library(stringr)
df %>%
  mutate(new_var = case_when(str_detect(target, '^jeb-') ~
                               str_extract(target, '[^-]+$'),
                             TRUE ~ 'other'))
-output
# A tibble: 10 x 3
#   target              tweet_id new_var
#   <chr>                  <dbl> <chr>
# 1 jeb-bush                   1 bush
# 2 jeb-bush                   2 bush
# 3 jeb-bush-supporters        3 supporters
# 4 jeb-bush-supporters        4 supporters
# 5 jeb-staffer                5 staffer
# 6 the-media                  6 other
# 7 the-media                  7 other
# 8 the-media                  8 other
# 9 the-media                  9 other
#10 the-media                 10 other
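For comparison, the same "starts with jeb-, keep what follows the last -" logic can be sketched in base R without lookarounds or capture groups. This is only an illustrative alternative, not part of the dplyr solution above; the short target vector is a made-up stand-in for the column in df. It relies on the greedy ^.*- consuming everything up to and including the last hyphen.

```r
# Hypothetical stand-in for df$target, one value per distinct case
target <- c("jeb-bush", "jeb-bush-supporters", "jeb-staffer", "the-media")

# startsWith() plays the role of str_detect(target, "^jeb-");
# sub() with a greedy "^.*-" strips everything up to the LAST hyphen
new_var <- ifelse(startsWith(target, "jeb-"),
                  sub("^.*-", "", target),
                  "other")

new_var
# [1] "bush"       "supporters" "staffer"    "other"
```

Because the pattern is anchored to what gets removed rather than to what precedes the match, the length of the middle segment never has to be known in advance.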
We can also simplify this with str_match and coalesce:
df %>%
  mutate(new_var = coalesce(str_match(target, '^jeb-.*?([^-]+)$')[,2], 'other'))
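The [, 2] index is what keeps the result at the original length. str_match returns a character matrix with one row per input string and one column for the full match plus one per capture group, so with a single group and 10 rows it holds 20 elements, which is where the "not 20" in the error message comes from. A minimal sketch on a made-up three-element vector (the pattern here is an illustrative variant, not the one used above):

```r
library(stringr)

# Hypothetical stand-in for df$target
target <- c("jeb-bush", "jeb-bush-supporters", "the-media")

# One capture group: the segment after the last hyphen of a "jeb-" string
m <- str_match(target, "^jeb-(?:[a-z]+-)*([a-z]+)$")

dim(m)
# [1] 3 2   (rows = inputs, columns = full match + capture group)

m[, 2]
# [1] "bush"       "supporters" NA
```

Non-matching rows come back as NA in every column, which is why coalesce can fill in 'other' afterwards.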
Upvotes: 3