Reputation: 769
I've got the following code, which I expect to give me a list of 3, since there are 3 elements in texts:
library(stringr)
texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)","(:",";)",":D")
str_extract_all(texts, fixed(smileys))
Instead, I get a list of four (the length of my "pattern" argument, here smileys). Additionally, I get the following warning message:
Warning message:
In stri_extract_all_fixed(string, pattern, simplify = simplify, :
  longer object length is not a multiple of shorter object length
Well, I don't expect the lengths to match, as I'm looking for any hits on any of the smileys in each text. I don't want to match string 1 with pattern 1, string 2 with pattern 2, etc.
Aware that I am messing up stringi's understanding of vectorizing, I have tried this instead:
library(purrr)
texts %>% map(~ str_extract_all(.x, fixed(smileys)))
This is much better, as it gives me a list of 3, but each element is in turn a list of four.
What I'm trying to get to is a list of 3 that is as little nested as possible. Someone, somewhere, has solved this, but I can't for the life of me figure it out or get how to google it. I could do a for loop over this, but I consider myself a citizen of the tidyverse...
Grateful for any assistance.
Upvotes: 2
Views: 408
Reputation: 17611
You can use paste to wrap each element of smileys with \\Q and \\E and collapse on the regex "or" metacharacter (|) to form a single pattern. As mentioned in the link Henrik shared and documented on ?regex and in the stringi manual, characters between \\Q and \\E are interpreted literally.
pattern <- paste("\\Q", smileys, "\\E", sep = "", collapse = "|")
# [1] "\\Q:)\\E|\\Q(:\\E|\\Q;)\\E|\\Q:D\\E"
library(stringi)
stri_extract_all_regex(texts, pattern)
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#[1] NA
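If you'd rather stay in stringr, the same combined pattern can probably be passed straight to str_extract_all, since stringr is built on stringi and its default regex engine also understands \\Q...\\E. A minimal sketch, assuming stringr is installed; the variable name result is only for illustration. Note that stringr gives character(0) rather than NA for a text with no match:

```r
library(stringr)

texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)", "(:", ";)", ":D")

# One alternation pattern, with each smiley quoted literally via \Q...\E
pattern <- paste("\\Q", smileys, "\\E", sep = "", collapse = "|")

# A single pattern is applied to every text, so the result has one
# element per text and no recycling warning is raised
result <- str_extract_all(texts, pattern)
```

Because there is only one pattern, nothing is paired up element by element, which is what produced the list of four in the question.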
Base R:
regmatches(texts, gregexpr(pattern, texts))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# If you want an NA instead of a zero-length vector,
# then you could do something like the following
# (if/else rather than ifelse(), so that texts with several matches keep all of them):
# lapply(
#   regmatches(texts, gregexpr(pattern, texts)),
#   function(ii) if (length(ii) == 0L) NA_character_ else ii)
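The base R steps can also be wrapped into a small function. This is only a sketch: extract_literals is a hypothetical name, not something from a package, and perl = TRUE is my own choice here so that PCRE is the engine handling \\Q...\\E. It assumes none of the literals contain \E themselves.

```r
# Hypothetical helper (not from any package): extract every literal
# occurrence of `literals` from each element of `texts`, base R only.
extract_literals <- function(texts, literals) {
  # Quote each literal with \Q...\E and join with the regex "or" (|)
  pattern <- paste0("\\Q", literals, "\\E", collapse = "|")
  # perl = TRUE is an assumption made here so that PCRE interprets \Q...\E
  hits <- regmatches(texts, gregexpr(pattern, texts, perl = TRUE))
  # Replace zero-length results with NA, leaving multiple matches intact
  lapply(hits, function(x) if (length(x) == 0L) NA_character_ else x)
}

texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)", "(:", ";)", ":D")
result <- extract_literals(texts, smileys)
```

A text containing more than one smiley, such as ":D and :)", should come back as a character vector with both matches in order of appearance.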
And if you do want to use purrr and avoid regular expressions, one idea would be something like this:
library(purrr)
library(stringr)
texts %>%
map(~ unlist(str_extract_all(.x, fixed(smileys))))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# If you want NA, not a zero-length vector, you could add:
# %>% map(~ if (length(.x) == 0L) NA_character_ else .x)
Upvotes: 2