Reputation: 21400
In this sample text:
turns <- tolower(c("Does him good to stir him up now and again .",
"When , when I see him he w's on the settees .",
"Yes it 's been eery for a long time .",
"blissful timing , indeed it was "))
I'd like to extract all words that contain both the letters y and e, in any position or combination, namely yes and eery, using str_extract from stringr.
This regex, which requires y to occur immediately before e, unsurprisingly matches only yes but not eery:
unlist(str_extract_all(turns, "\\b([a-z]+)?ye([a-z]+)?\\b"))
[1] "yes"
Putting y and e into a character class doesn't give me the desired result either, because every word containing either y or e is matched:
unlist(str_extract_all(turns, "\\b([a-z]+)?[ye]([a-z]+)?\\b"))
[1] "does" "when" "when" "see" "he" "the" "settees" "yes" "been" "eery" "time" "indeed"
So what is the right solution?
Upvotes: 1
Views: 1408
Reputation: 626738
You can use either stringr or base R:
stringr::str_extract_all(turns, "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE))
Or, instead of converting the strings to lower case, you can match case-insensitively, with the (?i) modifier in stringr or ignore.case=TRUE in base R:
stringr::str_extract_all(turns, "(?i)\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE, ignore.case=TRUE))
See the regex demo and the R demo. Also, to make it a tiny bit more efficient, you can apply the principle of contrast in the lookahead patterns: match any letters but y in the first and any letters but e in the second, using character class subtraction (the -- set-difference syntax belongs to the ICU regex engine that stringr uses, so this variant is for stringr only):
stringr::str_extract_all(turns, "(?i)\\b(?=[\\p{L}--[y]]*y)(?=[\\p{L}--[e]]*e)\\p{L}+\\b")
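As a quick sanity check, here is a minimal sketch that runs the plain lookahead pattern (without class subtraction) through base R on the sample data from the question. It assumes R's PCRE build supports \p{L}, which the regmatches call above also relies on.

```r
# Sample data from the question, lower-cased as in the original post
turns <- tolower(c("Does him good to stir him up now and again .",
                   "When , when I see him he w's on the settees .",
                   "Yes it 's been eery for a long time .",
                   "blissful timing , indeed it was "))

# Two lookaheads anchored at the same word start: the first requires a y
# somewhere in the run of letters, the second requires an e
pattern <- "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b"

# gregexpr finds all matches per string; unlist flattens the per-string lists
matches <- unlist(regmatches(turns, gregexpr(pattern, turns, perl = TRUE)))
print(matches)
# [1] "yes"  "eery"
```

Because both lookaheads start at the same position, the order of y and e inside the word does not matter, which is why eery is found alongside yes.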
Details
(?i) - case insensitive modifier
\b - word boundary
(?=\p{L}*y) - after 0 or more Unicode letters, there must be y ([\p{L}--[y]]* matches any 0 or more letters but y up to the first y)
(?=\p{L}*e) - after 0 or more Unicode letters, there must be e ([\p{L}--[e]]* matches any 0 or more letters but e up to the first e)
\p{L}+ - 1 or more Unicode letters
\b - word boundary
Upvotes: 1
Reputation: 39657
If there is no urgent need to use stringr::str_extract, you can get the words containing the letters y and e in base R with strsplit and grepl:
tt <- unlist(strsplit(turns, " "))
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "eery"
If the text contains tokens that mix letters with digits, or words with punctuation attached, extract the purely alphabetic words first:
turns <- c("yes no ay ae 012y345e year.")
tt <- regmatches(turns, gregexpr("\\b[[:alpha:]]+\\b", turns))[[1]]
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes" "year"
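The same idea extends to any set of required letters. The following is only a sketch: words_with_all is a hypothetical helper name, not a function from base R or stringr, and it assumes each required item is a literal string rather than a regex.

```r
# Hypothetical helper: return the alphabetic words in `x` that contain
# every string listed in `letters_needed`
words_with_all <- function(x, letters_needed) {
  # Extract runs of letters delimited by word boundaries
  words <- unlist(regmatches(x, gregexpr("\\b[[:alpha:]]+\\b", x)))
  # One logical vector per required letter; fixed = TRUE treats the
  # letter as a literal string rather than a regular expression
  tests <- lapply(letters_needed, function(l) grepl(l, words, fixed = TRUE))
  # Combine the tests with AND, starting from all TRUE
  keep <- Reduce(`&`, tests, rep(TRUE, length(words)))
  words[keep]
}

result <- words_with_all("yes no ay ae 012y345e year.", c("y", "e"))
print(result)
# [1] "yes"  "year"
```

Using Reduce with an all-TRUE starting value avoids hard-coding one grepl call per letter, so a third or fourth required letter only needs another element in letters_needed.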
Upvotes: 1