Chris Ruehlemann

Reputation: 21400

How to extract words containing combinations of certain characters in R

In this sample text:

turns <- tolower(c("Does him good to stir him up now and again .", 
           "When , when I see him he w's on the settees .",
           "Yes it 's been eery for a long time .",
           "blissful timing , indeed it was "))

I'd like to extract all words that contain both the letters y and e, in any position and any order, namely yes and eery, using str_extract_all from stringr.

This regex, in which I require that y occurs immediately before e, unsurprisingly matches only yes and not eery:

library(stringr)
unlist(str_extract_all(turns, "\\b([a-z]+)?ye([a-z]+)?\\b"))
[1] "yes"

Putting y and e into a character class doesn't give the desired result either, because every word containing either y or e is matched:

unlist(str_extract_all(turns, "\\b([a-z]+)?[ye]([a-z]+)?\\b")) 
 [1] "does"    "when"    "when"    "see"     "he"      "the"     "settees" "yes"     "been"    "eery"    "time"    "indeed" 

So what is the right solution?

Upvotes: 1

Views: 1408

Answers (2)

Wiktor Stribiżew

Reputation: 626738

You may use both base R and stringr approaches:

stringr::str_extract_all(turns, "\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE))

Or, without turning the strings to lower case, you may use case-insensitive matching with (?i) (or ignore.case=TRUE in base R):

stringr::str_extract_all(turns, "(?i)\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b")
regmatches(turns, gregexpr("\\b(?=\\p{L}*y)(?=\\p{L}*e)\\p{L}+\\b", turns, perl=TRUE, ignore.case=TRUE))

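See the regex demo and the R demo. Also, if you want to make it a tiny bit more efficient, you may apply the principle of contrast in the lookahead patterns: match any letters but y in the first lookahead and any letters but e in the second, using character class subtraction. Note that the [..--[..]] subtraction syntax belongs to the ICU regex flavor that stringr uses, so this variant is for stringr only, not for base R with perl=TRUE: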

stringr::str_extract_all(turns, "(?i)\\b(?=[\\p{L}--[y]]*y)(?=[\\p{L}--[e]]*e)\\p{L}+\\b")

Details

  • (?i) - case insensitive modifier
  • \b - word boundary
  • (?=\p{L}*y) - positive lookahead: after 0 or more Unicode letters, there must be a y (in the subtraction variant, [\p{L}--[y]]* matches any 0 or more letters other than y, up to the first y)
  • (?=\p{L}*e) - positive lookahead: after 0 or more Unicode letters, there must be an e (in the subtraction variant, [\p{L}--[e]]* matches any 0 or more letters other than e, up to the first e)
  • \p{L}+ - 1 or more Unicode letters
  • \b - word boundary
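
The pattern above generalizes to any set of required letters by adding one lookahead per letter. A minimal sketch in base R, assuming each required character is a single letter with no regex metacharacters; the helper name words_with_all is hypothetical and not part of stringr or base R:

```r
# Hypothetical helper (an illustration, not from the original answer):
# build one lookahead per required letter, then extract the matching words.
# Assumes every element of `chars` is a plain single letter.
words_with_all <- function(x, chars) {
  lookaheads <- paste0("(?=\\p{L}*", chars, ")", collapse = "")
  pattern <- paste0("\\b", lookaheads, "\\p{L}+\\b")
  unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE)))
}

turns <- c("Yes it 's been eery for a long time .",
           "blissful timing , indeed it was ")

words_with_all(turns, c("y", "e"))
# [1] "Yes"  "eery"
words_with_all(turns, c("t", "i", "m"))
# [1] "time"   "timing"
```

Because the lookaheads are all anchored at the same word boundary, the order in which the letters appear inside the word does not matter.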

Upvotes: 1

GKi

Reputation: 39657

If you don't strictly need stringr::str_extract, you can get the words containing both y and e in base R with strsplit and grepl:

tt <- unlist(strsplit(turns, " "))
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes"  "eery"

If the letters can also occur in tokens that are not purely alphabetic words (such as 012y345e below), extract the alphabetic words first:

turns <- c("yes no ay ae 012y345e year.")
tt <- regmatches(turns, gregexpr("\\b[[:alpha:]]+\\b", turns))[[1]]
tt[grepl("y", tt) & grepl("e", tt)]
#[1] "yes"  "year"
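
The same filter can be written for an arbitrary set of letters by combining one grepl test per letter. A rough sketch, assuming the tokens have already been extracted; the helper name keep_with_all is made up for illustration, and perl = TRUE is my choice here so that \b follows PCRE rules:

```r
# Hypothetical helper (an illustration, not from the original answer):
# keep the tokens that contain every character in `chars`.
keep_with_all <- function(tokens, chars) {
  # One logical vector per required character, matched literally.
  has_each <- lapply(chars, function(ch) grepl(ch, tokens, fixed = TRUE))
  # A token is kept only if all of the tests are TRUE for it.
  tokens[Reduce(`&`, has_each)]
}

txt <- "yes no ay ae 012y345e year."
tokens <- regmatches(txt, gregexpr("\\b[[:alpha:]]+\\b", txt, perl = TRUE))[[1]]

keep_with_all(tokens, c("y", "e"))
# [1] "yes"  "year"
```

Using fixed = TRUE treats each character literally, so the helper also works for characters such as "." that would otherwise be read as regex syntax.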

Upvotes: 1
