David

Reputation: 10152

Regex R: Word-boundary and separated characters

I am using gsub to replace words in a character vector in R, following the idea of a dictionary. That is, a given set of words (synonyms), syn = c("Cash", "\\$"), should be replaced by a single word (word = "MONEY").

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")

So far I am using this to replace the synonyms:

syn <- c("Cash", "\\$")
word <- "MONEY"

regex <- paste0("\\b(", paste(syn, collapse = "|"), ")\\b")
# "\\b(Cash|\\$)\\b"

gsub(regex, word, text)
# "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate $" 

This works when the $ sign is attached to alphanumeric characters, but fails when the sign stands on its own. If I drop the word boundaries (\\b), the standalone $ sign is found, but so is "Cash" in "Cashier".
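To illustrate the second case, here is a minimal sketch with the same text, syn and word as above, only without the \\b anchors (regex_nb and res_nb are names made up for this illustration):

```r
text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
syn <- c("Cash", "\\$")
word <- "MONEY"

# Same alternation as before, but with no word boundaries around the group
regex_nb <- paste0("(", paste(syn, collapse = "|"), ")")
# "(Cash|\\$)"

res_nb <- gsub(regex_nb, word, text)
res_nb
# The standalone "$" is now replaced, but "Cashier" turns into "MONEYier"
```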

How can I keep the word boundary for "Cash" and still match a standalone $ sign?

Upvotes: 2

Views: 855

Answers (2)

Shenglin Chen

Reputation: 4554

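# Works for this syn because the non-word pattern ("\\$") comes last:
# only "Cash" ends up with \\b on both sides.
regex <- paste0("\\b", paste(syn, collapse = "\\b|"))
# "\\bCash\\b|\\$"
gsub(regex, word, text)
# [1] "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate MONEY"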
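The one-liner above depends on the order of syn. A possible generalization, sketched here under the assumption that each synonym is a literal string with its metacharacters already escaped, is to add \\b only on the sides where a synonym starts or ends with a word character. The helper add_boundaries is hypothetical, and perl = TRUE is my choice rather than part of the answer above:

```r
text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
syn <- c("\\$", "Cash")   # order no longer matters
word <- "MONEY"

# Hypothetical helper: add \b only next to a leading/trailing word character
add_boundaries <- function(p) {
  if (grepl("^\\w", p, perl = TRUE)) p <- paste0("\\b", p)
  if (grepl("\\w$", p, perl = TRUE)) p <- paste0(p, "\\b")
  p
}

regex <- paste(vapply(syn, add_boundaries, character(1)), collapse = "|")
# "\\$|\\bCash\\b"

res <- gsub(regex, word, text, perl = TRUE)
res
```

This is only a heuristic: a synonym that ends in an escape such as \\d would also receive a trailing \\b, which may or may not be what you want.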

Upvotes: 0

Wiktor Stribiżew

Reputation: 626747

Use custom boundaries with a PCRE regex:

  • (?<!\p{L}) - beginning of a word (no letter before)
  • (?!\p{L}) - end of a word (no letter after)


Sample R code:

> text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
> syn <- c("Cash", "\\$")
> word <- "MONEY"
> regex <- paste0("(?<!\\p{L})(?:", paste(syn, collapse = "|"), ")(?!\\p{L})")
> gsub(regex, word, text, perl=TRUE)
[1] "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate MONEY"
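Since the question mentions a dictionary, the same pattern could be applied once per entry. The following is a rough sketch, not a tested library function: replace_dict, dict and the CAR entry are hypothetical names and data introduced only for illustration.

```r
# Hypothetical dictionary: names are replacement words, values are synonym patterns
dict <- list(
  MONEY = c("Cash", "\\$"),
  CAR   = c("Auto", "Vehicle")
)

# Build the letter-based boundaries around each synonym group and replace in turn
replace_dict <- function(text, dict) {
  for (word in names(dict)) {
    regex <- paste0("(?<!\\p{L})(?:", paste(dict[[word]], collapse = "|"), ")(?!\\p{L})")
    text <- gsub(regex, word, text, perl = TRUE)
  }
  text
}

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
out <- replace_dict(text, dict)
out
```

Two things to keep in mind: the boundaries only exclude letters, so a digit next to a synonym still counts as a boundary (which is why "100$" is replaced), and entries are applied in order, so a later entry could match a word inserted by an earlier one.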

Upvotes: 2
