Regex R: Word-boundary and separated characters

Question

I am using gsub to replace words in a vector in R, following the idea of a dictionary. That is, a given set of words (synonyms), syn = c("Cash", "\\$"), is supposed to be replaced by a single word (word = "MONEY").

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")

So far I am using this to replace the synonyms:

syn <- c("Cash", "\\$")
word <- "MONEY"

regex <- paste0("\\b(", paste(syn, collapse = "|"), ")\\b")
# regex is now: \b(Cash|\$)\b

gsub(regex, word, text)
# "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate $"

This works when the $ sign is attached to alphanumerics, but fails when the sign stands on its own. If I drop the word boundary (\b), the standalone $ sign is found, but so is "Cash" in "Cashier".

How can I keep a word boundary but also match a standalone $ sign?

Wiktor Stribiżew · Accepted Answer

Use custom boundaries with a PCRE regex:

(?<!\p{L}) - beginning of a word (no letter before)
(?!\p{L}) - end of a word (no letter after)



See the regex demo.
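To see how the two lookarounds behave on the question's inputs, here is a minimal sketch. It hard-codes the pattern for the two synonyms from the question and only checks which strings contain a match; the variable names pattern and hits are just for illustration.

```r
# Inputs taken from the question.
text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")

# Custom boundaries: no letter immediately before, no letter immediately after.
pattern <- "(?<!\\p{L})(?:Cash|\\$)(?!\\p{L})"

# perl = TRUE is needed because R's default regex engine does not support lookarounds.
hits <- grepl(pattern, text, perl = TRUE)
print(hits)

# For comparison: \b needs a word character on exactly one side of the position,
# so it cannot match between "$" and the end of the string.
print(grepl("\\$\\b", "a separate $", perl = TRUE))
```

Digits are not letters, so the $ in "100$" still passes the lookbehind, while the "i" after "Cash" in "Cashier" makes the lookahead fail.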

Sample R code:

> text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
> syn <- c("Cash", "\\$")
> word <- "MONEY"
> regex <- paste0("(?<!\\p{L})(?:", paste(syn, collapse = "|"), ")(?!\\p{L})")
> gsub(regex, word, text, perl=TRUE)
[1] "I spent 100MONEY"     "MONEY can be used"    "Cashier doesnt count" "a separate MONEY"
>
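If the synonym list grows, it may be convenient to wrap this in a small function. The sketch below is one possible way to do it, not part of the original answer: replace_synonyms is a hypothetical helper name, and it uses PCRE's \Q...\E quoting so that synonyms can be passed as plain strings ("$" instead of "\\$"). It assumes no synonym contains the sequence \E and that the replacement word contains no backslashes.

```r
# Hypothetical helper (name and signature are assumptions for illustration).
replace_synonyms <- function(text, syn, word) {
  # \Q...\E makes PCRE treat each synonym literally, so "$" needs no manual escaping.
  literal <- paste0("\\Q", syn, "\\E")
  # Same custom boundaries as above: no letter directly before or after the match.
  pattern <- paste0("(?<!\\p{L})(?:", paste(literal, collapse = "|"), ")(?!\\p{L})")
  gsub(pattern, word, text, perl = TRUE)
}

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
result <- replace_synonyms(text, syn = c("Cash", "$"), word = "MONEY")
print(result)
```

Calling it once per dictionary entry (one synonym set per replacement word) keeps each pattern short and easy to check.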
