Reputation: 10152
I am using gsub to replace words in a vector in R, following the idea of a dictionary. That is, a given set of words (synonyms), syn = c("Cash", "\\$"), is supposed to be replaced by a single word (word = "MONEY").
text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
So far I am using this to replace the synonyms:
syn <- c("Cash", "\\$")
word <- "MONEY"
regex <- paste0("\\b(", paste(syn, collapse = "|"), ")\\b")
# "\\b(Cash|\\$)\\b"
gsub(regex, word, text)
# "I spent 100MONEY" "MONEY can be used" "Cashier doesnt count" "a separate $"
This works when the $-sign is attached to alphanumerics, but fails if the sign stands alone. If I drop the word boundary (\\b), the $-sign is found, but so is "Cash" in "Cashier".
How can I keep the word boundary but still match a standalone $-sign?
Upvotes: 2
Views: 855
Reputation: 4554
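# Put \\b around the word-like synonym only; the $ alternative stays unbounded.
# This is specific to this syn: only the first synonym gets a leading \\b
# and the last one gets no trailing \\b.
regex <- paste0("\\b", paste(syn, collapse = "\\b|"))
# "\\bCash\\b|\\$"
gsub(regex, word, text)
# [1] "I spent 100MONEY" "MONEY can be used" "Cashier doesnt count" "a separate MONEY"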
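The pattern above depends on the order and number of synonyms. A minimal sketch of a more general variant follows, assuming the synonyms are already regex-escaped as in the question. The helper name bound_synonym is hypothetical, and perl = TRUE is used here so that \\b follows PCRE semantics.

```r
text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
syn  <- c("Cash", "\\$")
word <- "MONEY"

# Hypothetical helper: add \b only on a side of the synonym that starts or
# ends with a word character (letter, digit or underscore).
bound_synonym <- function(s) {
  # Drop regex escapes such as "\\$" before inspecting the first/last character.
  literal <- gsub("\\\\(.)", "\\1", s)
  left  <- if (grepl("^[[:alnum:]_]", literal)) "\\b" else ""
  right <- if (grepl("[[:alnum:]_]$", literal)) "\\b" else ""
  paste0(left, s, right)
}

regex  <- paste(vapply(syn, bound_synonym, character(1)), collapse = "|")
# "\\bCash\\b|\\$"
result <- gsub(regex, word, text, perl = TRUE)
print(result)
```

With this, the order of syn no longer matters, and symbols such as $ are matched wherever they occur while word-like synonyms keep both boundaries.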
Upvotes: 0
Reputation: 626747
Use custom boundaries with a PCRE regex:
(?<!\p{L}) - beginning of a word (no letter before)
(?!\p{L}) - end of a word (no letter after)
See the regex demo.
Sample R code:
> text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count", "a separate $")
> syn <- c("Cash", "\\$")
> word <- "MONEY"
> regex <- paste0("(?<!\\p{L})(?:", paste(syn, collapse = "|"), ")(?!\\p{L})")
> gsub(regex, word, text, perl = TRUE)
[1] "I spent 100MONEY" "MONEY can be used" "Cashier doesnt count" "a separate MONEY"
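Since the question mentions a dictionary, the same letter-based boundaries could be applied to several replacement words in a loop. The following is only a sketch: replace_synonyms, the dict layout, and the TIME entry are assumptions made for illustration, not part of the question.

```r
# Hypothetical dictionary: each name is the replacement word, each value
# holds its synonyms, already escaped for use in a regex.
dict <- list(
  MONEY = c("Cash", "\\$"),
  TIME  = c("hours", "minutes")  # illustrative entry, not from the question
)

# Hypothetical helper: apply the letter-based boundaries for every entry.
replace_synonyms <- function(text, dict) {
  for (word in names(dict)) {
    regex <- paste0("(?<!\\p{L})(?:",
                    paste(dict[[word]], collapse = "|"),
                    ")(?!\\p{L})")
    text <- gsub(regex, word, text, perl = TRUE)
  }
  text
}

text <- c("I spent 100$", "Cash can be used", "Cashier doesnt count",
          "a separate $", "it took 3 hours")
result <- replace_synonyms(text, dict)
print(result)
```

Entries are applied in order, so a replacement word that is itself a synonym in a later entry would be replaced again; keeping replacement words distinct from all synonyms avoids that.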
Upvotes: 2