Reputation: 132
While "Replace two dots in a string with gsub" answers the question of replacing punctuation characters like '.', that approach does not seem to work with word boundaries. For example,
text100 <- "My # is 1234"
text1 <- gsub("\\b#\\b","hash",text100)
> text1
[1] "My # is 1234"
The # is not getting replaced. How can I address this?
Note that multiple #s should not be replaced. For example, '##' should NOT be replaced with 'hash' or 'hashhash'.
A # followed or preceded by any graph character should not be replaced either (for example, '.#' should not be replaced).
Upvotes: 1
Views: 802
Reputation: 626870
Your regex does not work because the hash is not a word character and you require a word character to be on both sides of the hash.
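To see that boundary requirement in action, here is a minimal sketch. The second sample string is made up for illustration, and perl=TRUE is used here only so that \b follows PCRE semantics:

```r
# First input is from the question; "a#b" is an illustrative addition.
x <- c("My # is 1234", "a#b")

# \b matches between a word character and a non-word character.
# Because # is itself a non-word character, \b#\b can only match
# when word characters sit directly on both sides of the hash.
boundary_result <- gsub("\\b#\\b", "hash", x, perl = TRUE)
print(boundary_result)
# [1] "My # is 1234" "ahashb"
```

So the pattern only fires in the one situation you do not want, where the hash is glued between two word characters.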
If you want to make sure there are no word characters around the # symbol, use a Perl-style regex replacement:
text100 <- "My # is 1234"
gsub("(?<!\\w)\\#+(?!\\w)", "hash", text100, perl=TRUE)
See IDEONE demo
The look-behind (?<!\\w) makes sure there is no letter, digit or underscore before #, and the (?!\\w) look-ahead makes sure there is no letter, digit or underscore after it.
To avoid overescaping, you may put the hash into a character class:
"(?<!\\w)[#]+(?!\\w)"
Using the + quantifier after the hash symbol makes sure multiple consecutive hashes are replaced with a single word "hash".
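As a quick check of how the lookarounds and the + quantifier behave together, here is a small sketch using the character-class form of the pattern. All inputs except the first are made-up test strings:

```r
# Illustrative inputs; only the first comes from the question.
x <- c("My # is 1234", "My ## is 1234", "a#b", "tag#")

# (?<!\w) : no word character immediately before the run of hashes
# [#]+    : one or more consecutive hashes, replaced as a single unit
# (?!\w)  : no word character immediately after the run
lookaround_result <- gsub("(?<!\\w)[#]+(?!\\w)", "hash", x, perl = TRUE)
print(lookaround_result)
# [1] "My hash is 1234" "My hash is 1234" "a#b"             "tag#"
```

Note that "##" collapses to a single "hash" here, and hashes touching a word character on either side are left alone.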
UPDATE
A solution that should work for your updated example:
gsub("(?<!\\w|#)[#](?!\\w|#)", "hash", text100, perl=TRUE)
Here, (?<!\\w|#) will make sure that a hash is not preceded by a word character or a hash symbol, and the (?!\\w|#) negative lookahead will make sure there is no word character or hash symbol after it.
See another demo
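For reference, here is a sketch of how the updated pattern behaves on a few made-up inputs. It leaves '##' alone, but a hash next to punctuation such as '.#' is still replaced. If the requirement is that any adjacent graph character blocks the replacement, one possible variant (an assumption here, not part of the solution above) is to require whitespace or a string edge on both sides with \S lookarounds:

```r
# Illustrative inputs; only the first comes from the question.
x <- c("My # is 1234", "My ## is 1234", "a#b", ".#")

# Updated pattern: a single hash not adjacent to a word character or another hash.
single_result <- gsub("(?<!\\w|#)[#](?!\\w|#)", "hash", x, perl = TRUE)
print(single_result)
# [1] "My hash is 1234" "My ## is 1234"   "a#b"             ".hash"

# Assumed stricter variant: no non-whitespace character on either side,
# so the hash must be surrounded by whitespace or the start/end of the string.
strict_result <- gsub("(?<!\\S)#(?!\\S)", "hash", x, perl = TRUE)
print(strict_result)
# [1] "My hash is 1234" "My ## is 1234"   "a#b"             ".#"
```

The stricter variant also covers the '##' case, since the neighbouring hash counts as a non-whitespace character.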
Upvotes: 5