Venkat Ramakrishnan

Reputation: 132

Word boundaries handling for punctuation characters in regexp in R

While "Replace two dots in a string with gsub" answers the question of how to replace punctuation characters like '.', the same approach does not seem to work with word boundaries. For example,

text100 <- "My # is 1234"
text1 <- gsub("\\b#\\b","hash",text100)
> text1
[1] "My # is 1234"

The # is not replaced. How can I address this?

Note that multiple #s should not be replaced. For example,

'##' should NOT be replaced as 'hash' or 'hashhash'.

# followed or preceded by any graph character should not be replaced (for example, '.#' should not be replaced)

Upvotes: 1

Views: 802

Answers (1)

Wiktor Stribiżew

Reputation: 626870

Your regex does not work because the hash is not a word character: \\b#\\b can only match when there is a word character on both sides of the hash, and in your string the hash is surrounded by spaces.
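To make the boundary behaviour concrete, here is a minimal sketch. The second input string and the variable names are made up for illustration, and perl = TRUE is set only so that \\b is certain to follow PCRE semantics.

```r
# \b matches only between a word character and a non-word character.
# A hash surrounded by spaces has no such position on either side.
spaced <- gsub("\\b#\\b", "hash", "My # is 1234", perl = TRUE)

# A hash squeezed between two letters has a boundary on both sides.
embedded <- gsub("\\b#\\b", "hash", "a#b", perl = TRUE)

print(spaced)    # [1] "My # is 1234"
print(embedded)  # [1] "ahashb"
```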

If you want to make sure there are no word characters around the # symbol, use a Perl-style regex replacement:

text100 <- "My # is 1234"
gsub("(?<!\\w)\\#+(?!\\w)", "hash", text100, perl = TRUE)

See IDEONE demo

The look-behind (?<!\\w) makes sure there is no letter, digit or underscore before #, and the (?!\\w) look-ahead makes sure there is no letter, digit or underscore after it.

To avoid overescaping, you may put the hash into a character class:

"(?<!\\w)[#]+(?!\\w)"

Using the + quantifier after the hash symbol makes sure that a run of consecutive hashes is replaced with a single word "hash".
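A short sketch of how this pattern behaves on a few inputs. Only the first string comes from the question; the others are illustrative assumptions.

```r
pattern <- "(?<!\\w)[#]+(?!\\w)"

inputs <- c(
  "My # is 1234",  # lone hash between spaces
  "a ## b",        # run of hashes, collapsed into one "hash"
  "C# code",       # hash preceded by a word character: left alone
  "#tag"           # hash followed by a word character: left alone
)

result <- gsub(pattern, "hash", inputs, perl = TRUE)
print(result)
# [1] "My hash is 1234" "a hash b"        "C# code"         "#tag"
```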

UPDATE

A solution that should work for your updated example:

gsub("(?<!\\w|#)[#](?!\\w|#)", "hash", text100, perl = TRUE)

Here, the negative lookbehind (?<!\\w|#) makes sure that the hash is not preceded by a word character or another hash, and the negative lookahead (?!\\w|#) makes sure that no word character or hash follows it.
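The sketch below runs this pattern on three inputs (the second and third are made-up test strings). It still replaces the hash in '.#', because '.' is neither a word character nor a hash. If that case must be excluded as well, one possible variant, which is an assumption here and not part of the original answer, is to look for any [[:graph:]] character on either side instead.

```r
answer_pattern <- "(?<!\\w|#)[#](?!\\w|#)"
# Hypothetical stricter variant: no visible character next to the hash.
strict_pattern <- "(?<![[:graph:]])#(?![[:graph:]])"

inputs <- c("My # is 1234", "My ## is 1234", "My .# is 1234")

by_answer <- gsub(answer_pattern, "hash", inputs, perl = TRUE)
by_strict <- gsub(strict_pattern, "hash", inputs, perl = TRUE)

print(by_answer)
# [1] "My hash is 1234"  "My ## is 1234"    "My .hash is 1234"
print(by_strict)
# [1] "My hash is 1234" "My ## is 1234"   "My .# is 1234"
```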

See another demo

Upvotes: 5
