Xiaoyi Zhang

Reputation: 37

How to remove digits in a string except those in hashtags using regex

I'm processing some Twitter text, and I want to remove all digits in a tweet except those that appear in hashtags. For example,

'I wrote 16 scripts in #code100day challenge2019 in 10day' 

should become

'I wrote scripts in #code100day challenge in day'

Note that digits attached to letters should also be removed (e.g. 'challenge2019' --> 'challenge', '10day' --> 'day').

I tried:

import re

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
text = re.sub(r"^(?!#)\d+", "", text)

But it does not do anything to the input string.

I also tried a negative lookbehind, to remove all digits except those directly following the '#' symbol:

text = re.sub(r"(?<!#)\d+", "", text)

But now it removes all the digits, whether or not they are in a hashtag:

'I wrote  scripts in #codeday challenge in day'

Any suggestions?

Upvotes: 2

Views: 488

Answers (3)

CertainPerformance

Reputation: 371208

One option is to match # followed by non-space characters (and, if matched, replace with the whole match, effectively leaving the hashtag alone), or match digit characters and remove them:

output = re.sub(
    r'#\S+|\d+',
    lambda match: match.group(0) if match.group(0).startswith('#') else '',
    txt
)
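For reference, the first option can be run end to end as in the sketch below. The function name strip_digits_outside_hashtags and the final whitespace cleanup are assumptions added for illustration; they are not part of the snippet above.

```python
import re

def strip_digits_outside_hashtags(text):
    # Hypothetical helper name, used only for this illustration.
    # The hashtag alternative is tried first, so a hashtag is consumed whole
    # and returned unchanged; any other run of digits is replaced with ''.
    return re.sub(
        r'#\S+|\d+',
        lambda m: m.group(0) if m.group(0).startswith('#') else '',
        text,
    )

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
result = strip_digits_outside_hashtags(text)

# Removing '16' leaves the spaces on both sides of it, so collapse
# whitespace in a separate step (an addition, not in the original answer).
cleaned = ' '.join(result.split())
print(cleaned)
```

The substitution on its own leaves a double space where '16' stood, which is why the sketch normalizes whitespace afterwards.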

If you can use the regex module, you can use (*SKIP)(*FAIL) after matching hashtags instead, to effectively skip them if matched:

output = regex.sub(r'#\S+(*SKIP)(*FAIL)|\d+', '', txt)

Upvotes: 1

Emma

Reputation: 27743

My guess is that using an alternation would likely be faster and simpler than lookarounds:

import re

test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "

print(re.sub(r"^\s+|\s+$","",re.sub(r"\s{2,}"," ",re.sub(r"(#\S+)|(\d+)", "\\1", test_str))))

Output

I wrote scripts in #code100day challenge in day
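The one-liner nests three substitutions, which can be hard to read. Unpacked, it might look like the sketch below; the intermediate variable names are made up for illustration, and str.strip() is swapped in for the trimming regex. It assumes Python 3.5 or later, where a backreference to a group that did not participate in the match is replaced with an empty string.

```python
import re

test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "

# Step 1: a hashtag is captured in group 1 and written back via \1;
# a bare digit run matches group 2 only, so \1 is empty and it disappears.
no_digits = re.sub(r"(#\S+)|(\d+)", r"\1", test_str)

# Step 2: collapse the runs of whitespace left behind by removed numbers.
single_spaced = re.sub(r"\s{2,}", " ", no_digits)

# Step 3: trim leading and trailing whitespace
# (stands in for the ^\s+|\s+$ substitution).
trimmed = single_spaced.strip()

print(trimmed)
```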

The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.

Upvotes: 1

NPC

Reputation: 90

Please try this:

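This matches any run of digits that is immediately followed or preceded by a space, and replaces each match with a single space.

text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
re.sub(r"\d+ | \d+", " ", text)

O/P: 'I wrote  scripts in #code100day challenge in day'

Note the double space after 'wrote': ' 16' is replaced with one space, and the space that followed it is kept.

You can also write it like this, which gives the same result for this input:

re.sub(r"\d+\s|\s\d+", " ", text)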
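A quick check of what this pattern does, as a sketch. It only removes digit runs that touch whitespace, which happens to fit the sample tweet. The second and third inputs are made-up examples, not from the question, and suggest the pattern may not generalize: digits at the end of a hashtag are removed, and digits in the middle of a word are kept.

```python
import re

# Same pattern as above, compiled once as a raw string.
pattern = re.compile(r"\d+\s|\s\d+")

sample = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
out = pattern.sub(" ", sample)

# Hypothetical inputs chosen to probe the edges of the pattern.
hashtag_edge = pattern.sub(" ", "see #top10 list")  # '10 ' touches a space
inner_digits = pattern.sub(" ", "abc12def")         # no adjacent whitespace

print(repr(out))
print(repr(hashtag_edge))
print(repr(inner_digits))
```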

Upvotes: 0
