Reputation: 37
I'm processing some twitter texts, and I want to remove all numbers in a tweet except those that appear in hashtags. For example,
'I wrote 16 scripts in #code100day challenge2019 in 10day'
should become
'I wrote scripts in #code100day challenge in day'
Note that numbers not separated from alphabetic characters should also be removed (i.e. 'challenge2019' --> 'challenge', '10day' --> 'day').
I tried:
text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
text = re.sub(r"^(?!#)\d+", "", text)
But it does not do anything to the input string.
I also tried a negative lookbehind, to remove all digits except those following the '#' symbol:
text = re.sub(r"(?<!#)\d+", "", text)
But now it removes all the digits, whether they are in a hashtag or not:
'I wrote scripts in #codeday challenge in day'
Any suggestions?
Upvotes: 2
Views: 488
Reputation: 371208
One option is to match # followed by non-space characters (and, if matched, replace with the whole match, effectively leaving the hashtag alone), or match digit characters and remove them:
output = re.sub(
    r'#\S+|\d+',
    lambda match: match.group(0) if match.group(0).startswith('#') else '',
    txt
)
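For reference, the call above can be wrapped into a small self-contained script. This is only a sketch: the function name `strip_digits_outside_hashtags` and the helper `keep_hashtag` are made up for illustration, and the input is the sample string from the question.

```python
import re

def strip_digits_outside_hashtags(text):
    """Remove digit runs everywhere except inside '#...' tokens."""
    def keep_hashtag(match):
        token = match.group(0)
        # A hashtag match is returned unchanged; a digit-run match is dropped.
        return token if token.startswith('#') else ''
    # '#\S+' is tried first at each position, so it consumes a whole hashtag
    # before '\d+' gets a chance to match the digits inside it.
    return re.sub(r'#\S+|\d+', keep_hashtag, text)

sample = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
result = strip_digits_outside_hashtags(sample)
print(repr(result))
```

Removing '16' leaves two consecutive spaces after 'wrote'. If that matters, something like `' '.join(result.split())` collapses them.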
If you can use the third-party regex module (installed separately from the standard library's re), you can use (*SKIP)(*FAIL) after matching hashtags instead, to effectively skip them if matched:
output = regex.sub(r'#\S+(*SKIP)(*FAIL)|\d+', '', txt)
Upvotes: 1
Reputation: 27743
My guess is that using an alternation would likely be faster and simpler than lookarounds:
import re
test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "
print(re.sub(r"^\s+|\s+$","",re.sub(r"\s{2,}"," ",re.sub(r"(#\S+)|(\d+)", "\\1", test_str))))
I wrote scripts in #code100day challenge in day
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
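The nested one-liner above may be easier to follow when unrolled into its three steps. The following is a sketch of the same idea, not a different method: the function name `remove_numbers_keep_hashtags` is hypothetical, `str.strip()` stands in for the `^\s+|\s+$` substitution, and it assumes Python 3.5 or later, where an unmatched group in the replacement is treated as an empty string.

```python
import re

def remove_numbers_keep_hashtags(text):
    # Step 1: hashtags are captured in group 1 and written back via \1;
    # bare digit runs match the second alternative, where group 1 is empty.
    stripped = re.sub(r"(#\S+)|\d+", r"\1", text)
    # Step 2: collapse runs of whitespace left behind by removed numbers.
    collapsed = re.sub(r"\s{2,}", " ", stripped)
    # Step 3: trim leading and trailing whitespace.
    return collapsed.strip()

test_str = "10 I wrote 16 scripts in #code100day challenge2019 in 10day 100 "
cleaned = remove_numbers_keep_hashtags(test_str)
print(cleaned)
```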
Upvotes: 1
Reputation: 90
Please try this. It just checks for digits with a space before or after them and replaces the match with a single space:
text = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
re.sub(r"\d+ | \d+", " ", text)
Output: 'I wrote scripts in #code100day challenge in day'
You can also write it like this, which gives the same result for this input:
re.sub(r"\d+\s|\s\d+", " ", text)
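Because this pattern relies on whitespace next to the digits rather than on the '#' itself, it may be worth checking it against a few other inputs. The sketch below does that. The two extra strings are invented for illustration and do not come from the question.

```python
import re

space_based = r"\d+\s|\s\d+"

# The question's sample: digits next to a space are removed
# (a doubled space is left after "wrote").
sample = 'I wrote 16 scripts in #code100day challenge2019 in 10day'
sample_out = re.sub(space_based, " ", sample)

# Digits in the middle of a word have no whitespace on either side,
# so they are left in place.
mid_word_out = re.sub(space_based, " ", 'see abc123def here')

# Digits at the end of a hashtag are followed by a space,
# so they are removed along with it.
hashtag_out = re.sub(space_based, " ", 'happy #newyear2019 everyone')

print(repr(sample_out))
print(repr(mid_word_out))
print(repr(hashtag_out))
```

If hashtags can end in digits or numbers can sit inside words, an approach that matches the whole hashtag first is likely the safer choice.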
Upvotes: 0