Reputation: 65
I'm trying to remove @mentions, URLs and # symbols from Twitter data using Python. I want to get
lets take action! fitness health
from
@BBCNews lets take action! #fitness #health https://www.url.com
Code:
import re
df1 = re.sub(r'(?:\@|https?\://|#)\S+', '', df)
But this produces "lets take action! ", which drops the hashtag words as well. I'm having a hard time fixing my regex, but I think I'm close. How can I fix it?
Upvotes: 2
Views: 1079
Reputation: 402513
Your pattern is incorrect because the \S+ also applies to the # alternative, so the characters following each # are removed along with it. Instead, change your pattern to:
>>> re.sub(r'(@|https?)\S+|#', '', text)
' lets take action! fitness health '
Regex Breakdown
(@        # match '@'
|         # OR
https?    # "http" or "https", followed by...
)
\S+       # one or more characters that aren't whitespace
|         # OR
#         # a literal '#' (only the symbol, so the tag text is kept)
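The breakdown above can also be kept inside the pattern itself with re.VERBOSE. This is only a sketch: clean_tweet and TWEET_NOISE are hypothetical names, the tweets are assumed to be plain strings (re.sub works on one string at a time), and the whitespace tidy-up at the end is an extra step that is not part of the original pattern.

```python
import re

# Same pattern as above, written in verbose mode so the comments live in the regex.
# In verbose mode a bare '#' starts a comment, so the literal is written as [#].
TWEET_NOISE = re.compile(
    r"""
    (?:@|https?)   # '@', 'http' or 'https', followed by...
    \S+            # one or more non-whitespace characters
    |              # OR
    [#]            # a literal '#' (the tag text itself is kept)
    """,
    re.VERBOSE,
)


def clean_tweet(text):
    """Hypothetical helper: strip mentions, URLs and '#', then tidy whitespace."""
    stripped = TWEET_NOISE.sub("", text)
    # Collapse the leftover runs of spaces and trim both ends.
    return " ".join(stripped.split())


tweets = ["@BBCNews lets take action! #fitness #health https://www.url.com"]
cleaned = [clean_tweet(t) for t in tweets]
print(cleaned)  # ['lets take action! fitness health']
```

Applying the function per string, as in the list comprehension, avoids passing a whole collection to re.sub.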
As a bonus, the third-party tweet-preprocessor module (imported as preprocessor) provides most of this functionality out of the box, with optional customisations.
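import preprocessor as p
text = '@BBCNews lets take action! #fitness #health https://www.url.com'
# default options also strip the hashtags entirely
p.clean(text)
# 'lets take action!'
# customise what you want removed
p.set_options(p.OPT.MENTION, p.OPT.URL)
p.clean(text)
# 'lets take action! #fitness #health'
# drop just the '#' symbols afterwards
p.clean(text).replace('#', '')
# 'lets take action! fitness health'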
Upvotes: 4