Ioio

Reputation: 65

Remove @mentions, URLs and # symbols using Python

I'm trying to remove @mentions, URLs and # symbols from Twitter data using Python. I want to get

lets take action! fitness health 

from

@BBCNews lets take action! #fitness #health https://www.url.com

Code:

import re
df1 = re.sub(r'(?:\@|https?\://|#)\S+', '', df)

But this produces "lets take action! ", which drops the hashtag words as well. I think I'm close, but I'm having a hard time fixing my regex. How can I fix it?

Upvotes: 2

Views: 1079

Answers (1)

cs95

Reputation: 402513

Your pattern removes too much because the \S+ applies to the # alternative as well, so the whole hashtag word is deleted rather than just the # symbol. Move the # outside the group so that it is matched on its own:

>>> re.sub(r'(@|https?)\S+|#', '', text)
' lets take action! fitness health '

Regex Breakdown

(@       # match '@'
 |       # OR
 https?  # "http" or "https", followed by...
)
\S+      # one or more characters that aren't whitespace
|        # OR
#        # hashtag
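
Note that the result above still has a leading and a trailing space. As a minimal sketch (the names TWEET_NOISE and clean_tweet are made up for illustration, not part of any library), the same pattern can be written with re.VERBOSE so the breakdown lives next to the regex, and the leftover whitespace can be normalised afterwards. The group is made non-capturing here, which does not change what is matched.

```python
import re

# Verbose form of the pattern above. The '#' must be escaped because
# re.VERBOSE treats a bare '#' as the start of a comment.
TWEET_NOISE = re.compile(
    r"""
    (?:@|https?)\S+   # '@' or 'http'/'https' plus the rest of the token
    |                 # OR
    \#                # a lone hashtag symbol (the word after it is kept)
    """,
    re.VERBOSE,
)


def clean_tweet(text):
    """Remove mentions, URLs and '#' symbols, then normalise whitespace."""
    stripped = TWEET_NOISE.sub("", text)
    # split() with no arguments drops leading, trailing and repeated whitespace
    return " ".join(stripped.split())


text = "@BBCNews lets take action! #fitness #health https://www.url.com"
print(clean_tweet(text))
# lets take action! fitness health
```

If only the outer spaces matter, calling .strip() on the re.sub result would be enough; the split/join step also collapses any double spaces left behind in the middle of the text.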

As a bonus, the third-party tweet-preprocessor module (imported as preprocessor) provides most of this functionality out of the box, with optional customisations.

import preprocessor as p

p.clean(text)
# 'lets take action!'

# customise what you want removed
p.set_options(p.OPT.MENTION, p.OPT.URL)
p.clean(text)
# 'lets take action! #fitness #health'

p.clean(text).replace('#', '')
# 'lets take action! fitness health'

Upvotes: 4
