purelyp93

Reputation: 65

Python Regex expression for extracting hashtags from text

I'm processing some tweets I mined during the election, and I need a way to extract hashtags from the tweet text while accounting for punctuation, non-ASCII characters, etc., and still retaining the hashtag in the output list.

For example, the original text from a tweet looks like:

I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn

and when turned into a string in Python (or even put into a code block on this site), the special characters near the end are changed, producing this:

"I'm with HER! #NeverTrump #DumpTrump #imwithherdY\xd8\xa7dY\xd8, @ Williamsburg, Brooklyn"

Now I would like to parse the string into a list like this:

['#NeverTrump','#DumpTrump', '#imwithher']

I'm currently using this expression, where str is the above string:

tokenizedTweet = re.findall(r'(?i)\#\w+', str, flags=re.UNICODE)

However, I'm getting this as output:

['#NeverTrump', '#DumpTrump', '#imwithherdY\xd8']

How would I account for 'dY\xd8' in my regex to exclude it? I'm also open to other solutions not involving regex.

Upvotes: 1

Views: 3639

Answers (1)

Lord_PedantenStein

Reputation: 500

Yah, about the solution not involving regex. ;)

# -*- coding: utf-8 -*-
import string

tweets = []

a = "I'm with HER! #NeverTrump #DumpTrump #imwithher🇺🇸 @ Williamsburg, Brooklyn"

# keep only printable ASCII characters, which drops the emoji
a = ''.join(filter(lambda x: x in string.printable, a))

print(a)

# collect every space-separated word that starts with '#'
for tweet in a.split(' '):
    if tweet.startswith('#'):
        # strip a trailing comma left over from the tweet text
        tweets.append(tweet.strip(','))

print(tweets)

and tada: ['#NeverTrump', '#DumpTrump', '#imwithher']
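
If you would rather stay with regex, the same idea can be sketched with an ASCII-only character class. This is only a sketch, assuming Python 3 and assuming your hashtags contain nothing but ASCII letters, digits and underscores; the names HASHTAG_RE and extract_hashtags are made up for illustration. It has to run on the correctly decoded tweet text. Once the emoji has already been garbled into something like 'dY\xd8', the 'd' and 'Y' are ordinary letters and no pattern can tell them apart from the hashtag.

```python
import re

# Assumption: a hashtag is '#' followed by ASCII letters, digits or underscores.
HASHTAG_RE = re.compile(r"#[A-Za-z0-9_]+")


def extract_hashtags(text):
    """Return every ASCII-only hashtag in text, with the leading '#' kept."""
    return HASHTAG_RE.findall(text)


# The two escapes at the end of '#imwithher' are the flag emoji from the tweet.
tweet = ("I'm with HER! #NeverTrump #DumpTrump "
         "#imwithher\U0001F1FA\U0001F1F8 @ Williamsburg, Brooklyn")

print(extract_hashtags(tweet))
```

Unlike the split-based loop, this also picks up hashtags that are glued to other text or followed by punctuation other than a comma.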

Upvotes: 3
