liamt12three

Reputation: 57

Faster way to use regex to extract hashtags from tweets

I have a pandas dataframe with details of 1 million tweets, including the tweet text itself and various other attributes. I am trying to extract a list of hashtags from each tweet. It is important that each list stays associated with its tweet, rather than ending up with one list of the hashtags across all tweets.

With this many tweets, my current approach will take hours or days to run. Is there an alternative to using iterrows over my pandas dataframe, which I have already tried?

def extracthash(x):
    for index, row in tweets_scored.iterrows():
        tweets_scored.loc[:, "Hashtags"] = tweets_scored.text.str.find(r'#.*?(?=\s|$)')
    return tweets_scored

tweets_scored.apply(extracthash, axis=1)

This is what I am aiming for, and the code does work if I take a subset of only a small number of rows from my dataframe.

text                          hashtag list
I like #cheese and #flour     [#cheese, #flour]
He eats #bread                [#bread]

Any help sincerely appreciated! Thanks

Upvotes: 1

Views: 914

Answers (1)

Stein

Reputation: 809

I use this small loop in a similar situation (NLP on tweets) to extract the hashtags and the @ references from a tweet. It is fast and simple:

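import re

# Assumption: `tweets` is an iterable of individual tokens (words), not
# whole tweets, because both patterns are anchored to the start of the item.
tHash = []  # items that start with '#'
tAt = []    # items that start with '@'
for item in tweets:
    if re.search('^@.*', item):
        tAt.append(item)

    if re.search('^#.*', item):
        tHash.append(item)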
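
To keep one hashtag list per tweet, as the question asks, the same idea can be applied to each tweet's full text with `re.findall`. This is only a minimal sketch: the sample `texts` list and the helper name `extract_hashtags` are made up for illustration, and the pattern `#\w+` is an assumption about what counts as a hashtag (it stops at punctuation, unlike the pattern in the question).

```python
import re

# Hypothetical sample data standing in for the `text` column in the question.
texts = ["I like #cheese and #flour", "He eats #bread", "no tags here"]

# '#' followed by one or more word characters; compiled once and reused.
HASHTAG = re.compile(r"#\w+")

def extract_hashtags(text):
    """Return the hashtags found in a single tweet, in order of appearance."""
    return HASHTAG.findall(text)

# One list per tweet, so each list stays associated with its own tweet.
hashtags = [extract_hashtags(t) for t in texts]
print(hashtags)
```

With pandas, the same pattern can be applied to the whole column in one call, for example `tweets_scored["Hashtags"] = tweets_scored["text"].str.findall(r"#\w+")`, which avoids both `iterrows` and `apply(axis=1)`.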

Upvotes: 1
