Reputation: 57
I have a pandas dataframe with details of 1 million tweets, including the tweet text and various other attributes. I am trying to extract a list of hashtags from each tweet. It is important that each list stays associated with its tweet, rather than being one list of the hashtags across all tweets.
With this many tweets, my current approach will take hours or days to run. Is there an alternative to iterating over the dataframe with iterrows, which is what I have tried so far?
def extracthash(x):
    for index, row in tweets_scored.iterrows():
        tweets_scored.loc[:, "Hashtags"] = tweets_scored.text.str.find(r'#.*?(?=\s|$)')
    return tweets_scored

tweets_scored.apply(extracthash, axis=1)
This is what I am aiming for, and the code does work if I take a subset of only a small number of rows from my dataframe.
text                         hashtag list
I like #cheese and #flour    [#cheese, #flour]
He eats #bread               [#bread]
Any help sincerely appreciated! Thanks
Upvotes: 1
Views: 914
Reputation: 809
I am using this little loop for a similar situation (NLP on tweets) to extract the hashtags and the @ references from a tweet. It is fast and simple:
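import re

tHash = []  # items that start with '#'
tAt = []    # items that start with '@'

# Assumption: `tweets` is an iterable of individual tokens (words),
# because the '^' anchor only matches at the start of each item.
for item in tweets:
    if re.search('^@.*', item):
        tAt.append(item)
    if re.search('^#.*', item):
        tHash.append(item)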
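To keep each hashtag list attached to its own tweet, as the question asks, one option that avoids iterrows entirely is a single vectorized call to pandas' Series.str.findall. The sketch below is only an illustration under some assumptions: the small tweets_scored frame is a made-up stand-in for the real data, and the pattern #\w+ is a guess at what should count as a hashtag. Note that str.find searches for a literal substring and returns an integer position, so it will not produce lists; str.findall takes a regular expression and returns a list per row.

```python
import pandas as pd

# Hypothetical stand-in for the asker's 1-million-row tweets_scored frame.
tweets_scored = pd.DataFrame(
    {"text": ["I like #cheese and #flour", "He eats #bread", "no tags here"]}
)

# One vectorized call over the whole column: every row gets its own list
# of matches, and rows without a hashtag get an empty list.
# Assumption: a hashtag is '#' followed by one or more word characters.
tweets_scored["Hashtags"] = tweets_scored["text"].str.findall(r"#\w+")

print(tweets_scored)
```

Because the regular expression is applied to the column in one pass, there is no need for an explicit Python loop or for apply with axis=1. If the hashtags in the real data can contain characters outside \w, the pattern would need adjusting.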
Upvotes: 1