lespaul

Reputation: 527

Extracting hashtags from each string in a list of strings in Python

Python noob here. (full disclosure)

I've got a list of Tweets that is formatted as a list of strings, like so:

["This is a string that needs processing #ugh #yikes",
"this string doesn't have hashtags",
"this is another one #hooray"]

I'm trying to write a function that builds a list of the hashtags in each line, leaving an empty entry for any line that has none. I need this because I want to join the list with the tweets themselves later. This is my desired output:

['#ugh', '#yikes'], [], ['#hooray']

This function which I found here works fine for ONE string.

 mystring = "I love #stackoverflow because #people are very #helpful!"

But it doesn't seem to work for several strings. This is my code:

 l = len(mystringlist)
 it = iter(mystringlist)

 taglist = []

 def extract_tags(it,l):
      for item in mystringlist:
         output = list([re.sub(r"(\W+)$", "", j) for j in list([i for i in 
         item.split() if i.startswith("#")])])
    taglist.append(output)

 multioutput = extract_tags(mystringlist,l)

 print(multioutput)

Upvotes: 0

Views: 2322

Answers (2)

Tim McNamara

Reputation: 18385

This could be considered unreadable or overkill for the task at hand, but avoids using regular expressions and should therefore be somewhat faster:

>>> def hashtags(tweet):
...     return list(filter(lambda token: token.startswith('#'), tweet.split()))
...

>>> [hashtags(tweet) for tweet in tweets]
[['#ugh', '#yikes'], [], ['#hooray']]
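
One caveat with a plain split: tokens keep any trailing punctuation, so "#helpful!" from the question's single-string example comes back as '#helpful!'. Below is a minimal sketch that strips it without regular expressions. It assumes trailing ASCII punctuation is never part of a tag, and hashtags_stripped is a hypothetical name, not something from the code above:

```python
import string

def hashtags_stripped(tweet):
    # Keep whitespace-separated tokens that start with '#',
    # then drop trailing ASCII punctuation such as '!' or ','.
    tags = (token.rstrip(string.punctuation)
            for token in tweet.split()
            if token.startswith('#'))
    # A bare '#' strips down to an empty string, so filter those out.
    return [tag for tag in tags if tag]

tweets = ["This is a string that needs processing #ugh #yikes",
          "this string doesn't have hashtags",
          "I love #stackoverflow because #people are very #helpful!"]

print([hashtags_stripped(tweet) for tweet in tweets])
```

Note that this would also trim a trailing underscore, which may or may not be what you want.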

Upvotes: 1

user3483203

Reputation: 51155

You can use a regular expression and re.findall.

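#\w+ matches a # followed by one or more word characters. For ASCII text, \w is equivalent to [a-zA-Z0-9_]; in Python 3 it also matches Unicode letters and digits unless the re.ASCII flag is set.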

x = ["This is a string that needs processing #ugh #yikes",
"this string doesn't have hashtags",
"this is another one #hooray"]

import re

# Raw string so that \w reaches the regex engine as written.
hashtags = [re.findall(r'#\w+', i) for i in x]
print(hashtags)

Output:

[['#ugh', '#yikes'], [], ['#hooray']]

If the regular expression does not match anything, an empty list will be returned, as is expected in your desired output.
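
Because the result has exactly one entry per tweet, in the same order, it lines up with the original list. Here is a minimal sketch of pairing the two with zip, assuming that "join" in the question means keeping each tweet next to its tags (paired is a hypothetical name):

```python
import re

x = ["This is a string that needs processing #ugh #yikes",
     "this string doesn't have hashtags",
     "this is another one #hooray"]

hashtags = [re.findall(r'#\w+', i) for i in x]

# zip walks both lists in step, so each tweet stays with its own tags,
# including the empty list for tweets without hashtags.
paired = list(zip(x, hashtags))

for tweet, tags in paired:
    print(tags, '<-', tweet)
```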

If your text might contain URLs, something like www.mysite.com/#/dashboard, you could use:

(?:^|\s)(#\w+)

This ensures that a hashtag is only found at the start of the string or following whitespace. Note that the ^ has to sit outside a character class: inside [...] it is a literal caret, not an anchor.
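
Here is a minimal sketch of that check with re.findall, assuming the pattern is written as the alternation (?:^|\s)(#\w+) so that ^ acts as an anchor. The sample strings and the name TAG_PATTERN are made up for illustration:

```python
import re

# Non-capturing group: start of string OR one whitespace character.
# The single capturing group holds the tag, so findall returns only the tags.
TAG_PATTERN = re.compile(r'(?:^|\s)(#\w+)')

texts = ["#start of the line and a link www.mysite.com/#/dashboard",
         "see www.mysite.com/page#section for #details"]

print([TAG_PATTERN.findall(text) for text in texts])
```

With a plain #\w+, the second string would also yield '#section' from the URL fragment.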

Upvotes: 2
