Reputation: 527
Python noob here. (full disclosure)
I've got a list of Tweets that is formatted as a list of strings, like so:
["This is a string that needs processing #ugh #yikes",
"this string doesn't have hashtags",
"this is another one #hooray"]
I'm trying to write a function that will create a list of the hashtags in each line, but leave an empty entry when a line has no hashtags. This is because I want to join this list with the tweets themselves later. This is my desired output:
['#ugh', '#yikes'], [], ['#hooray']
This function which I found here works fine for ONE string.
mystring = "I love #stackoverflow because #people are very #helpful!"
But it doesn't seem to work for several strings. This is my code:
l = len(mystringlist)
it = iter(mystringlist)
taglist = []
def extract_tags(it,l):
    for item in mystringlist:
        output = list([re.sub(r"(\W+)$", "", j) for j in list([i for i in
                       item.split() if i.startswith("#")])])
        taglist.append(output)
multioutput = extract_tags(mystringlist,l)
print(multioutput)
Upvotes: 0
Views: 2322
Reputation: 18385
This could be considered unreadable or overkill for the task at hand, but avoids using regular expressions and should therefore be somewhat faster:
>>> def hashtags(tweet):
...     return list(filter(lambda token: token.startswith('#'), tweet.split()))
>>> [hashtags(tweet) for tweet in tweets]
[['#ugh', '#yikes'], [], ['#hooray']]
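One caveat with splitting on whitespace: trailing punctuation stays attached to the token, so "#helpful!" would come back with the exclamation mark. Here is a minimal sketch of a variant that strips it, assuming trailing punctuation is never meant to be part of a tag. The `hashtags_clean` name and the `tweets` sample are mine, purely for illustration:

```python
import string

# Sample data (illustrative, adapted from the strings in the question).
tweets = [
    "This is a string that needs processing #ugh #yikes",
    "this string doesn't have hashtags",
    "I love #stackoverflow because #people are very #helpful!",
]

def hashtags_clean(tweet):
    # Keep whitespace-separated tokens that start with '#', then strip
    # trailing punctuation such as '!' or ','.
    cleaned = (token.rstrip(string.punctuation)
               for token in tweet.split()
               if token.startswith('#'))
    # A bare '#' is itself punctuation and strips down to '', so drop
    # anything that has no characters left after the '#'.
    return [tag for tag in cleaned if len(tag) > 1]

result = [hashtags_clean(tweet) for tweet in tweets]
print(result)
# [['#ugh', '#yikes'], [], ['#stackoverflow', '#people', '#helpful']]
```

Note that this also strips a trailing underscore, since `string.punctuation` includes `_`.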
Upvotes: 1
Reputation: 51155
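You can use a regular expression and re.findall. The pattern #\w+ matches a '#' followed by one or more word characters. For ASCII text, \w is equivalent to [a-zA-Z0-9_]; in Python 3 it also matches Unicode letters and digits unless the re.ASCII flag is set.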
x = ["This is a string that needs processing #ugh #yikes",
"this string doesn't have hashtags",
"this is another one #hooray"]
import re
hashtags = [re.findall(r'#\w+', i) for i in x]
print(hashtags)
Output:
[['#ugh', '#yikes'], [], ['#hooray']]
If the regular expression does not match anything, an empty list will be returned, as is expected in your desired output.
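Because there is exactly one result per input string, the list lines up with the original tweets, which should make the later join straightforward. A rough sketch using `zip`, assuming you just want (tweet, tags) pairs; the `tweets` and `paired` names are only for illustration:

```python
import re

tweets = [
    "This is a string that needs processing #ugh #yikes",
    "this string doesn't have hashtags",
    "this is another one #hooray",
]

# One list of hashtags per tweet, empty when nothing matches.
hashtags = [re.findall(r'#\w+', tweet) for tweet in tweets]

# Both lists have the same length and order, so zip pairs them up.
paired = list(zip(tweets, hashtags))

for tweet, tags in paired:
    print(tags, '<-', tweet)
```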
If there is the possibility of your text containing URLs, something like www.mysite.com/#/dashboard, you could use a pattern that ensures the hashtag is only found following whitespace or at the start of a line.
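A minimal sketch of one pattern that enforces this, assuming a negative lookbehind is acceptable for your case. The sample strings and the `pattern` name are illustrative, and other patterns would work too:

```python
import re

# (?<!\S) is a negative lookbehind: the '#' must not be preceded by a
# non-whitespace character, so it has to sit at the very start of the
# text or directly after whitespace (including a newline).
pattern = re.compile(r'(?<!\S)#\w+')

samples = [
    "docs at www.mysite.com/page#section are #useful",
    "#first word is a tag",
    "no tags here",
]

found = [pattern.findall(s) for s in samples]
print(found)
# [['#useful'], ['#first'], []]
```

The `#section` fragment is skipped because it follows a letter, while `#useful` and `#first` are kept.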
Upvotes: 2