Reputation: 21
New programmer here. I am trying to get all of the hashtags and links from a string. Each regular expression returns the desired result on its own; however, an empty list is returned when they are combined. How can I fix this?
import re
tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')
# Get all hashtags and links from tweet
def get_hashtags_and_links(tweet=tweet):
    tweet_regex = re.compile(r'''(
        \(#\w+\)
        \(https://[^\s]+\)
        )''', re.VERBOSE)
    tweet_object = tweet_regex.findall(tweet)
    print(tweet_object)
get_hashtags_and_links()
Upvotes: 0
Views: 126
Reputation: 626845
Whatever you want to match with your regex, make sure you escape the # character, which is special when you compile the regex with the re.X / re.VERBOSE flag. This flag enables comments inside the regex pattern: a comment starts at an unescaped hash symbol and runs to the end of the line.
When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
So, assuming you want to match either hashtags or URLs, you may use:
tweet_regex = re.compile(r'''
\#\w+ # Hashtag pattern
| # or
https?://\S+ # URLs
''', re.VERBOSE)
See the Python code demo, output:
['http://pybit.es/requests-cache.html', '#python', '#APIs']
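To see the comment behaviour in isolation, here is a minimal sketch. The names broken, escaped, in_class and sample are made up for illustration. It compares an unescaped # with the two ways of keeping it literal under re.VERBOSE: a backslash escape, or a character class.

```python
import re

# Unescaped '#': under re.VERBOSE the rest of the line is a comment,
# so the effective pattern here is empty.
broken = re.compile(r'''#\w+''', re.VERBOSE)

# Two ways to keep '#' literal in verbose mode.
escaped = re.compile(r'''\#\w+''', re.VERBOSE)    # backslash escape
in_class = re.compile(r'''[#]\w+''', re.VERBOSE)  # inside a character class

sample = '#python'
print(broken.fullmatch(sample))            # None
print(escaped.fullmatch(sample).group())   # #python
print(in_class.fullmatch(sample).group())  # #python
```

The empty pattern can only fully match an empty string, which is why the first call returns None.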
Upvotes: 0
Reputation: 281
You can use the regex as follows:
http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")
(#[A-Za-z0-9]+) --- matches a # followed by one or more letters or digits
(\w+:\/\/\S+) --- matches URLs in the tweet: a scheme, then ://, then any run of non-whitespace characters
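One thing to be aware of with this pattern: it has two capturing groups, so findall returns a tuple per match, with an empty string for the group that did not take part. A rough sketch of how that looks on the tweet from the question, and one way to flatten it with finditer (pairs and flat are illustrative names):

```python
import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")

# With two capturing groups, findall yields (url, hashtag) tuples.
pairs = http_hash_search.findall(tweet)
print(pairs)
# [('http://pybit.es/requests-cache.html', ''), ('', '#python'), ('', '#APIs')]

# group(0) is the whole match, whichever alternative produced it.
flat = [m.group(0) for m in http_hash_search.finditer(tweet)]
print(flat)
# ['http://pybit.es/requests-cache.html', '#python', '#APIs']
```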
Upvotes: 0
Reputation: 113988
You are looking for #\w+ (enclosed in literal parentheses) immediately followed by https://[^\s]+ (also enclosed in literal parentheses), which appears nowhere in your text. Instead, use the | (or) operator:
re.compile(r'''(
    \(\#\w+\)|
    \(https://[^\s]+\)
    )''', re.VERBOSE)
But, as pointed out, \( matches an actual parenthesis character (it is not grouping), so you probably just want:
"(#\w+)|(https?://[^\s]+)"
You can use non-capturing groups, (?:...), as well if you want:
"((?:#\w+)|(?:https?://[^\s]+))"
Upvotes: 2
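To illustrate the point in the last answer about escaped versus grouping parentheses, here is a small sketch. The sample strings and the example.com URL are made up for the demonstration, and the variable names are illustrative only.

```python
import re

# Escaped parentheses match the characters '(' and ')' themselves.
literal = re.compile(r"\(#\w+\)")
# Unescaped parentheses only form a capturing group.
grouped = re.compile(r"(#\w+)")

text = "#python (#APIs)"
literal_hits = literal.findall(text)
grouped_hits = grouped.findall(text)
print(literal_hits)  # ['(#APIs)']
print(grouped_hits)  # ['#python', '#APIs']

# One outer capturing group with non-capturing alternatives inside:
# findall then returns plain strings rather than tuples.
combined = re.compile(r"((?:#\w+)|(?:https?://[^\s]+))")
combined_hits = combined.findall("see https://example.com #python")
print(combined_hits)  # ['https://example.com', '#python']
```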