kmd0193

Reputation: 21

Why does this regex return an empty list?

New programmer here. I am trying to get all of the hashtags and links from a string. Each regular expression returns the desired result on its own; however, an empty list is returned when they are combined. How can I fix this?

import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
     'for Repeated API Calls - http://pybit.es/requests-cache.html '
     '#python #APIs')


# Get all hashtags and links from tweet
def get_hashtags_and_links(tweet=tweet):
    tweet_regex = re.compile(r'''(
                         \(#\w+\)
                         \(https://[^\s]+\)
                         )''', re.VERBOSE)

    tweet_object = tweet_regex.findall(tweet)
    print(tweet_object)

get_hashtags_and_links()

Upvotes: 0

Views: 126

Answers (3)

Wiktor Stribiżew

Reputation: 626845

Whatever you intended to match, you need to escape the # character, because it is special when the regex is compiled with the re.X / re.VERBOSE flag. That flag enables comments inside the pattern: a comment starts at an unescaped hash symbol and runs to the end of the line. From the re documentation:

When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
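A minimal sketch of that rule, reusing the tweet from the question. The variable names unescaped, escaped and in_class are made up for illustration; only re.compile, re.VERBOSE and findall come from the re module.

```python
import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

# Unescaped '#': under re.VERBOSE everything from '#' to the end of the
# line is a comment, so this pattern is effectively empty.
unescaped = re.compile(r'''
    #\w+
    ''', re.VERBOSE)

# Escaped '\#': the hash is a literal character again.
escaped = re.compile(r'''
    \#\w+
    ''', re.VERBOSE)

# A '#' inside a character class is also treated as a literal.
in_class = re.compile(r'''
    [#]\w+
    ''', re.VERBOSE)

print(escaped.findall(tweet))          # ['#python', '#APIs']
print(in_class.findall(tweet))         # ['#python', '#APIs']

# The empty pattern matches the empty string at every position.
print(set(unescaped.findall(tweet)))   # {''}
```

The same thing happens in the question's pattern: the # after \( starts a comment, so the hashtag part never reaches the regex engine.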

So, assuming you want to match either hashtags or http/https URLs, you may use:

tweet_regex = re.compile(r'''
                     \#\w+             # Hashtag pattern
                     |                 # or
                     https?://\S+      # URLs
                     ''', re.VERBOSE)

Calling tweet_regex.findall(tweet) on the tweet from the question gives:

['http://pybit.es/requests-cache.html', '#python', '#APIs']

Upvotes: 0

Ashwiniku918

Reputation: 281

You can use the following regex:

    http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")

(#[A-Za-z0-9]+) --- matches a hashtag: a # followed by one or more ASCII letters or digits

(\w+:\/\/\S+) --- matches a URL in the tweet: a scheme, then ://, then everything up to the next whitespace
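One thing worth knowing about this pattern: because it has two capturing groups, findall returns one tuple per match, with an empty string for the group that did not take part. A short sketch, assuming the tweet from the question; the names pairs and flat are illustrative only.

```python
import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

http_hash_search = re.compile(r"(\w+:\/\/\S+)|(#[A-Za-z0-9]+)")

# Two capturing groups -> findall yields (url, hashtag) tuples.
pairs = http_hash_search.findall(tweet)
print(pairs)
# [('http://pybit.es/requests-cache.html', ''), ('', '#python'), ('', '#APIs')]

# Keep whichever group matched to get a flat list of strings.
flat = [url or tag for url, tag in pairs]
print(flat)
# ['http://pybit.es/requests-cache.html', '#python', '#APIs']
```

Iterating with finditer and taking m.group(0) would give the same flat list without unpacking tuples.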

Upvotes: 0

Joran Beasley

Reputation: 113988

You are looking for #\w+ (enclosed in literal parentheses) immediately followed by https://[^\s]+ (also enclosed in literal parentheses), which appears nowhere in your text.

Instead, use | (the "or" bar) so that either one can match:

re.compile(r'''(
            \(\#\w+\)|
            \(https://[^\s]+\)
                     )''', re.VERBOSE)

But, as pointed out, \( matches an actual parenthesis character (it is not grouping).

So you probably just want:

"(#\w+)|(https?://[^\s]+)"

You can also use non-capturing groups ((?:...)) if you want:

"((?:#\w+)|(?:https?://[^\s]+))"
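As a rough check of the non-capturing version against the tweet from the question: with a single capturing group, findall returns a plain list of strings. The names one_group and no_group are hypothetical, and the second pattern is an assumed equivalent with the groups dropped entirely.

```python
import re

tweet = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')

# One outer capturing group wrapping two non-capturing alternatives.
one_group = re.compile(r"((?:#\w+)|(?:https?://[^\s]+))")

# Same alternation with no groups at all (no re.VERBOSE here, so the
# unescaped '#' is an ordinary character).
no_group = re.compile(r"#\w+|https?://[^\s]+")

print(one_group.findall(tweet))
# ['http://pybit.es/requests-cache.html', '#python', '#APIs']
print(no_group.findall(tweet))
# ['http://pybit.es/requests-cache.html', '#python', '#APIs']
```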

Upvotes: 2
