Python regex alternation

Question

I'm trying to find all links on a webpage of the form "http://something" or "https://something". I wrote a regex that works:

L = re.findall(r"http://[^/\"]+/|https://[^/\"]+/", site_str)

But is there a shorter way to write this? I'm repeating ://[^/\"]+/ twice, probably needlessly. I tried several variations, but none of them work:

L = re.findall(r"http|https(://[^/\"]+/)", site_str)
L = re.findall(r"(http|https)://[^/\"]+/", site_str)
L = re.findall(r"(http|https)(://[^/\"]+/)", site_str)

It's obvious I'm missing something here, or I just don't understand Python regexes well enough.

Martijn Pieters · Accepted Answer

You are using capturing groups, and .findall() changes its behaviour when the pattern contains them: it returns only the contents of the capturing groups instead of the whole match. Your regex can be simplified, but your versions with a group around http|https will work if you make the groups non-capturing instead:

L = re.findall(r"(?:http|https)://[^/\"]+/", site_str)
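To make the difference concrete, here is a small sketch that runs the three attempts from the question next to the non-capturing version. The `sample` string and the variable names are assumptions made up for illustration; they mirror the demo input used further down.

```python
import re

# Hypothetical sample input, modelled on the demo strings in this answer.
sample = '"http://someserver.com/" "https://anotherserver.com/with/path"'

# No parentheses around the alternation: the pattern means
# "http" OR "https(://...)". The "http" branch matches first each time,
# the group never takes part, and findall() reports '' for it.
attempt_1 = re.findall(r"http|https(://[^/\"]+/)", sample)

# One capturing group: findall() returns only that group's text.
attempt_2 = re.findall(r"(http|https)://[^/\"]+/", sample)

# Two capturing groups: findall() returns one tuple of group texts per match.
attempt_3 = re.findall(r"(http|https)(://[^/\"]+/)", sample)

# Non-capturing group: findall() returns the whole match.
fixed = re.findall(r"(?:http|https)://[^/\"]+/", sample)

print(attempt_1)  # ['', '']
print(attempt_2)  # ['http', 'https']
print(attempt_3)  # [('http', '://someserver.com/'), ('https', '://anotherserver.com/')]
print(fixed)      # ['http://someserver.com/', 'https://anotherserver.com/']
```

Note that the first attempt fails for a different reason than the other two: without parentheses, | splits the entire pattern, so the shared ://[^/\"]+/ suffix belongs only to the https branch.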

You don't need to escape the double quote if you use single quotes around the expression, and you only need to vary the s in the expression, so s? would work too:

L = re.findall(r'https?://[^/"]+/', site_str)

Demo:

>>> import re
>>> example = '''
... "http://someserver.com/"
... "https://anotherserver.com/with/path"
... '''
>>> re.findall(r'https?://[^/"]+/', example)
['http://someserver.com/', 'https://anotherserver.com/']
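If you do want to keep a capturing group (say, to read the scheme separately), one option is re.finditer(), which yields match objects so the whole match is still available as group(0). This is only a sketch of that alternative, not something the question requires; the names `pattern`, `urls` and `schemes` are assumptions for illustration.

```python
import re

example = '''
"http://someserver.com/"
"https://anotherserver.com/with/path"
'''

# The scheme is captured here, unlike in the findall() demo.
pattern = re.compile(r'(https?)://[^/"]+/')

# group(0) is always the whole match, regardless of capturing groups.
urls = [m.group(0) for m in pattern.finditer(example)]

# group(1) is the text captured by the first (and only) group.
schemes = [m.group(1) for m in pattern.finditer(example)]

print(urls)     # ['http://someserver.com/', 'https://anotherserver.com/']
print(schemes)  # ['http', 'https']
```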
