How to match URLs with a Python regular expression?

Question

I want to match URLs in HTML code that look like href='example.com' (or the same with double quotes), but I only want to extract the actual URL. I tried matching the whole attribute and then slicing the result to keep only the URL, but because the regex match is greedy, when there is more than one valid match I also get many extra matches that start at one quote and end at the closing quote of a later URL. What regex will suit my needs?

PixelEinstein · Accepted Answer

I would recommend NOT using regex to parse HTML. Your life will be much easier if you use an HTML parser such as Beautiful Soup!

It's as easy as this:

from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4); find_all needs this version

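# Sample markup: the original tags were lost from this post, so these links are placeholders.
HTML = """<a href='example.com'>first</a> <a href="example.org">one</a> I have urls"""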

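s = BeautifulSoup(HTML, "html.parser")  # name the parser explicitly to avoid a warning

# href=True keeps only <a> tags that actually carry an href attribute
for link in s.find_all('a', href=True):
    print("My URL:", link['href'])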
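If you still need a regex for a quick one-off, the greedy problem described in the question can usually be avoided with a lazy quantifier plus a backreference to the opening quote. This is only a minimal sketch and not a robust HTML parser: the names HREF_RE and extract_hrefs and the sample string are made up for illustration, and it assumes every href value is quoted.

```python
import re

# Hypothetical pattern name. (["']) captures the opening quote,
# (.*?) lazily captures the URL so the match stops at the first closing quote,
# and \1 requires that closing quote to be the same character as the opening one.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def extract_hrefs(html):
    """Return the quoted href values found in html, in document order."""
    return [m.group(2) for m in HREF_RE.finditer(html)]


# Placeholder markup covering both quote styles from the question.
sample = """<a href='example.com'>first</a> <a href="example.org/page">second</a>"""
print(extract_hrefs(sample))  # ['example.com', 'example.org/page']
```

Unlike the parser approach, this will also match href= text inside comments or scripts and will miss unquoted attribute values, which is why the parser is the safer default.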
