DaniFoldi

Reputation: 451

How to match URLs with a Python regular expression?

My problem is that I want to match URLs in HTML code that look like href='example.com' (or the same with double quotes), but I only want to extract the actual URL. I tried matching the whole attribute and then using array magic to get only the URL, but since the regex match is greedy, if there is more than one valid match, I also get many longer matches that start at one ' and end at another URL's '. What regex will suit my needs?

Upvotes: 2

Views: 2303

Answers (2)

PixelEinstein

Reputation: 1723

I would recommend NOT using regex to parse HTML. Your life will be much easier if you use something like Beautiful Soup!

It's as easy as this:

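from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4

HTML = """<a href="https://firstwebsite.com">firstone</a><a href="https://secondwebsite.com">Ihaveurls</a>"""

# Name the parser explicitly to avoid the "no parser specified" warning.
s = BeautifulSoup(HTML, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
for a in s.find_all('a', href=True):
    print("My URL: ", a['href'])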
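If installing a third-party package is not an option, the same parser-based idea can probably be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name HrefCollector is hypothetical, and the second link in the sample is switched to single quotes to show that both quote styles are handled.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Sample input (assumed): one double-quoted and one single-quoted href.
html = """<a href="https://firstwebsite.com">firstone</a><a href='https://secondwebsite.com'>Ihaveurls</a>"""

collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
```

The parser strips the quotes itself, so no quote-specific logic is needed.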

Upvotes: 3

sadiq shah

Reputation: 11

If you want to solve it with a regular expression instead of another Python library, here is a solution.

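import re

html = '<a href="https://www.abcde.com"></a>'
# [^"]* and [^']* stop at the first closing quote, so each match covers
# exactly one URL even when the HTML contains several links.
pattern = r'href="([^"]*)"|href=\'([^\']*)\''
multiple_match_links = re.findall(pattern, html)
if not multiple_match_links:
    print("No Link Found")
else:
    # Each match is a (double_quoted, single_quoted) tuple; one side is empty.
    print([double or single for double, single in multiple_match_links])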
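To illustrate the greedy-match problem described in the question, here is a minimal sketch comparing a greedy capture with a lazy one. The sample HTML and the variable names are assumptions made for the illustration. The lazy pattern uses a backreference (\1) so the closing quote has to be the same character as the opening one.

```python
import re

# Assumed sample: two double-quoted links and one single-quoted link on one line.
html = """<a href="https://first.example.com">one</a> <a href="https://second.example.com">two</a> <a href='https://third.example.com'>three</a>"""

# Greedy: .* runs to the LAST double quote on the line, swallowing two links.
greedy = re.findall(r'href="(.*)"', html)

# Lazy: .*? stops at the first quote that matches the opening one (group 1).
lazy = [m.group(2) for m in re.finditer(r'''href=(["'])(.*?)\1''', html)]

print(greedy)
print(lazy)
```

The greedy version returns a single string that spans from the first URL to the end of the second one, while the lazy version returns the three URLs separately.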

Upvotes: 1
