Reputation: 451
My problem is that I want to match URLs in HTML code that look like href='example.com' (or the same with " as the quote character), but I only want to extract the actual URL. I tried matching the whole attribute and then using array magic to keep only the URL, but since the regex match is greedy, when there is more than one match I get many extra matches that start at one ' and end at another URL's '. What regex will suit my needs?
Upvotes: 2
Views: 2303
Reputation: 1723
I would recommend NOT using regex to parse HTML. Your life will be much easier if you use something like BeautifulSoup! It's as easy as this:
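# find_all() is the BeautifulSoup 4 API, which is imported from the bs4 package
from bs4 import BeautifulSoup
HTML = """<a href="https://firstwebsite.com">firstone</a><a href="https://secondwebsite.com">Ihaveurls</a>"""
# Name the parser explicitly so the result does not depend on which parsers are installed
s = BeautifulSoup(HTML, "html.parser")
# href=True keeps only the <a> tags that actually carry an href attribute
for link in s.find_all('a', href=True):
    print("My URL: ", link['href'])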
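If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This swaps BeautifulSoup for a different parser; the class name HrefCollector and the function extract_hrefs are hypothetical helpers made up for this example, and the sample markup is only an assumption about what the input looks like.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values found in <a> tags, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Assumed sample input mixing double and single quotes
sample = """<a href="https://firstwebsite.com">firstone</a><a href='https://secondwebsite.com'>Ihaveurls</a>"""
print(extract_hrefs(sample))
```

The parser deals with both quote styles itself, so no quote handling is needed in your own code.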
Upvotes: 3
Reputation: 11
If you want to solve it with a regular expression instead of another Python library, here is a solution.
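import re
html = '<a href="https://www.abcde.com"></a>'
# [^"]* and [^']* stop at the closing quote, so a match cannot run on into the next attribute
pattern = r'href="([^"]*)"|href=\'([^\']*)\''
multiple_match_links = re.findall(pattern, html)
if not multiple_match_links:
    print("No Link Found")
else:
    # Each result is a (double-quoted, single-quoted) pair; only one of the two is non-empty
    print([dq or sq for dq, sq in multiple_match_links])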
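A variant that addresses the greediness problem from the question directly: capture the opening quote, match lazily with .*?, and use a backreference so the match ends at the first matching closing quote. This is only a sketch; the names HREF_RE and find_hrefs and the example.com-style URLs are made up for illustration, and it will still be fooled by unusual markup such as unquoted attributes.

```python
import re

# (["']) captures the opening quote, (.*?) lazily captures the URL,
# and \1 requires the same quote character to close the attribute.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def find_hrefs(html):
    """Return every quoted href value in the string, in order of appearance."""
    return [m.group(2) for m in HREF_RE.finditer(html)]


# Assumed sample with two links, one per quote style
sample = '<a href="https://one.example">1</a><a href=\'https://two.example\'>2</a>'
print(find_hrefs(sample))
```

Because the quantifier is lazy, two links on the same line come back as two separate URLs instead of one long match spanning both.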
Upvotes: 1