Reputation: 577
I'm trying to extract the hyperlinks from a webpage using regex in Python.
Suppose my text string is:
text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'
I want to extract ALL and ASSIGN, and I'm using this regular expression:
re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)
This returns only ASSIGN.
Can someone please help me find the mistake in the regular expression? I'm really new to this topic.
Upvotes: 1
Views: 872
Reputation: 1121824
You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.
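For what it's worth, the immediate cause of the single result is that `.*` is greedy: it consumes as much as it can, so the match stretches from the first `<a href=` to the last `>` that still leaves `(\w+)</a>` to match. A minimal sketch of the difference, using the string from the question verbatim; the variable names `greedy` and `lazy` are just for illustration, and the non-greedy pattern is a quick patch for this one string, not a robust way to handle HTML in general:

```python
import re

# The string exactly as given in the question.
text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

# Greedy `.*` runs up to the last `>` that still permits a match,
# so the whole string is consumed as one match.
greedy = re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)

# Non-greedy `.*?` stops at the first `>` that permits a match,
# so each link is matched separately.
lazy = re.findall(r'<a href=.*?>(\w+)</a>', text, re.DOTALL)

print(greedy)  # ['ASSIGN']
print(lazy)    # ['ALL', 'ASSIGN']
```

This still breaks as soon as the link text contains a space, nested tags, or anything else `\w+` does not cover.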
Please don't make it hard on yourself; use an HTML parser instead. Python has several to choose from:
ElementTree example:
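from xml.etree import ElementTree

# ElementTree only accepts well-formed markup (e.g. XHTML).
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('a') would only match direct children of the root.
for elem in tree.iter('a'):
    print(ElementTree.tostring(elem, encoding='unicode'))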
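Note that ElementTree raises a parse error on markup that is not well-formed, and the fragment in the question has stray `</td>` tags. As an alternative sketch, the standard library's `html.parser` is more forgiving. The class name `LinkCollector` and its attributes are my own, chosen for illustration, and this assumes you want the `href` and the visible text of each link:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_link = False  # True while inside an <a>...</a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_link:
            self.links.append((self._href, ''.join(self._text)))
            self._in_link = False


# The string exactly as given in the question.
text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

parser = LinkCollector()
parser.feed(text)
parser.close()

print(parser.links)  # [('/status/ALL', 'ALL'), ('/status/ASSIGN', 'ASSIGN')]
```

The unmatched `</td>` tags are simply ignored here, since the collector only reacts to `a` tags.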
Upvotes: 2