Reputation: 577

Extracting data from anchor tags using regex in Python

I'm trying to extract the hyperlinks from a webpage using regex in Python.

Suppose my text string is:

text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

I want to extract ALL and ASSIGN, so I'm using this regular expression:

re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)

This returns only ASSIGN.

Can someone please point out the mistake in the regular expression? I'm really new to this topic.

Upvotes: 1

Answers (1)

Reputation: 1121824

You are using a regular expression, and matching HTML or XML with such expressions gets too complicated, too fast.
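As for the specific pattern in the question: the `.*` is greedy, so it swallows everything from the first `<a href=` up to the last `>` that is still followed by a word and `</a>`, which leaves only the final link text to capture. A minimal sketch of the difference follows; the variable names `greedy`, `lazy` and `narrow` are just illustrative, and the two alternative patterns are assumptions about what would suit this particular input rather than a general solution.

```python
import re

text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

# Original pattern: the greedy .* runs to the last possible '>' before a word and </a>.
greedy = re.findall(r'<a href=.*>(\w+)</a>', text, re.DOTALL)

# Non-greedy .*? stops at the first '>' that lets the rest of the pattern match.
lazy = re.findall(r'<a href=.*?>(\w+)</a>', text, re.DOTALL)

# Narrower still: only allow non-quote characters inside the quoted href value.
narrow = re.findall(r'<a href="[^"]*">(\w+)</a>', text)

print(greedy)  # ['ASSIGN']
print(lazy)    # ['ALL', 'ASSIGN']
print(narrow)  # ['ALL', 'ASSIGN']
```

This works for the sample string, but it breaks as soon as the markup varies (extra attributes, single quotes, nested tags inside the link), which is why a parser is the safer route.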

Please don't make it hard on yourself; use an HTML parser instead. Python has several to choose from:

ElementTree example:

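from xml.etree import ElementTree

# ElementTree needs well-formed XML (e.g. XHTML); sloppy HTML raises a ParseError.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('a') only sees direct children of the root.
for elem in tree.iter('a'):
    print(ElementTree.tostring(elem, encoding='unicode'))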
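ElementTree expects well-formed XML, and the snippet in the question is an HTML fragment with stray `</td>` tags. One option that tolerates such input is the standard library's `html.parser` module. The sketch below is only one way to do it; the class name `LinkTextCollector` and its attributes are made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._in_a = False  # True while inside an open <a>
        self._href = None   # href of the <a> currently open
        self._chunks = []   # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._in_a:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._chunks)))
            self._in_a = False


text = '<a href="/status/ALL">ALL</a></td>/n<a href="/status/ASSIGN">ASSIGN</a></td>'

parser = LinkTextCollector()
parser.feed(text)
parser.close()

print(parser.links)  # [('/status/ALL', 'ALL'), ('/status/ASSIGN', 'ASSIGN')]
```

The unmatched `</td>` tags and the text between the links are simply ignored here, because the collector only records data seen while an `<a>` is open.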

Upvotes: 2