Reputation: 4144
Let's say I have this HTML code:
<table id="test_table">
<td>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
</td>
</table>
<table id="test_table2">
<td>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
</td>
</table>
I want to match the hrefs only in #test_table and iterate over them. I tried something like this:
<table id="test_table">\s*<td>(\s*<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>\n)*
But this only matches the first element. I've been stuck on this for a couple of hours and can't get it right. Thank you for your help.
Upvotes: 2
Views: 128
Reputation: 4523
Your regex does capture the correct portion of the HTML.
The problem is that when a capturing group is repeated with + or * (for example (...(?P<anchor>.*?)...)* ), each repetition overwrites the previous one, so the groups() method returns only the final repetition.
For instance:
sss='''<table id="test_table">
<td>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#">#</a>
<a href="#last_url">#last_anch</a>
</td>
</table>
<table id="test_table2">
<td>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
<a href="#">#33</a>
</td>
</table>'''
import re
# raw string so that \s and \n reach the regex engine unchanged
res = r'<table id="test_table">\s*<td>(\s*<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>\n)*'
m = re.search(res, sss)
print(m.groups())
outputs:
(' <a href="#last_url">#last_anch</a>\n', '#last_url', '#last_anch')
I don't agree with the other posters that you should always use a dedicated HTML processor like BeautifulSoup. These can have high overhead and, for easy tasks, can take longer to code.
An alternative is to use two regular expressions, one to isolate the table and one to find every link inside it:
res='<table id="test_table">.*?</table>'
mm=re.search(res,sss,re.DOTALL)
results=[m.group('url','anchor') for m in re.finditer('<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>',mm.group())]
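The two-step approach above can be wrapped up as a small self-contained sketch. The function name links_in_table and the shortened sample markup are made up for illustration, and it assumes the table tag is written exactly as <table id="...">, with no other attributes:

```python
import re

# Shortened sample input (hypothetical hrefs, so the results are easy to tell apart).
html = '''<table id="test_table">
<td>
<a href="#first">one</a>
<a href="#second">two</a>
</td>
</table>
<table id="test_table2">
<td>
<a href="#other">#33</a>
</td>
</table>'''


def links_in_table(html, table_id):
    """Return (url, anchor) pairs for the <a> tags inside the given table."""
    # Step 1: isolate the table; the lazy .*? stops at the first </table>.
    table_re = r'<table id="%s">.*?</table>' % re.escape(table_id)
    table = re.search(table_re, html, re.DOTALL)
    if table is None:
        return []
    # Step 2: finditer yields one match per link, so no match overwrites another.
    link_re = r'<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>'
    return [m.group('url', 'anchor') for m in re.finditer(link_re, table.group())]


print(links_in_table(html, 'test_table'))
```

Because each link is its own match object, every url/anchor pair is available, which is not the case with a single repeated group. Nested tables would break step 1, since the lazy match stops at the first </table>.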
Upvotes: 0
Reputation: 17703
Also take a look at PyQuery. I like the jQuery familiarity it offers:
>>> from pyquery import PyQuery as pq
>>> html = '''<table id="test_table">
... <td>
... <a href="#">#</a>
... <a href="#">#</a>
... <a href="#">#</a>
... <a href="#">#</a>
... <a href="#">#</a>
... <a href="#">#</a>
... <a href="#">#</a>
... <a href="#">#</a>
... </td>
... </table>
... <table id="test_table2">
... <td>
... <a href="#">#33</a>
... <a href="#">#33</a>
... <a href="#">#33</a>
... <a href="#">#33</a>
... <a href="#">#33</a>
... <a href="#">#33</a>
... <a href="#">#33</a>
... <a href="#">#33</a>
... </td>
... </table>'''
>>> d = pq(html)
>>> for a in d('#test_table').find('a'):
...     print a.attrib.items()
...
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
Upvotes: 0
Reputation:
Do not use regex to parse HTML; use lxml for this.
Example using IPython ("test" is your file):
In [55]: import lxml.html
In [56]: x = lxml.html.fromstring(open("test").read())
In [57]: for i in x.iterlinks():
   ....:     print i  # print ALL links
   ....:
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
In [58]: path = x.xpath("./table[@id='test_table']")[0]
In [59]: for i in path.iterlinks():
   ....:     print i
   ....:
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)
Using XPath makes this much easier: fewer headaches and less coffee ;)
Upvotes: 1
Reputation: 1121484
For HTML, use the right tool. Use an HTML parser instead, like BeautifulSoup:
from bs4 import BeautifulSoup
# html holds the markup from the question; the parser is named explicitly
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='test_table')
for anchor in table.find_all('a'):
    print(anchor['href'], anchor.string)
Do not use a regular expression; matching HTML with such expressions gets too complicated, too fast. Don't do that.
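If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not a replacement for BeautifulSoup: the class name TableLinkParser and the shortened sample markup are made up, and it assumes the markup is reasonably well formed (every <table> and <a> is closed).

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags inside one table, chosen by id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0      # > 0 while inside the target table (counts nested tables)
        self.href = None    # href of the <a> currently open, if any
        self.text = []      # text fragments seen inside the open <a>
        self.links = []     # collected (href, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'table':
            # Enter the target table, or go one level deeper if already inside it.
            if self.depth or attrs.get('id') == self.table_id:
                self.depth += 1
        elif tag == 'a' and self.depth:
            self.href = attrs.get('href')
            self.text = []

    def handle_data(self, data):
        # Only keep text that appears inside a tracked <a>.
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'table' and self.depth:
            self.depth -= 1
        elif tag == 'a' and self.href is not None:
            self.links.append((self.href, ''.join(self.text)))
            self.href = None


# Shortened sample input (hypothetical hrefs, so the results are easy to tell apart).
html = '''<table id="test_table">
<td>
<a href="#first">one</a>
<a href="#second">two</a>
</td>
</table>
<table id="test_table2">
<td>
<a href="#other">#33</a>
</td>
</table>'''

parser = TableLinkParser('test_table')
parser.feed(html)
print(parser.links)
```

The parser tracks state by tag events rather than by text patterns, so extra attributes or whitespace inside the tags do not affect it.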
Upvotes: 3