Yoan Arnaudov

Reputation: 4144

Python regex match in matched elements with one regex

Let's say I have this HTML:

<table id="test_table">
    <td>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
    </td>
</table>
<table id="test_table2">
    <td>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
    </td>
</table>

I want to match the hrefs only in #test_table and iterate over them. I tried something like this:

<table id="test_table">\s*<td>(\s*<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>\n)*

But this only matches the first element. I've been stuck on this for a couple of hours and can't get it right. Thank you for your help.

Upvotes: 2

Views: 128

Answers (4)

user1149913

Reputation: 4523

Your regex does capture the correct portion of HTML.

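The problem is that when a capturing group is repeated with + or * (for example, the outer (...)* in your pattern), the groups() method returns only what the group captured on its final repetition. The named groups url and anchor sit inside that repeated group, so they are overwritten on each repetition as well.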

For instance:

sss='''<table id="test_table">
    <td>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#">#</a>
        <a href="#last_url">#last_anch</a>
    </td>
</table>
<table id="test_table2">
    <td>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
        <a href="#">#33</a>
    </td>
</table>'''

import re

# Raw string so that \s and \n reach the regex engine unchanged
res=r'<table id="test_table">\s*<td>(\s*<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>\n)*'
m=re.search(res,sss)
print(m.groups())

outputs:

('        <a href="#last_url">#last_anch</a>\n', '#last_url', '#last_anch')
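
The same behaviour is easier to see on a smaller input. This is a minimal sketch; the sample string and variable names are made up for illustration and are not part of the question:

```python
import re

text = 'a,b,c,'  # hypothetical sample input

# The group (\w) is repeated three times, but the match object keeps
# only what it captured on the last repetition.
repeated = re.match(r'(?:(\w),)+', text)
last_only = repeated.groups()

# Matching one item at a time with findall keeps every capture.
all_items = re.findall(r'(\w),', text)

print(last_only)
print(all_items)
```

Here last_only holds a single value while all_items holds all three, which is why the approach below matches the links one at a time.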

I don't agree with the other posters that you should always use a dedicated HTML processor like BeautifulSoup. These can have high overhead and, for easy tasks, can take longer to code.

An alternative would be to use two re's as below:

res='<table id="test_table">.*?</table>'
mm=re.search(res,sss,re.DOTALL)
results=[m.group('url','anchor') for m in re.finditer('<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>',mm.group())]
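
If it helps, the two-step idea can be wrapped in a small function that also copes with the table being absent. This is only a sketch: the function name links_in_test_table and the shortened sample markup are assumptions made for the example.

```python
import re

# Step 1 pattern: isolate the table. Step 2 pattern: one link at a time.
TABLE_RE = re.compile(r'<table id="test_table">.*?</table>', re.DOTALL)
LINK_RE = re.compile(r'<a href="(?P<url>.*?)">(?P<anchor>.*?)</a>')

def links_in_test_table(html):
    """Return (url, anchor) pairs found inside the test_table element."""
    table = TABLE_RE.search(html)
    if table is None:  # table not present: nothing to report
        return []
    return [m.group('url', 'anchor') for m in LINK_RE.finditer(table.group())]

# Hypothetical, shortened version of the markup from the question
sample = '''<table id="test_table">
    <td>
        <a href="#one">first</a>
        <a href="#two">second</a>
    </td>
</table>
<table id="test_table2">
    <td>
        <a href="#three">third</a>
    </td>
</table>'''

for url, anchor in links_in_test_table(sample):
    print(url, anchor)
```

Note that the non-greedy .*?</table> stops at the first closing tag, so this would cut a table short if another table were nested inside it.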

Upvotes: 0

Bryan

Reputation: 17703

Also take a look at PyQuery; I like the jQuery familiarity it offers:

>>> from pyquery import PyQuery as pq
>>> html = '''<table id="test_table">
...     <td>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...         <a href="#">#</a>
...     </td>
... </table>
... <table id="test_table2">
...     <td>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...         <a href="#">#33</a>
...     </td>
... </table>'''
>>> d = pq(html)
>>> for a in d('#test_table').find('a'):
...     print(a.attrib.items())
...
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]
[('href', '#')]

Upvotes: 0

user689383

Reputation:

Do not use regex to parse HTML; use lxml for this.

Example using IPython (test is your file):

In [55]: import lxml.html

In [56]: x = lxml.html.fromstring(open("test").read())

In [57]: for i in x.iterlinks():
   ....:     print(i)  # print ALL links
   ....:     
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8c50>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)

In [58]: path = x.xpath("./table[@id='test_table']")[0]

In [59]: for i in path.iterlinks():
   ....:     print(i)
   ....:     
(<Element a at 0x1bb7110>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)
(<Element a at 0x1ba89b0>, 'href', '#', 0)
(<Element a at 0x1ba8e30>, 'href', '#', 0)
(<Element a at 0x1bb7050>, 'href', '#', 0)

Using XPath makes things much easier: fewer headaches and less coffee ;)

Upvotes: 1

Martijn Pieters

Reputation: 1121484

For HTML, use the right tool. Use an HTML parser instead, like BeautifulSoup:

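from bs4 import BeautifulSoup

# html is the markup from the question; naming the parser avoids a warning
soup = BeautifulSoup(html, 'html.parser')

# Find the table by id, then walk every link inside it
table = soup.find('table', id='test_table')
for anchor in table.find_all('a'):
    print(anchor['href'], anchor.string)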

Do not use a regular expression; matching HTML with such expressions gets too complicated, too fast. Don't do that.
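
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser instead of BeautifulSoup. This is a different, lower-level tool than the one shown above; the class name TableLinkParser, its attributes, and the shortened sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser

class TableLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags inside one table, chosen by id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.table_depth = 0      # > 0 while inside the wanted table
        self.current_href = None  # href of the <a> currently open, if any
        self.current_text = []    # text pieces of the <a> currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'table':
            # Enter the wanted table, or track a table nested inside it
            if self.table_depth or attrs.get('id') == self.table_id:
                self.table_depth += 1
        elif tag == 'a' and self.table_depth and 'href' in attrs:
            self.current_href = attrs['href']
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'table' and self.table_depth:
            self.table_depth -= 1
        elif tag == 'a' and self.current_href is not None:
            self.links.append((self.current_href, ''.join(self.current_text)))
            self.current_href = None

# Hypothetical, shortened version of the markup from the question
sample = '''<table id="test_table">
    <td>
        <a href="#one">first</a>
        <a href="#two">second</a>
    </td>
</table>
<table id="test_table2">
    <td>
        <a href="#three">third</a>
    </td>
</table>'''

parser = TableLinkParser('test_table')
parser.feed(sample)
parser.close()
for href, text in parser.links:
    print(href, text)
```

It is more code than the BeautifulSoup version, but it still leaves tag and attribute parsing to a real parser rather than to a pattern.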

Upvotes: 3
