Sreejith Sasidharan

Reputation: 1378

Matching multiple lines in a Python regular expression

I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any results. The HTML between the <tr> tags spans multiple lines.

category = re.findall('<tr>(.*?)</tr>', data)

Please suggest a fix for this problem.

Upvotes: 13

Views: 15249

Answers (5)

Tendayi Mawushe

Reputation: 26138

As others have suggested, the specific problem you are having can be resolved by letting `.` match newlines with re.DOTALL (re.MULTILINE on its own only changes how ^ and $ behave).

However, you are going down a treacherous path by parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

from bs4 import BeautifulSoup  # BeautifulSoup 4 (the `bs4` package)
soup = BeautifulSoup(doc, "html.parser")
all_trs = soup.find_all("tr")

Upvotes: 0

ghostdog74

Reputation: 343211

import re
# re.DOTALL lets '.' match newlines; re.M is unnecessary here (no ^ or $ in the pattern)
pat = re.compile('<tr>(.*?)</tr>', re.DOTALL)
print(pat.findall(data))

Or, without a regex:

for item in data.split("</tr>"):
    if "<tr>" in item:
        # keep everything after the first "<tr>" in this chunk
        print(item[item.find("<tr>") + len("<tr>"):])
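
The same split-based idea can be wrapped in a function that returns a list instead of printing. This is only a sketch: the name `rows_between_tr` and the sample `data` string are made up for illustration, and like the loop above it only recognises a literal `<tr>`.

```python
def rows_between_tr(data):
    """Return the text between each literal <tr> and the next </tr>."""
    rows = []
    for chunk in data.split("</tr>"):
        # partition() splits on the first "<tr>"; sep is "" when it is absent.
        _, sep, inner = chunk.partition("<tr>")
        if sep:
            rows.append(inner)
    return rows


# Hypothetical sample input with one multi-line row and one single-line row.
data = "<table>\n<tr>\n<td>a</td>\n</tr>\n<tr><td>b</td></tr>\n</table>"
print(rows_between_tr(data))
```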

Upvotes: 2

SilentGhost

Reputation: 320049

Just to clear up the issue: despite all those links to re.M, it wouldn't work here, as a quick skim of its documentation would reveal. You'd need re.S (if you weren't trying to parse HTML, of course):

>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
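
To spell out what each flag changes, here is a smaller sketch. The `text` string is a made-up sample; the point is that re.M only affects where ^ and $ can match, while re.S affects what `.` can match.

```python
import re

# Hypothetical sample: one row whose content spans several lines.
text = "<tr>\nrow 1\n</tr>"

# Without flags, '.' stops at newlines, so nothing matches.
no_flags = re.findall(r"<tr>(.*?)</tr>", text)

# re.M only lets ^ and $ match at line boundaries; '.' is unchanged.
with_m = re.findall(r"<tr>(.*?)</tr>", text, re.M)

# re.S (alias re.DOTALL) lets '.' match newlines as well.
with_s = re.findall(r"<tr>(.*?)</tr>", text, re.S)

# The inline flag (?s) is equivalent to passing re.S.
inline_s = re.findall(r"(?s)<tr>(.*?)</tr>", text)

# What re.M does affect: ^ can now match at the start of every line.
line_starts = re.findall(r"^row \d", text, re.M)

for result in (no_flags, with_m, with_s, inline_s, line_starts):
    print(result)
```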

Upvotes: 18

Mark Byers

Reputation: 839264

Don't use regex; use an HTML parser such as BeautifulSoup:

html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'

from bs4 import BeautifulSoup  # BeautifulSoup 4 (the `bs4` package)
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("tr"))

Result:

[<tr>bar</tr>, <tr>qux</tr>]

If you just want the contents, without the tr tags:

for tr in soup.find_all("tr"):
    print(tr.string)

Result:

bar
qux

Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex that will be posted here.
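
One way to see the reliability problem: the pattern from the question only matches a literal `<tr>`, so rows with attributes or upper-case tags are silently skipped. The `sample` string below is made up for illustration, and the "patched" pattern is just one possible workaround, not a recommendation.

```python
import re

# Hypothetical sample with an attribute, an upper-case tag, and a plain row.
sample = """<table>
<tr class="odd"><td>a</td></tr>
<TR><td>b</td></TR>
<tr><td>c</td></tr>
</table>"""

# The pattern from the question (with re.S added) finds only the plain row.
naive = re.findall(r"<tr>(.*?)</tr>", sample, re.S)

# Allowing attributes and ignoring case recovers all three rows,
# but this still breaks on nested tables, comments, and similar cases.
patched = re.findall(r"<tr\b[^>]*>(.*?)</tr>", sample, re.S | re.I)

print(naive)
print(patched)
```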

Upvotes: 5

Ignacio Vazquez-Abrams

Reputation: 799580

Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.
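
If installing a third-party package is not an option, a rough sketch with the standard library's html.parser module looks like this. Note that this swaps in html.parser rather than lxml or BeautifulSoup, the `RowCollector` class is a made-up helper, and it assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Hypothetical helper: collect cell text, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <TR class="odd">
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </TR>
</table>"""

collector = RowCollector()
collector.feed(doc)
collector.close()
print(collector.rows)
```

Attributes and tag case need no special handling here, because the parser deals with them before the handler methods are called.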

Upvotes: 2
