Reputation: 1378
I want to extract the data between <tr> tags from an HTML page. I used the following code, but I didn't get any results. The HTML between the <tr> tags spans multiple lines.
category =re.findall('<tr>(.*?)</tr>',data);
Please suggest a fix for this problem.
Upvotes: 13
Views: 15249
Reputation: 26138
As others have suggested, the specific problem you are having can be resolved by letting . match across line breaks with re.DOTALL (re.MULTILINE on its own only changes how ^ and $ behave).
However, you are going down a treacherous path parsing HTML with regular expressions. Use an XML/HTML parser instead; BeautifulSoup works great for this!
doc = """<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>"""
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(doc)
all_trs = soup.findAll("tr")
Upvotes: 0
Reputation: 343211
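import re
# re.DOTALL lets "." match newlines; re.M only affects ^ and $ and is not needed here
pat = re.compile('<tr>(.*?)</tr>', re.DOTALL | re.M)
print(pat.findall(data))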
Or, a non-regex way:
for item in data.split("</tr>"):
    if "<tr>" in item:
        print(item[item.find("<tr>") + len("<tr>"):])
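The split-based loop can also be wrapped in a function that returns the rows instead of printing them. This is only a minimal Python 3 sketch: the name rows_between_tr and the sample string are made up for illustration, and like the loop above it only recognises a literal <tr> with no attributes.

```python
def rows_between_tr(data):
    """Return the text found between each literal <tr> and the next </tr>."""
    rows = []
    for chunk in data.split("</tr>"):
        start = chunk.find("<tr>")
        if start != -1:
            # Keep everything after the opening tag in this chunk.
            rows.append(chunk[start + len("<tr>"):])
    return rows


# Hypothetical sample input, not taken from the question.
sample = "<table>\n<tr>\n<td>a</td>\n</tr>\n<tr>\n<td>b</td>\n</tr>\n</table>"
rows = rows_between_tr(sample)
print(rows)
```

Note that a <tr> that is never closed would still be picked up by this approach, which is one more reason to prefer a real parser.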
Upvotes: 2
Reputation: 320049
Just to clear up the issue: despite all those links to re.M, it wouldn't work here, as a quick skim of its documentation would reveal. You'd need re.S instead, although you shouldn't be parsing HTML with a regex in the first place, of course:
>>> doc = """<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>"""
>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n<td>row 1, cell 1</td>\n<td>row 1, cell 2</td>\n',
'\n<td>row 2, cell 1</td>\n<td>row 2, cell 2</td>\n']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]
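For anyone who wants to check the difference between the two flags as a script rather than in the interpreter, here is a small Python 3 sketch. The variable names and the shortened one-row table are my own choices, not from the question.

```python
import re

# Shortened, hypothetical version of the table used above.
doc = """<table border="1">
<tr>
<td>row 1, cell 1</td>
</tr>
</table>"""

pattern = r"<tr>(.*?)</tr>"

# Without flags, "." stops at every newline, so nothing matches.
no_flag = re.findall(pattern, doc)

# re.MULTILINE (re.M) only changes ^ and $; "." still stops at newlines.
multiline = re.findall(pattern, doc, re.MULTILINE)

# re.DOTALL (re.S) makes "." match newlines too, so the row is captured.
dotall = re.findall(pattern, doc, re.DOTALL)

print(no_flag, multiline, dotall)
```

The same effect is available inline by starting the pattern with (?s).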
Upvotes: 18
Reputation: 839264
Don't use regex; use an HTML parser such as BeautifulSoup:
html = '<html><body>foo<tr>bar</tr>baz<tr>qux</tr></body></html>'
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
print soup.findAll("tr")
Result:
[<tr>bar</tr>, <tr>qux</tr>]
If you just want the contents, without the tr tags:
for tr in soup.findAll("tr"):
    # .string is the tag's single text child; .contents would print a list
    print(tr.string)
Result:
bar
qux
Using an HTML parser isn't as scary as it sounds! And it will work more reliably than any regex that will be posted here.
Upvotes: 5
Reputation: 799580
Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.
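If installing lxml or BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. To be clear, this swaps in a different parser from the two named above, and the RowCollector class is a hypothetical helper written for this example (Python 3), not part of any library.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently open, if any
        self._cell = None  # text pieces of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


doc = """<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>"""

collector = RowCollector()
collector.feed(doc)
collector.close()
print(collector.rows)
```

Unlike the regex, this keeps working when the <tr> tags carry attributes or different capitalisation, since the parser normalises tag names before the handlers see them.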
Upvotes: 2