Reputation: 87
I'm trying to find a string inside an HTML page with known patterns. For example, in the following HTML code:
<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" width="50%"> </TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE>
<HR>
<TABLE WIDTH="100%">
<TR> <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD>
<TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A HREF=#h2></A><B><I></I></B>String</B></TD>
</TR>
<TR><TD ALIGN="LEFT" WIDTH="50%"><b>String 2.</B>
</TD>
<TD ALIGN="RIGHT" WIDTH="50%"> <B>
String 3
</B></TD>
</TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
...
I want to find String 4, and I know that it will always be between
<HR><font size="+1">
and </font><BR>
How can I search for the string using a regular expression?
Edit:
I've tried the following, without success:
p = re.match('<HR><font size="+1">(.*?)</font><BR>',html)
Thanks.
Upvotes: 1
Views: 2973
Reputation: 2804
re.findall(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)
findall returns a list of everything captured by the parentheses (the group) in the regular expression. I used re.DOTALL so that the dot also matches newlines.
I used \s* because I was not sure whether there would be any whitespace between <HR> and <font>.
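As a rough illustration of why the attempt in the question finds nothing while the pattern above does, here is a small comparison. The html string is a trimmed-down stand-in for the real page (an assumption, not the asker's actual data), and the names failed and found are only for this sketch.

```python
import re

# Trimmed-down stand-in for the page in the question (an assumption).
html = '''</TABLE>
<HR>
<font size="+1">String 4</font><BR>
...'''

# The attempt from the question: re.match only tries to match at the very
# start of the string, and the unescaped + acts as a quantifier on the
# preceding '"' instead of matching a literal plus sign.
failed = re.match('<HR><font size="+1">(.*?)</font><BR>', html)

# Escaping the + and allowing whitespace after <HR> lets findall locate
# the text anywhere in the page.
found = re.findall(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)

print(failed)  # None
print(found)   # ['String 4']
```

If only the first occurrence is needed, re.search with the same pattern followed by .group(1) should behave the same way on a match.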
Upvotes: 4
Reputation: 5738
re.findall(r'<HR>\n<font size="\+1">([^<]*)<\/font><BR>', html, re.MULTILINE)
Upvotes: 0
Reputation: 1308
This works, but may not be very robust:
import re
r = re.compile(r'<HR>\s?<font size="\+1">(.+?)</font>\s?<BR>', re.IGNORECASE)
r.findall(html)
You will be better off using a proper HTML parser. BeautifulSoup is excellent and easy to use. Look it up.
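BeautifulSoup is a third-party package, so as a dependency-free sketch of the parser approach, the following uses the standard library's html.parser module instead. The class name FontAfterHrParser and the sample html string are made up for illustration, and the rule "a font tag with size +1 that directly follows an HR" is my reading of the question rather than a tested solution for the real page.

```python
from html.parser import HTMLParser


class FontAfterHrParser(HTMLParser):
    """Collects the text of each <font size="+1"> that directly follows an <hr>."""

    def __init__(self):
        super().__init__()
        self.after_hr = False   # True while the most recent tag seen was <hr>
        self.in_target = False  # True while inside a matching <font>
        self.results = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "font" and self.after_hr and dict(attrs).get("size") == "+1":
            self.in_target = True
            self.results.append("")
        self.after_hr = (tag == "hr")

    def handle_endtag(self, tag):
        if tag == "font":
            self.in_target = False
        if tag != "hr":
            self.after_hr = False

    def handle_data(self, data):
        if self.in_target:
            self.results[-1] += data
        elif data.strip():
            # Real text between <hr> and <font> means they are not adjacent.
            self.after_hr = False


# Sample input modelled on the question (an assumption, not the real page).
html = '''<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD></TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
'''

parser = FontAfterHrParser()
parser.feed(html)
print(parser.results)  # ['String 4']
```

Unlike the regular expressions, this does not depend on tag case, attribute quoting style, or the exact whitespace between the tags.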
Upvotes: 2