Reputation: 35
Suppose I have this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML LANG="ja">
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<META name="GENERATOR" content="snanail Version 2.18">
<TITLE>-www.example.org-</TITLE>
<STYLE type="text/css">
<!--
H1.TITLE {
font-size : 10 pt;
font-family : "Arial";
color : #FFFFFF;
}
-->
</STYLE>
</HEAD>
<BODY>
<CENTER>
<TABLE BORDER="0" CELLSPACING="1" CELLPADDING="6" ALIGN="CENTER">
<TR>
<TD WIDTH="100">
<TABLE ALIGN="CENTER" CELLPADDING="4" CELLSPACING="1">
<TR>
<TD HEIGHT="100" WIDTH= "68" ALIGN="CENTER" VALIGN="MIDDLE">
<A HREF="001.html" TARGET="_blank"><IMG SRC="001_thumb.png" WIDTH="56" HEIGHT="80" ALT="001_thumb.png" BORDER="0"></A>
</TD>
</TR>
<TR>
<TD HEIGHT="40" ALIGN="CENTER" VALIGN="MIDDLE">
<FONT SIZE="2" COLOR="#FFFFFF">001.jpg</FONT><BR>
<FONT SIZE="2" COLOR="#FFFFFF">300 x 300 (806 KB)</FONT><BR>
</TD>
</TR>
</TABLE>
</TD>
<TD WIDTH="100">
<TABLE ALIGN="CENTER" CELLPADDING="4" CELLSPACING="1">
<TR>
<TD HEIGHT="100" WIDTH= "68" ALIGN="CENTER" VALIGN="MIDDLE">
<A HREF="002.html" TARGET="_blank"><IMG SRC="002_thumb.png" WIDTH="56" HEIGHT="80" ALT="002_thumb.png" BORDER="0"></A>
</TD>
</TR>
<TR>
<TD HEIGHT="40" ALIGN="CENTER" VALIGN="MIDDLE">
<FONT SIZE="2" COLOR="#FFFFFF">002.jpg</FONT><BR>
<FONT SIZE="2" COLOR="#FFFFFF">300 x 300 (627 KB)</FONT><BR>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
</CENTER>
</HTML>
I want to find all the URLs in the page, so I do:
import lxml.html
tree = lxml.html.parse('example.html')
links = tree.xpath('//a/@href')
Yet I only get the first one (001.html). Why is that? I've tried manually iterating over the tree after calling getroot(), and it seems only the first table with the first URL is visible. I don't understand.
Edit: I tested again with the example I originally posted and it actually worked. After some more testing, it seems that if I remove the head, it works. Maybe something in it is breaking the parser? I guess one way to solve this would be to search the file and remove everything between <head> and </head>, since the parser isn't behaving as expected with the head in place. I've added the head to the example above so that it reproduces the problem.
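The workaround described in the edit (dropping everything between <head> and </head> before parsing) could be sketched roughly as below. This is only an illustration under assumptions: strip_head and HEAD_RE are hypothetical names, the sample string is a shortened stand-in for the real page, and a regular expression is a blunt tool for HTML that treats the symptom rather than the cause.

```python
import re

# Matches from the opening <head ...> tag through </head>, ignoring case
# and spanning newlines (DOTALL). The \b keeps it from matching <header>.
# Hypothetical helper pattern, not part of lxml.
HEAD_RE = re.compile(r"<head\b.*?</head\s*>", re.IGNORECASE | re.DOTALL)


def strip_head(html_text):
    """Return html_text with the first <head>...</head> block removed."""
    return HEAD_RE.sub("", html_text, count=1)


# Shortened stand-in for the page in the question (an assumption).
sample = (
    '<HTML LANG="ja">\n'
    '<HEAD>\n'
    '<META http-equiv="Content-Type" content="text/html; charset=Shift_JIS">\n'
    '<TITLE>-www.example.org-</TITLE>\n'
    '</HEAD>\n'
    '<BODY><A HREF="001.html">1</A><A HREF="002.html">2</A></BODY>\n'
    '</HTML>'
)

cleaned = strip_head(sample)
print(cleaned)
```

The cleaned string could then be handed to lxml.html.fromstring(cleaned) and queried with the same XPath as before.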
Upvotes: 2
Views: 230
Reputation: 120678
Using the example HTML file and this script:
from lxml import etree
# Use the HTML parser and state the input encoding explicitly.
parser = etree.HTMLParser(encoding='utf8')
tree = etree.parse('source.html', parser)
print(tree.xpath('//a/@href'))
Gives:
['001.html', '002.html']
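For anyone who wants to try this without the file on disk, here is a self-contained variant of the same idea. It is a sketch under assumptions: SAMPLE is a trimmed-down stand-in for the page in the question, not the real file, and it contains only ASCII, so it says nothing about how genuinely Shift_JIS-encoded text would behave.

```python
from lxml import etree

# Trimmed-down stand-in for the page in the question (an assumption).
SAMPLE = b"""<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML LANG="ja">
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<TITLE>-www.example.org-</TITLE>
</HEAD>
<BODY>
<TABLE><TR>
<TD><A HREF="001.html" TARGET="_blank"><IMG SRC="001_thumb.png"></A></TD>
<TD><A HREF="002.html" TARGET="_blank"><IMG SRC="002_thumb.png"></A></TD>
</TR></TABLE>
</BODY>
</HTML>"""

# Pass the encoding to the parser explicitly, as in the script above.
parser = etree.HTMLParser(encoding="utf-8")

# With an HTMLParser, fromstring() returns the root <html> element.
root = etree.fromstring(SAMPLE, parser)

# The HTML parser lowercases tag and attribute names, so //a/@href
# matches the upper-case <A HREF=...> tags in the source.
links = root.xpath("//a/@href")
print(links)
```

If the links come back complete here but not from the real file, comparing the file's actual byte encoding with what its META tag declares would be a reasonable next thing to check.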
Upvotes: 1
Reputation: 23303
Did you try declaring your document as XHTML?
The doctype at the beginning of your example says that you are using HTML, which is not valid XML, so a strict XML parser will likely stop processing the input right after the doctype. Remember that XPath needs valid XML input in order to work.
So if you use an XHTML doctype, the XML parser should no longer break on the doctype and should parse the input in its entirety.
Upvotes: 0