user1017304

Reputation: 35

Python: Why does XPath seem to process only the first element in this tree?

Suppose I have this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML LANG="ja">
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<META name="GENERATOR" content="snanail Version 2.18">
<TITLE>-www.example.org-</TITLE>

<STYLE type="text/css">
<!--
H1.TITLE {
font-size : 10 pt;
font-family : "Arial";
color : #FFFFFF;
}
-->
</STYLE>

</HEAD>
<BODY>
<CENTER>
<TABLE BORDER="0" CELLSPACING="1" CELLPADDING="6" ALIGN="CENTER">
<TR>
  <TD WIDTH="100">
    <TABLE ALIGN="CENTER" CELLPADDING="4" CELLSPACING="1">
      <TR>
        <TD HEIGHT="100" WIDTH= "68" ALIGN="CENTER" VALIGN="MIDDLE">
          <A HREF="001.html" TARGET="_blank"><IMG SRC="001_thumb.png" WIDTH="56" HEIGHT="80" ALT="001_thumb.png" BORDER="0"></A>
        </TD>
      </TR>
      <TR>
        <TD HEIGHT="40" ALIGN="CENTER" VALIGN="MIDDLE">
          <FONT SIZE="2" COLOR="#FFFFFF">001.jpg</FONT><BR>
          <FONT SIZE="2" COLOR="#FFFFFF">300 x 300 (806 KB)</FONT><BR>
        </TD>
      </TR>
    </TABLE>
  </TD>
  <TD WIDTH="100">
    <TABLE ALIGN="CENTER" CELLPADDING="4" CELLSPACING="1">
      <TR>
        <TD HEIGHT="100" WIDTH= "68" ALIGN="CENTER" VALIGN="MIDDLE">
          <A HREF="002.html" TARGET="_blank"><IMG SRC="002_thumb.png" WIDTH="56" HEIGHT="80" ALT="002_thumb.png" BORDER="0"></A>
        </TD>
      </TR>
      <TR>
        <TD HEIGHT="40" ALIGN="CENTER" VALIGN="MIDDLE">
          <FONT SIZE="2" COLOR="#FFFFFF">002.jpg</FONT><BR>
          <FONT SIZE="2" COLOR="#FFFFFF">300 x 300 (627 KB)</FONT><BR>
        </TD>
      </TR>
    </TABLE>
  </TD>
</TR>
</TABLE>
</CENTER>
</HTML>

I want to find all the URLs in the page, so I do:

import lxml.html

tree = lxml.html.parse('example.html')
links = tree.xpath('//a/@href')

Yet I only get the first one (001.html). Why is that? I've tried manually iterating over the tree after calling getroot(), and only the first table with the first URL seems to be visible. I don't understand.

Edit: I tested again with the example I originally posted and it actually worked. After some more testing, it seems that if I remove the head, it works. Maybe something in it is breaking the parser? I guess one way to solve this would be to search the file and remove everything between <head> and </head>, since the parser isn't working as expected. I've added the head to the example above so that it reproduces the problem.
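The head-stripping workaround could be sketched roughly as follows. This is only an illustration using the standard library's re module: strip_head and HEAD_RE are hypothetical names, the sample string is a shortened stand-in for the real file, and a regular expression is a blunt tool for HTML in general.

```python
import re

# Hypothetical helper pattern: match from <head ...> through </head>.
# re.IGNORECASE covers the upper-case tags used in the example;
# re.DOTALL lets "." span line breaks.
HEAD_RE = re.compile(r"<head\b.*?</head\s*>", re.IGNORECASE | re.DOTALL)


def strip_head(markup):
    """Return markup with the first <head>...</head> block removed."""
    return HEAD_RE.sub("", markup, count=1)


# Shortened stand-in for the page in the question.
sample = """<HTML LANG="ja">
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<TITLE>-www.example.org-</TITLE>
</HEAD>
<BODY>
<A HREF="001.html">one</A>
<A HREF="002.html">two</A>
</BODY>
</HTML>"""

cleaned = strip_head(sample)
print(cleaned)
```

The cleaned string could then be handed to lxml.html.fromstring() instead of parsing the file directly. Whether the head really is the culprit is still only a guess at this point.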

Upvotes: 2

Views: 230

Answers (2)

ekhumoro

Reputation: 120678

Using the example HTML file and this script:

from lxml import etree

parser = etree.HTMLParser(encoding='utf8')
tree = etree.parse('source.html', parser)
print(tree.xpath('//a/@href'))

Gives:

['001.html', '002.html']
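For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same approach. It assumes lxml is installed; SAMPLE is a trimmed-down, ASCII-only stand-in for the question's page and extract_links is a hypothetical helper name. It only shows that an HTMLParser with an explicit encoding sees both links in this sample, not why the original file misbehaved.

```python
import io

from lxml import etree

# Trimmed-down stand-in for the question's file, kept as bytes so the
# parser has to decode it itself. It keeps the Shift_JIS META tag.
SAMPLE = b"""<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML LANG="ja">
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<TITLE>-www.example.org-</TITLE>
</HEAD>
<BODY>
<A HREF="001.html" TARGET="_blank">first</A>
<A HREF="002.html" TARGET="_blank">second</A>
</BODY>
</HTML>"""


def extract_links(data, encoding="utf8"):
    """Parse HTML bytes with an explicit encoding and return all href values."""
    parser = etree.HTMLParser(encoding=encoding)
    tree = etree.parse(io.BytesIO(data), parser)
    # The HTML parser lower-cases tag and attribute names, so //a/@href
    # matches the upper-case <A HREF=...> in the source.
    return [str(href) for href in tree.xpath("//a/@href")]


links = extract_links(SAMPLE)
print(links)
```

Passing the encoding to the parser tells it how to decode the bytes rather than leaving it to whatever the document declares about itself.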

Upvotes: 1

Adrien Plisson

Reputation: 23303

Did you try declaring your document as XHTML?

The doctype at the beginning of your example says that you are using HTML, which is NOT valid XML, so an XML parser will likely stop processing the input just after the doctype. Remember that XPath needs valid XML input in order to work.

So, if you use an XHTML doctype, the XML parser should no longer break on the doctype and should parse the input in its entirety.

Upvotes: 0
