Python: Why is Xpath seemingly only processing the first element in this tree?

Question

Suppose I have this:

-www.example.org-

001.jpg
300 x 300 （806 KB）

002.jpg
300 x 300 （627 KB）
I want to find all the URLs in the page, so I do:

import lxml.html

tree = lxml.html.parse('example.html')
links = tree.xpath('//a/@href')

Yet I only get the first one (001.html). Why is that? I've tried manually iterating over the tree after calling getroot(), and it seems only the first table, with the first URL, is visible. I don't understand.

Edit: I tested again with the example I posted and it actually worked. After some more testing, it seems that if I remove the head, it works. Maybe something in it is breaking the parser? I guess the best way to solve this would be to search the file and remove anything between <head> and </head>, since I can't rely on the parser while it behaves like this. I've added the head to the example so that it breaks.
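
Removing the head section before parsing could be sketched roughly like this. It assumes the whole file fits in memory and contains a single head block. The helper name `strip_head` and the sample markup are hypothetical, not taken from the real page, and this only works around the symptom rather than explaining it:

```python
import re

# Matches an opening <head ...> tag through its closing </head>, across lines.
# The \b keeps tags such as <header> from matching.
HEAD_RE = re.compile(r'<head\b[^>]*>.*?</head\s*>', re.IGNORECASE | re.DOTALL)


def strip_head(html_text):
    """Return html_text with the first <head>...</head> block removed."""
    return HEAD_RE.sub('', html_text, count=1)


# Hypothetical stand-in for the real page.
sample = (
    '<html><head><title>-www.example.org-</title></head>'
    '<body><a href="001.html">001.jpg</a>'
    '<a href="002.html">002.jpg</a></body></html>'
)

cleaned = strip_head(sample)
print(cleaned)
```

The cleaned string could then be passed to `lxml.html.fromstring` instead of parsing the file directly.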

ekhumoro · Accepted Answer

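Using the example HTML file and this script:

from lxml import etree

# State the file's encoding explicitly rather than letting the parser guess it.
parser = etree.HTMLParser(encoding='utf8')
tree = etree.parse('source.html', parser)
# Collect the href attribute of every <a> element in the document.
print(tree.xpath('//a/@href'))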

Gives:

['001.html', '002.html']
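
For anyone who wants to try this without the original file, here is a self-contained variant of the same idea that parses in-memory bytes. The markup below is a hypothetical stand-in for source.html (the real head section is not reproduced here), so treat it as a sketch of the approach, not a reproduction of the original problem:

```python
from lxml import etree

# Hypothetical stand-in for source.html, encoded as UTF-8 bytes.
html_bytes = (
    '<html><head><title>-www.example.org-</title></head><body>'
    '<table><tr><td><a href="001.html">001.jpg</a> 300 x 300 （806 KB）</td></tr></table>'
    '<table><tr><td><a href="002.html">002.jpg</a> 300 x 300 （627 KB）</td></tr></table>'
    '</body></html>'
).encode('utf-8')

# Same parser setup as above: the encoding is given explicitly.
parser = etree.HTMLParser(encoding='utf-8')
root = etree.fromstring(html_bytes, parser)

# Collect the href attribute of every <a> element.
links = root.xpath('//a/@href')
print(links)
```

If the real page is stored in a different encoding, the `encoding` argument would need to name that encoding instead.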
