Correct XPath for element

Question

I am trying to scrape some data from this page.

I am using requests and lxml in Python to do so. Specifically, I want the ids of the detected topics.

I wrote the following XPath for them:

'//detectedTopic//@id'

This returned nothing.

Whereas the following worked without any issues:

'//@id'

The developer tools in Chrome showed that the first XPath does point to the correct nodes.

What's wrong with it then?

unutbu · Accepted Answer

If you use lxml.html to parse the content, the HTMLParser converts all tag names to lowercase, since HTML tag names are case-insensitive:

import requests
url = 'http://wikipedia-miner.cms.waikato.ac.nz/services/wikify?source=At%20around%20the%20size%20of%20a%20domestic%20chicken,%20kiwi%20are%20by%20far%20the%20smallest%20living%20ratites%20and%20lay%20the%20largest%20egg%20in%20relation%20to%20their%20body%20size%20of%20any%20species%20of%20bird%20in%20the%20world'
r = requests.get(url)
content = r.content

import lxml.html as LH
# Parsing as HTML normalizes tag names to lowercase.
html_root = LH.fromstring(content)
print(LH.tostring(html_root))

yields

...

But if you use lxml.etree to parse the content as XML, the case of the tag names is preserved:

import lxml.etree as ET
xml_root = ET.fromstring(content)
print(ET.tostring(xml_root))

yields

...

The content looks like XML, not HTML, so you should parse it with lxml.etree and keep the original case in the XPath:

print(xml_root.xpath('//detectedTopic/@id'))
['17362', '21780446', '160220', '37402']
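
Note that the XPath above uses /@id where the question used //@id. The sketch below shows the difference on a small made-up document. The nested ref element and its id are invented for illustration and are not part of the service's real response.

```python
import lxml.etree as ET

# Hypothetical XML: the <ref> child and its id are assumptions for this demo.
SAMPLE = b"""<message>
  <detectedTopic id="17362"><ref id="inner-1"></ref></detectedTopic>
  <detectedTopic id="160220"></detectedTopic>
</message>"""

root = ET.fromstring(SAMPLE)

# /@id selects only the id attribute of each detectedTopic element itself.
direct_ids = root.xpath('//detectedTopic/@id')

# //@id also selects id attributes on any descendant of detectedTopic.
all_ids = root.xpath('//detectedTopic//@id')

print(direct_ids)
print(all_ids)
```

If the topics have no descendants carrying an id, both expressions return the same values, but /@id states the intent more precisely.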

If the content is parsed as HTML instead, the tag name in the XPath needs to be lowercase:

print(html_root.xpath('//detectedtopic/@id'))
['17362', '21780446', '160220', '37402']
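
To reproduce the behaviour without calling the service, here is a minimal sketch that runs both parsers on an inline string. The sample markup is an assumption, modelled only on the tag and attribute names mentioned above, and is not the real response.

```python
import lxml.etree as ET
import lxml.html as LH

# Hypothetical stand-in for the service response (structure assumed).
SAMPLE = b"""<message>
  <detectedTopics>
    <detectedTopic id="17362"></detectedTopic>
    <detectedTopic id="21780446"></detectedTopic>
  </detectedTopics>
</message>"""

xml_root = ET.fromstring(SAMPLE)   # XML parser: tag case is preserved
html_root = LH.fromstring(SAMPLE)  # HTML parser: tag names become lowercase

# Mixed-case XPath matches the XML tree only.
xml_ids = xml_root.xpath('//detectedTopic/@id')
html_mixed = html_root.xpath('//detectedTopic/@id')

# Lowercase XPath is required against the HTML tree.
html_lower = html_root.xpath('//detectedtopic/@id')

print(xml_ids)     # ['17362', '21780446']
print(html_mixed)  # []
print(html_lower)  # ['17362', '21780446']
```

The empty list in the middle is what the question ran into: the XPath was correct for the document as served, but not for the tree that lxml.html built from it.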
