Reputation: 581
Consider the HTML as
<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>
I am using lxml (Python) with XPath, trying to extract the content of both the title tag and the link tag. The code is:
page=urllib.urlopen(url).read()
x=etree.HTML(page)
titles=x.xpath('//item/title/text()')
links=x.xpath('//item/link/text()')
But both queries return an empty list. However, this returns a link element:
links=x.xpath('//item/link') #returns <Element link at 0xb6b0ae0c>
Can anyone suggest how to extract the urls from the link tag?
Upvotes: 0
Views: 1827
Reputation: 1121834
You are using the wrong parser for the job; you don't have HTML, you have XML.
A proper HTML parser will ignore the contents of a <link>
tag, because in the HTML specification that tag is always empty.
Use the etree.parse()
function to parse your URL stream (no separate .read()
call needed):
response = urllib.urlopen(url)
tree = etree.parse(response)
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')
You could also use etree.fromstring(page)
but leaving the reading to the parser is easier.
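As a minimal sketch (using the XML snippet from the question in place of a live URL), parsing the same content with the XML parser preserves the text inside <link>:

```python
from lxml import etree

# Sample content copied from the question; the real feed is assumed
# to be RSS-like XML rather than HTML.
content = b"""<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

# etree.fromstring() uses the XML parser, so <link> keeps its text child.
root = etree.fromstring(content)
titles = root.xpath('//item/title/text()')
links = root.xpath('//item/link/text()')
print(titles)  # ['this is the title']
print(links)   # ['www.linktoawebsite.com']
```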
Upvotes: 1
Reputation: 10213
When the content is parsed with etree's HTML parser, the <link>
tag gets closed immediately, so no text value is present for the link tag.
Demo:
>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>
According to HTML, this is not a valid use of the tag. The link
tag is normally structured like:
<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>
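As the serialized output above shows, the URL text is not lost entirely: the HTML parser moves it after the self-closed <link/>, where lxml stores it as the element's tail. A minimal sketch, if you had to stick with the HTML parser:

```python
from lxml import etree

# Same sample content as in the demo above.
content = """<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

x = etree.HTML(content)
# The HTML parser closes <link> immediately, so the URL text ends up
# as the element's .tail rather than its .text.
urls = [el.tail.strip() for el in x.xpath('//item/link')]
print(urls)  # ['www.linktoawebsite.com']
```

Parsing the feed as XML, as the other answer suggests, is still the cleaner fix.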
Upvotes: 1