Taranjeet

Reputation: 581

Extracting the hyperlink from a link tag using XPath

Consider the HTML:

<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>

I am using lxml (Python) with XPath and trying to extract the content of both the title tag and the link tag. The code is:

import urllib
from lxml import etree

page = urllib.urlopen(url).read()
x = etree.HTML(page)
titles = x.xpath('//item/title/text()')
links = x.xpath('//item/link/text()')

But the links query returns an empty list. However, the following returns a link element:

links=x.xpath('//item/link')        #returns <Element link at 0xb6b0ae0c>

Can anyone suggest how to extract the URLs from the link tag?

Upvotes: 0

Views: 1827

Answers (2)

Martijn Pieters

Reputation: 1121834

You are using the wrong parser for the job; you don't have HTML, you have XML.

A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty.

Use the etree.parse() function to parse your URL stream (no separate .read() call needed):

response = urllib.urlopen(url)
tree = etree.parse(response)

titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')

You could also use etree.fromstring(page), but leaving the reading to the parser is easier.
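For completeness, a minimal sketch of the fromstring variant (assuming url and the imports shown in the question):

page = urllib.urlopen(url).read()
tree = etree.fromstring(page)    # XML parser, so <link> keeps its text content

titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')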

Upvotes: 1

Vivek Sable

Reputation: 10213

When the content is parsed with etree.HTML, the <link> tag gets closed immediately, so no text value is present for the link tag.

Demo:

>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>> 
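As a side note (a minimal sketch, reusing the x tree parsed above): lxml keeps the stray text as the tail of the closed <link/> element, so it is still reachable from the HTML-parsed tree:

>>> x.xpath('//item/link')[0].tail
'www.linktoawebsite.com\n'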

According to the HTML specification, this is not a valid use of the <link> tag.

I think the link tag structure is like this:

<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head> 

Upvotes: 1
