Reputation: 581
Consider the HTML as
<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>
I am using lxml (Python) with XPath, trying to extract the content of both the title tag and the link tag. The code is:
page=urllib.urlopen(url).read()
x=etree.HTML(page)
titles=x.xpath('//item/title/text()')
links=x.xpath('//item/link/text()')
But both queries return an empty list. However, this returns a link element:
links=x.xpath('//item/link') #returns <Element link at 0xb6b0ae0c>
Can anyone suggest how to extract the urls from the link tag?
Upvotes: 0
Views: 1827
Reputation: 1121834
You are using the wrong parser for the job; you don't have HTML, you have XML.
A proper HTML parser will ignore the contents of a <link>
tag, because in the HTML specification that tag is always empty.
Use the etree.parse()
function to parse your URL stream (no separate .read()
call needed):
response = urllib.urlopen(url)
tree = etree.parse(response)
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')
You could also use etree.fromstring(page)
but leaving the reading to the parser is easier.
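As a minimal sketch (using the XML snippet from the question in place of a live URL), parsing the same content with the XML parser preserves the text inside <link>:

```python
from lxml import etree

# Sample content copied from the question; the real feed is assumed
# to be RSS-like XML rather than HTML.
content = b"""<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

# etree.fromstring() uses the XML parser, so <link> keeps its text child.
root = etree.fromstring(content)
titles = root.xpath('//item/title/text()')
links = root.xpath('//item/link/text()')
print(titles)  # ['this is the title']
print(links)   # ['www.linktoawebsite.com']
```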
Upvotes: 1
Reputation: 10213
When the content is parsed with etree's HTML parser, the <link>
tag gets closed immediately, so no text value is present for the link tag.
Demo:
>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>
According to HTML, this is not a valid use of the tag. The link
tag is normally structured like:
<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>
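As the serialized output above shows, the URL text is not lost entirely: the HTML parser moves it after the self-closed <link/>, where lxml stores it as the element's tail. A minimal sketch, if you had to stick with the HTML parser:

```python
from lxml import etree

# Same sample content as in the demo above.
content = """<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

x = etree.HTML(content)
# The HTML parser closes <link> immediately, so the URL text ends up
# as the element's .tail rather than its .text.
urls = [el.tail.strip() for el in x.xpath('//item/link')]
print(urls)  # ['www.linktoawebsite.com']
```

Parsing the feed as XML, as the other answer suggests, is still the cleaner fix.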
Upvotes: 1