Reputation: 4613
I have this XML code, drawn from this link:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0">
<channel>
<item>
<title>‘This Did Not Go Well’: Inside PG&E’s Blackout Control Room</title>
<dc:creator>Ivan Penn</dc:creator>
<pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
</item>
</channel>
</rss>
When I try to parse it using lxml, following the documentation for xpath and XML namespaces, the parser finds the title (which doesn't use a namespace) but not the author/creator, which does:
from lxml import html
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0">
<channel>
<item>
<title>‘This Did Not Go Well’: Inside PG&E’s Blackout Control Room</title>
<dc:creator>Ivan Penn</dc:creator>
<pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
</item>
</channel>
</rss>
"""
rss = html.fromstring(xml)
items = rss.xpath("//item")
for item in items:
    title = item.xpath("title")[0].text_content().strip()
    print(title)
    ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
    authors = item.xpath("dc:creator", namespaces = ns)
    print(authors)
This code prints:
‘This Did Not Go Well’: Inside PG&E’s Blackout Control Room
[]
Since it finds the contents of the title tag correctly, I think it's finding the individual <item> tags. Is there something wrong with how I'm passing the namespace to xpath?
EDIT: The result is the same whether or not I use the trailing slash, i.e.
ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
ns = {"dc" : "http://purl.org/dc/elements/1.1"}
Upvotes: 2
Views: 180
Reputation: 50947
The HTML parser ignores namespaces. This is the last sentence in the Running HTML doctests section in the lxml documentation:
The HTML parser notably ignores namespaces and some other XMLisms.
Another part of the documentation says:
Also note that the HTML parser is meant to parse HTML documents. For XHTML documents, use the XML parser, which is namespace aware.
It will work if you change
authors = item.xpath("dc:creator", namespaces = ns)
to
authors = item.xpath("creator")
But since RSS is not HTML, consider using the XML parser (from lxml import etree).
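For illustration, here is a minimal sketch of the XML-parser route, not a definitive implementation. It makes a few assumptions that are not in the question: the feed is trimmed to the dc namespace only, the "&" in the title is escaped as "&amp;" (the XML parser rejects a bare "&"), the string starts directly with the XML declaration, and the results list is just a hypothetical helper for collecting output. It uses findall(), which lxml shares with the standard library's ElementTree, so the sketch falls back to the latter if lxml is not installed.

```python
try:
    from lxml import etree  # the namespace-aware XML parser suggested above
except ImportError:
    # Fallback for this sketch only: the standard library exposes the same findall() API
    import xml.etree.ElementTree as etree

# Assumptions: no leading newline before the XML declaration, "&" escaped as "&amp;"
xml = """<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <item>
      <title>‘This Did Not Go Well’: Inside PG&amp;E’s Blackout Control Room</title>
      <dc:creator>Ivan Penn</dc:creator>
      <pubDate>Sat, 12 Oct 2019 17:03:11 +0000</pubDate>
    </item>
  </channel>
</rss>
"""

# The prefix is arbitrary; the URI must match the document's declaration exactly
ns = {"dc": "http://purl.org/dc/elements/1.1/"}

# Pass bytes: lxml refuses a str that carries an encoding declaration
rss = etree.fromstring(xml.encode("utf-8"))

results = []  # hypothetical helper: (title, authors) pairs
for item in rss.iter("item"):
    title = item.findtext("title").strip()
    authors = [el.text for el in item.findall("dc:creator", namespaces=ns)]
    results.append((title, authors))
    print(title)
    print(authors)
```

With lxml's XML parser, the original item.xpath("dc:creator", namespaces=ns) call should match as well. Note that the namespace URI is then compared exactly, so the variant without the trailing slash would not match.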
Upvotes: 2