Why does this parser not find the contents of the XML tag that uses a namespace prefix?

Question

I have this XML code, drawn from this link:



  
    
      ‘This Did Not Go Well’: Inside PG&E’s Blackout Control Room
      Ivan Penn
      Sat, 12 Oct 2019 17:03:11 +0000

When I try to parse it using lxml and following the documentation for xpath and XML namespaces, the parser finds the title (which doesn't use a namespace) but not the authors/creators, which does:

from lxml import html

xml = """


  
    
      ‘This Did Not Go Well’: Inside PG&E’s Blackout Control Room
      Ivan Penn
      Sat, 12 Oct 2019 17:03:11 +0000
    
  

"""


rss = html.fromstring(xml)
items = rss.xpath("//item")
for item in items:
    title = item.xpath("title")[0].text_content().strip()
    print(title)

    ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
    authors = item.xpath("dc:creator", namespaces = ns)
    print(authors)

This code prints:

This Did Not Go Well’: Inside PG&E’s Blackout Control Room []

Since it finds the contents of the title tag correctly I think it's finding the individual tags. Is there something wrong with how I'm passing the namespace to xpath?

EDIT: The result is the same whether or not I use the trailing slash, i.e.

ns = {"dc" : "http://purl.org/dc/elements/1.1/"}
ns = {"dc" : "http://purl.org/dc/elements/1.1"}

mzjn · Accepted Answer

The HTML parser ignores namespaces. This is the last sentence in the Running HTML doctests section in the lxml documentation:

The HTML parser notably ignores namespaces and some other XMLisms.

Another part of the documentation says:

Also note that the HTML parser is meant to parse HTML documents. For XHTML documents, use the XML parser, which is namespace aware.

It will work if you change

authors = item.xpath("dc:creator", namespaces = ns)

to

authors = item.xpath("creator")

But since RSS is not HTML, consider using the XML parser (from lxml import etree).

Why does this parser not find the contents of the XML tag that uses a namespace prefix?

Answers (1)

Related Questions