kieran

Reputation: 548

Extract value from element when second namespace is used in lxml

I can extract values from elements (using lxml in Python 2.7) when one namespace is used, but I can't figure out how to extract values when a second namespace is involved. I want to extract the value within //cc-cpl:MainClosedCaption/Id, but I keep getting lxml.etree.XPathEvalError: Invalid expression errors. Specifically, the value I'm trying to extract from my sample XML is urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d

Here's a snippet of the XML (from a Digital Cinema Package):

<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
    <Reel>
      <Id>urn:uuid:58cf368f-ed30-40d8-9258-dd7572035b69</Id>
        <MainPicture>
          <Id>urn:uuid:afe91f7a-6451-4b9f-be2e-345f9a28da6d</Id>
        </MainPicture>
        <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
          <Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</Id>
        </cc-cpl:MainClosedCaption>
    </Reel>
</CompositionPlaylist>

Here is an example of code that works:

from lxml import etree
cpl_parse = etree.parse('filename.xml')
pkl_namespace = cpl_parse.xpath('namespace-uri(.)') 
xmluuid =  cpl_parse.xpath('//ns:MainPicture/ns:Id',namespaces={'ns': pkl_namespace})
for i in xmluuid:
    print i.text

When I try to specify the following xpath instead: //ns:MainClosedCaption/ns:Id - I end up with errors.

When I specify the namespace with: pkl_namespace = 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#"'

I receive a lxml.etree.XPathEvalError: Invalid expression error

I know this is a stupid attempt, but the following produced the same error: '//ns:cc-cpl:MainClosed Caption/ns:cc-cpl:Id'

I tried to include the two namespaces in a dictionary as in this answer: https://stackoverflow.com/a/36227869/2188572 , and while I don't get any errors, I end up with no values extracted. Here's my dictionary:

namespaces = {
    'ns': 'http://www.digicine.com/PROTO-ASDCP-CPL-20040511#',
    'ns2': 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#',
}

and my command:

xmluuid =  cpl_parse.xpath('//ns:AssetList/ns2:MainClosedCaption/ns2:Id',namespaces=namespaces)

I also found the question "Extracting nested namespace from a xml using lxml", which deals with exactly the same kind of XML that I'm working on, but that request was to get the namespace URL, not the actual values of elements.

Edit: Using the method from the previous answer to extract the namespace, I tried the following, but got the same errors:

from lxml import etree
import sys
filename = sys.argv[1]

cpl_parse = etree.parse(filename)
pkl_namespace = etree.QName(cpl_parse.find('.//{*}MainClosedCaption')).namespace
print pkl_namespace
xmluuid =  cpl_parse.xpath('//ns:cc-cpl:MainClosedCaption/ns:cc-cpl:Id',namespaces={'ns': pkl_namespace})
for i in xmluuid:
    print i.text

and here is the full traceback:

Traceback (most recent call last):
  File "sub.py", line 8, in <module>
    xmluuid =  cpl_parse.xpath('//ns:cc-cpl:MainClosedCaption/ns:cc-cpl:Id',namespaces={'ns': pkl_namespace})
  File "lxml.etree.pyx", line 2115, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57654)
  File "xpath.pxi", line 370, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:146564)
  File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
  File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression

Upvotes: 2

Views: 792

Answers (1)

phihag

Reputation: 288260

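The Id element inside MainClosedCaption belongs to the 2004 namespace, not the 2007 one. Only an xmlns="..." attribute changes the default namespace; an attribute of the form xmlns:something="..." only declares a prefix, and that namespace applies only to names that explicitly carry the prefix. Separately, an expression such as '//ns:cc-cpl:MainClosedCaption/ns:cc-cpl:Id' is invalid because an XPath name can carry at most one prefix, which is why lxml raises XPathEvalError.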
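One way to check which namespace each element ends up in is to print the tags in Clark notation ({namespace}localname). The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree so that it runs without extra packages (lxml elements expose the same .tag format), and the XML constant is a trimmed copy of the snippet from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's sample (assumption: the real CPL has more elements).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Reel>
    <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
      <Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</Id>
    </cc-cpl:MainClosedCaption>
  </Reel>
</CompositionPlaylist>
"""

root = ET.fromstring(XML)

# Each tag is reported as {namespace-uri}localname, in document order.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

MainClosedCaption is reported with the 2007 URI, while the Id nested inside it still carries the 2004 URI, because it has no prefix and the default namespace has not been redeclared.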

Try this:

from lxml import etree
cpl_parse = etree.parse('filename.xml')
xmluuid = cpl_parse.xpath('//proto2007:MainClosedCaption/proto2004:Id', namespaces={
    'proto2004': 'http://www.digicine.com/PROTO-ASDCP-CPL-20040511#',
    'proto2007': 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#',
})
for i in xmluuid:
    print(i.text)
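For a quick test without a file on disk, the same query can be run against an inline copy of the sample. This is a minimal sketch, assuming lxml is installed; the constants CPL_NS and CC_NS and the trimmed XML string are names introduced here for illustration. It also shows an alternative that avoids the prefix mapping by writing the names in Clark notation with find-style paths.

```python
from lxml import etree

# Hypothetical constant names for the two namespace URIs from the question.
CPL_NS = 'http://www.digicine.com/PROTO-ASDCP-CPL-20040511#'
CC_NS = 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#'

# Trimmed copy of the question's sample; bytes, because lxml rejects
# unicode strings that carry an encoding declaration.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Reel>
    <MainPicture>
      <Id>urn:uuid:afe91f7a-6451-4b9f-be2e-345f9a28da6d</Id>
    </MainPicture>
    <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
      <Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</Id>
    </cc-cpl:MainClosedCaption>
  </Reel>
</CompositionPlaylist>
"""

root = etree.fromstring(XML)

# XPath with one prefix per namespace; the prefixes are arbitrary local names.
ids = root.xpath('//cc:MainClosedCaption/cpl:Id',
                 namespaces={'cpl': CPL_NS, 'cc': CC_NS})
id_texts = [el.text for el in ids]
print(id_texts)

# Alternative: Clark notation with findtext(), no prefix mapping needed.
clark_path = './/{%s}MainClosedCaption/{%s}Id' % (CC_NS, CPL_NS)
clark_text = root.findtext(clark_path)
print(clark_text)
```

Both lookups select only the Id under MainClosedCaption. The prefixes used in the XPath expression do not have to match the ones in the document; only the namespace URIs matter.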

Upvotes: 2
