XPath with LXML Element

Question

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:


	
    
        
            
                
                    Test Title
                
                
                    
                
            
        
        
            
                
                    Test Title 2
                    101
                
                
                    TestAuthEntry
                
                
                    Yes
                
                
                
                    1
                
            
            
                
                    2009
                    2010
                    CLASS
                    ffdsf
                
                This is an abstract piece of text.
                
                    2020
                    UK
                    Test
                    test
                    hello
                    fdsfdsf
                
            
            
                
                    test timemeth
                    test data collector
                    test sampprocess
                    test deviat
                    test collMode
                    
                
            
            
                
                    Test accsPlac
                
                
                    NONE
                
            
            
                122
                12332

I wrote the following code to try and process it:

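from lxml import etree

# Read the raw bytes so lxml can honour any encoding declaration in the file;
# the with-block closes the file even if parsing fails.
with open('/vagrant/out2.xml', 'rb') as f:
    xml_doc = etree.fromstring(f.read())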

From my reading of the lxml XPath docs, I should be able to get the text of a specific element as follows:

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

However, when I run this it returns an empty list.

The only xpath I can get to return something is using a wildcard:

xml_doc.xpath('*')

Which returns [].

I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.

Martin Honnen · Accepted Answer

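You need to take the default namespace into account. In XPath 1.0 an unprefixed name only matches elements in no namespace, so the elements of your document have to be addressed through prefixes bound to their namespace URIs. Instead of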

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

use

xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/', 
        'ddi': 'ddi:codebook:2_5'
    }
)
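
To see the difference in isolation, here is a minimal sketch. The inline sample document is hypothetical: its element names and nesting are assumed from the XPath above and its namespace URIs from the `namespaces` mapping, so your real file may differ in detail. The prefixes `oai` and `ddi` are arbitrary labels chosen in the query; only the URIs have to match the document.

```python
from lxml import etree

# Hypothetical miniature document (an assumption, not the asker's real file):
# both elements declare a default namespace, so no element is in "no namespace".
SAMPLE = b"""<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
  <codeBook xmlns="ddi:codebook:2_5">
    <docDscr>
      <citation>
        <titlStmt>
          <titl>Test Title</titl>
        </titlStmt>
      </citation>
    </docDscr>
  </codeBook>
</metadata>"""

# Prefix-to-URI mapping used only inside the XPath expressions.
NS = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'ddi': 'ddi:codebook:2_5',
}

root = etree.fromstring(SAMPLE)

# Unprefixed steps look for elements in no namespace, so nothing matches.
unprefixed = root.xpath(
    '/metadata/codeBook/docDscr/citation/titlStmt/titl/text()'
)

# Prefixed steps are resolved through NS and match the namespaced elements.
prefixed = root.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation'
    '/ddi:titlStmt/ddi:titl/text()',
    namespaces=NS,
)

# lxml stores the namespace in each tag using {uri}localname notation,
# which is a quick way to check which URIs a document actually uses.
root_tag = root.tag

print(unprefixed)
print(prefixed)
print(root_tag)
```

If you are unsure which namespaces your own document uses, printing `xml_doc.tag` or `xml_doc.nsmap` on the parsed root is a reasonable first check before writing the prefix mapping.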
