LondonAppDev

Reputation: 9663

XPath with LXML Element

I am trying to parse an XML document using lxml etree. The XML doc I am parsing looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/">
    <codeBook version="2.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="ddi:codebook:2_5" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd">
        <docDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title</titl>
                </titlStmt>
                <prodStmt>
                    <prodDate/>
                </prodStmt>
            </citation>
        </docDscr>
        <stdyDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title 2</titl>
                    <IDNo agency="UKDA">101</IDNo>
                </titlStmt>
                <rspStmt>
                    <AuthEnty>TestAuthEntry</AuthEnty>
                </rspStmt>
                <prodStmt>
                    <copyright>Yes</copyright>
                </prodStmt>
                <distStmt/>
                <verStmt>
                    <version date="">1</version>
                </verStmt>
            </citation>
            <stdyInfo>
                <subject>
                    <keyword>2009</keyword>
                    <keyword>2010</keyword>
                    <topcClas>CLASS</topcClas>
                    <topcClas>ffdsf</topcClas>
                </subject>
                <abstract>This is an abstract piece of text.</abstract>
                <sumDscr>
                    <timePrd event="single">2020</timePrd>
                    <nation>UK</nation>
                    <anlyUnit>Test</anlyUnit>
                    <universe>test</universe>
                    <universe>hello</universe>
                    <dataKind>fdsfdsf</dataKind>
                </sumDscr>
            </stdyInfo>
            <method>
                <dataColl>
                    <timeMeth>test timemeth</timeMeth>
                    <dataCollector>test data collector</dataCollector>
                    <sampProc>test sampprocess</sampProc>
                    <deviat>test deviat</deviat>
                    <collMode>test collMode</collMode>
                    <sources/>
                </dataColl>
            </method>
            <dataAccs>
                <setAvail>
                    <accsPlac>Test accsPlac</accsPlac>
                </setAvail>
                <useStmt>
                    <restrctn>NONE</restrctn>
                </useStmt>
            </dataAccs>
            <othrStdyMat>
                <relPubl>122</relPubl>
                <relPubl>12332</relPubl>
            </othrStdyMat>
        </stdyDscr>
    </codeBook>
</metadata>

I wrote the following code to try and process it:

from lxml import etree

# Read the file as bytes so lxml can honour the encoding declaration in the XML prolog.
with open('/vagrant/out2.xml', 'rb') as f:
    xml_str = f.read()

xml_doc = etree.fromstring(xml_str)

From what I understand from the lxml xpath docs, I should be able to get the text from a specific element as follows:

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

However, when I run this, it returns an empty list.

The only xpath I can get to return something is using a wildcard:

xml_doc.xpath('*')

Which returns [<Element {ddi:codebook:2_5}codeBook at 0x7f8da8a413f8>].

I've read through the docs and I'm not understanding what is going wrong with this. Any help is appreciated.

Upvotes: 2

Views: 507

Answers (1)

Martin Honnen

Reputation: 167716

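You need to take the default namespaces into account. In XPath 1.0 an unprefixed name such as metadata only matches an element in no namespace, but your elements are in the namespaces declared with xmlns="..." on metadata and codeBook. So instead of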

xml_doc.xpath('/metadata/codeBook/docDscr/citation/titlStmt/titl/text()')

use

xml_doc.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation/ddi:titlStmt/ddi:titl/text()',
    namespaces={
        'oai': 'http://www.openarchives.org/OAI/2.0/',
        'ddi': 'ddi:codebook:2_5'
    }
)
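
To see the difference in isolation, here is a minimal sketch that runs both queries against a trimmed-down copy of your document. The shortened XML and the names xml_bytes, root and NS are only for illustration. The prefixes oai and ddi are arbitrary labels; only the namespace URIs they map to have to match the document.

```python
from lxml import etree

# Trimmed-down copy of the document from the question (assumption: same
# two default namespaces, most elements left out for brevity).
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://www.openarchives.org/OAI/2.0/">
    <codeBook version="2.5" xmlns="ddi:codebook:2_5">
        <docDscr>
            <citation>
                <titlStmt>
                    <titl>Test Title</titl>
                </titlStmt>
            </citation>
        </docDscr>
    </codeBook>
</metadata>"""

root = etree.fromstring(xml_bytes)

# Prefix-to-URI map; the prefix names are our own choice.
NS = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'ddi': 'ddi:codebook:2_5',
}

# Unprefixed names only match elements in no namespace, so nothing is found.
unprefixed = root.xpath(
    '/metadata/codeBook/docDscr/citation/titlStmt/titl/text()'
)

# Prefixed names are resolved through NS and match the namespaced elements.
prefixed = root.xpath(
    '/oai:metadata/ddi:codeBook/ddi:docDscr/ddi:citation'
    '/ddi:titlStmt/ddi:titl/text()',
    namespaces=NS,
)

print(unprefixed)  # []
print(prefixed)    # ['Test Title']
print(root.tag)    # {http://www.openarchives.org/OAI/2.0/}metadata
```

The {uri}name form printed for root.tag is the same notation you saw in the result of xpath('*'), which is a quick way to check which namespace an element actually belongs to.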

Upvotes: 3
