Dmitry Bubnenkov

Reputation: 9869

XPath works on one file but not on another

I need to extract the id value from an XML file. I wrote the code below. It works on a simple example, but on the real XML find() returns None. Code:

from lxml import etree

parser = etree.XMLParser(ns_clean=True)
tree = etree.parse('real.xml', parser)
#tree = etree.parse('test.xml', parser)

#print(dir(tree.find("//id")))
print(tree.find("//id").text)

test.xml:

<aa>
<dd></dd>
<foo>
<bar>qqq</bar>
<id>123</id>
</foo>
</aa>

real.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:export xmlns:ns5="http://zakupki.gov.ru/oos/CPtypes/1" xmlns="http://zakupki.gov.ru/oos/types/1" xmlns:ns6="http://zakupki.gov.ru/oos/pprf615types/1" xmlns:ns7="http://zakupki.gov.ru/oos/EPtypes/1" xmlns:ns8="http://zakupki.gov.ru/oos/printform/1" xmlns:ns9="http://zakupki.gov.ru/oos/control99/1" xmlns:ns2="http://zakupki.gov.ru/oos/export/1" xmlns:ns3="http://zakupki.gov.ru/oos/common/1" xmlns:ns4="http://zakupki.gov.ru/oos/base/1">
    <ns2:fcsNotification111 schemeVersion="9.0">
        <id>18934116</id>
        <purchaseNumber>0373100043519000001</purchaseNumber>
        <docPublishDate>2019-01-11T11:06:05.465+03:00</docPublishDate>
        <docNumber>№0373100043519000001</docNumber>
        <href>http://zakupki.gov.ru/epz/order/notice/inm111/view/common-info.html?regNumber=0373100043519000001</href>
        <printForm>
            <url>http://zakupki.gov.ru/epz/order/notice/printForm/viewXml.html?noticeId=18934116</url>
            <signature type="CAdES-BES"></signature>
        </printForm>
        <purchaseObjectInfo>Теплоснабжение</purchaseObjectInfo>
        <purchaseResponsible>
            <responsibleOrg>
                <regNum>03731000435</regNum>
                <consRegistryNum>001Ч1823</consRegistryNum>
                <fullName>ФЕДЕРАЛЬНОЕ ГОСУДАРСТВЕННОЕ БЮДЖЕТНОЕ УЧРЕЖДЕНИЕ НАУКИ ИНСТИТУТ ВОДНЫХ ПРОБЛЕМ РОССИЙСКОЙ АКАДЕМИИ НАУК</fullName>
                <postAddress>Российская Федерация, 119333, Москва, УЛ ГУБКИНА, 3</postAddress>
                <factAddress>Российская Федерация, 119333, Москва, УЛ ГУБКИНА, 3</factAddress>
                <INN>7701003690</INN>
                <KPP>773601001</KPP>
            </responsibleOrg>
            <responsibleRole>CU</responsibleRole>
        </purchaseResponsible>
        <placingWay>
            <code>EP111</code>
            <name>Закупка у единственного поставщика (подрядчика, исполнителя) с учетом положений ст. 111 Закона № 44-ФЗ</name>
        </placingWay>
        <lots>
            <lot>
                <lotNumber>1</lotNumber>
                <maxPrice>400000</maxPrice>
                <currency>
                    <code>RUB</code>
                    <name>Российский рубль</name>
                </currency>
                <OKPD2>
                    <code>35.30.11.111</code>
                </OKPD2>
                <purchaseCode>191770100369077360100100100013530000</purchaseCode>
                <tenderPlanInfo>
                    <plan2017Number>2019037310004350010001</plan2017Number>
                    <position2017Number>2019037310004350010000300001</position2017Number>
                </tenderPlanInfo>
                <mustPublicDiscussion>false</mustPublicDiscussion>
            </lot>
        </lots>
        <particularsActProcurement>п.8, ч.1, ст.93 44ФЗ</particularsActProcurement>
    </ns2:fcsNotification111>
</ns2:export>

Upvotes: 0

Views: 73

Answers (2)

Daniel Haley

Reputation: 52878

To add to Mads' answer: instead of using local-name() (which works only with .xpath(), not with .find()), you can bind the namespace to a prefix and use that prefix in your XPath:

from lxml import etree

parser = etree.XMLParser(ns_clean=True)
tree = etree.parse("real.xml", parser)
# tree = etree.parse('test.xml', parser)

ns = {"t1": "http://zakupki.gov.ru/oos/types/1"}

print(tree.find("//t1:id", namespaces=ns).text)

More info: https://lxml.de/tutorial.html#namespaces

Also, ns_clean=True in your XMLParser only cleans up redundant namespace declarations; it doesn't remove namespaces.
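
The prefix binding shown above is one of two equivalent spellings: the namespace URI can also be written straight into the path in {uri}name form. The following is only a sketch under stated assumptions. It uses the standard library's xml.etree.ElementTree so that it runs without third-party packages, and the inline document is a hypothetical, trimmed-down stand-in for real.xml. The prefix t1 is arbitrary; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for real.xml (same default namespace).
XML = (
    '<ns2:export xmlns="http://zakupki.gov.ru/oos/types/1" '
    'xmlns:ns2="http://zakupki.gov.ru/oos/export/1">'
    '<ns2:fcsNotification111 schemeVersion="9.0">'
    '<id>18934116</id>'
    '</ns2:fcsNotification111>'
    '</ns2:export>'
)

root = ET.fromstring(XML)

# Spelling 1: map a prefix of your choice to the namespace URI.
ns = {"t1": "http://zakupki.gov.ru/oos/types/1"}
by_prefix = root.find(".//t1:id", namespaces=ns)

# Spelling 2: put the URI directly into the path in {uri}name form.
by_uri = root.find(".//{http://zakupki.gov.ru/oos/types/1}id")

print(by_prefix.text, by_uri.text)
```

Both lookups select the same element. The prefix form tends to read better once a path has several namespaced steps.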

Upvotes: 1

Mads Hansen

Reputation: 66783

It is easy to overlook, but in the second document the <id> element is bound to the namespace http://zakupki.gov.ru/oos/types/1.

If you look at the first element, you will see that namespace is declared without a prefix: xmlns="http://zakupki.gov.ru/oos/types/1"
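
The effect of that default declaration can be made visible by printing the element tags. This is a minimal sketch, not the asker's code: it uses the standard library's xml.etree.ElementTree (chosen only so it runs without lxml) and a hypothetical, trimmed-down stand-in for real.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for real.xml (same default namespace).
XML = (
    '<ns2:export xmlns="http://zakupki.gov.ru/oos/types/1" '
    'xmlns:ns2="http://zakupki.gov.ru/oos/export/1">'
    '<ns2:fcsNotification111 schemeVersion="9.0">'
    '<id>18934116</id>'
    '</ns2:fcsNotification111>'
    '</ns2:export>'
)

root = ET.fromstring(XML)

# Each element is stored under its namespace-qualified name, "{uri}local".
tags = [el.tag for el in root.iter()]

# A bare name only matches elements that are in no namespace at all.
unqualified = root.find(".//id")

print(tags)
print(unqualified)
```

The un-prefixed id element is reported as {http://zakupki.gov.ru/oos/types/1}id, and the search for a plain id yields None, which is the behaviour described in the question.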

If you want to select any element with a local-name() of id regardless of what the namespace is, you could change your XPath to:

//*[local-name() = 'id']

and with XPath 2.0 or greater, you could use a wildcard for the namespace:

//*:id
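
Note that lxml's .xpath() implements XPath 1.0, so the //*:id form is not available there. A rough analogue, offered as a swapped-in technique rather than real XPath 2.0, is the {*} namespace wildcard that the standard library's xml.etree.ElementTree accepts in its path syntax from Python 3.8 onward. The sketch below assumes Python 3.8+ and uses hypothetical, trimmed-down stand-ins for both files.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for real.xml: <id> sits in the default namespace.
REAL = (
    '<ns2:export xmlns="http://zakupki.gov.ru/oos/types/1" '
    'xmlns:ns2="http://zakupki.gov.ru/oos/export/1">'
    '<ns2:fcsNotification111><id>18934116</id></ns2:fcsNotification111>'
    '</ns2:export>'
)

# Stand-in for test.xml: no namespaces at all.
TEST = '<aa><dd></dd><foo><bar>qqq</bar><id>123</id></foo></aa>'

def first_id(xml_text):
    """Return the text of the first <id> element, whatever its namespace."""
    root = ET.fromstring(xml_text)
    # "{*}id" matches elements named id in any namespace, or in none.
    found = root.find(".//{*}id")
    return None if found is None else found.text

real_id = first_id(REAL)
test_id = first_id(TEST)
print(real_id, test_id)
```

Because the wildcard ignores the namespace entirely, it would also match an id element from an unrelated vocabulary, so the explicit namespace binding is the safer choice when the schema is known.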

Upvotes: 1
