nikhil

Reputation: 1748

Parse XML with namespace attribute changing in Python

I am making a request to a URL, and in the XML response the namespace in the xmlns attribute changes from time to time. As a result, finding an element returns None when I hardcode the namespace.

For instance I get the following XML:

<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>

Look at the xmlns attribute. The URI is otherwise identical, but the '2012/06' part differs between responses. I have the following Python script; see the line ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}. I can't hardcode the namespace like that. Are there any alternatives, such as regular expressions, for mapping the namespace? Only the date part changes: it is 2013/05 in some responses, 2012/04 in others, and so on.

import xml.etree.ElementTree as ET
import requests

def fetch_nuget_spec(self, versioned_package):
    name = versioned_package.package.name.lower()
    version = versioned_package.version.lower()
    url = f'https://api.nuget.org/v3-flatcontainer/{name}/{version}/{name}.nuspec'
    response = requests.get(url)
    metadata = ET.fromstring(response.content)
    # Hardcoded namespace: the lookups below return None when the date part differs
    ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
    license = metadata.find('./nuspec:metadata/nuspec:license', ns)
    if license is None:
        license_url = metadata.find('./nuspec:metadata/nuspec:licenseUrl', ns)
        if license_url is None:
            return {'license': 'Not Found'}
        return {'license': license_url.text}
    if not license.text:
        print('Empty license element')
    return {'license': license.text}

  

Upvotes: 0

Views: 92

Answers (4)

Hermann12

Reputation: 3501

Without any additional module, using only xml.etree.ElementTree:

import xml.etree.ElementTree as ET

tree = ET.parse('xml_str.xml')
root = tree.getroot()

ns = dict([node for _, node in ET.iterparse('xml_str.xml', events=['start-ns'])])
print(ns)

# The '' prefix maps unprefixed names in the path to the default namespace (Python 3.8 or later)
licenseUrl = root.find(".//licenseUrl", ns).text
print("LicenseUrl: ", licenseUrl)

Output:

{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt

Option 2, if parsing time is important:


import xml.etree.ElementTree as ET

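nsmap = {}
for event, node in ET.iterparse('xml_str.xml', events=['start-ns', 'end']):
    if event == 'start-ns':
        # node is a (prefix, uri) tuple; the default namespace has prefix ''
        prefix, uri = node
        nsmap[prefix] = uri
        print(nsmap)
    elif node.tag == f"{{{nsmap.get('', '')}}}licenseUrl":
        # 'end' event for the licenseUrl element in the default namespace
        print(node.text)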

Output:


{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
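Since the question receives the XML as bytes from requests rather than from a file, the same single-pass idea can be applied to an in-memory buffer. This is only a minimal sketch, assuming the document declares one default namespace on the root element; find_license_url and the trimmed SAMPLE bytes are hypothetical names for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for response.content (illustrative only)
SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
</metadata>
</package>"""


def find_license_url(xml_bytes):
    """Return the licenseUrl text, or None, whatever the default namespace is."""
    default_ns = ""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns", "end"])
    for event, node in events:
        if event == "start-ns":
            prefix, uri = node
            if prefix == "":
                default_ns = uri  # remember the default namespace as declared
            continue
        # 'end' event: build the qualified tag we are looking for
        wanted = f"{{{default_ns}}}licenseUrl" if default_ns else "licenseUrl"
        if node.tag == wanted:
            return node.text
    return None


print(find_license_url(SAMPLE))
```

In the question's function this would be called as find_license_url(response.content).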

Upvotes: 1

Michael Kay

Reputation: 163468

You need to be aware that the reason they put the date in the namespace URI is that the format of the XML can change from one version to another, so if you're going to write code that works with any version, you need to make sure it is tested properly against multiple versions. (Generally people advise against versioning namespace URIs, for exactly the reasons you are seeing, but not everyone follows that advice, and that appears to include Microsoft).

My own preference when trying to handle multiple versions of an input document format is to insert a normalisation step into your processing pipeline: this should transform the incoming documents into a common format so that the rest of your processing doesn't need to worry about the variations. As well as changing the namespaces, this phase could handle any other differences you encounter in the formats.
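Such a normalisation step could look roughly like the following in Python. It is a sketch under stated assumptions, not a definitive implementation: CANONICAL_NS is simply the URI hardcoded in the question, the prefix/suffix check assumes that only the date part of the URI varies (as the question says), and normalise_namespace is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace the rest of the pipeline is written against
CANONICAL_NS = "http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd"
# Assumption: all versions share this prefix and end in /nuspec.xsd
NUSPEC_PREFIX = "http://schemas.microsoft.com/packaging/"


def normalise_namespace(root):
    """Rewrite every nuspec-namespaced tag in place to use CANONICAL_NS."""
    for elem in root.iter():
        if not isinstance(elem.tag, str) or not elem.tag.startswith("{"):
            continue  # comments, processing instructions, un-namespaced tags
        uri, _, local = elem.tag[1:].partition("}")
        if uri.startswith(NUSPEC_PREFIX) and uri.endswith("/nuspec.xsd"):
            elem.tag = f"{{{CANONICAL_NS}}}{local}"
    return root


xml_text = """<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
</metadata>
</package>"""

root = normalise_namespace(ET.fromstring(xml_text))
ns = {"nuspec": CANONICAL_NS}
found = root.find("./nuspec:metadata/nuspec:licenseUrl", ns)
print(found.text)
```

After this step the downstream lookups can keep using a single fixed namespace mapping, and any other per-version differences can be handled in the same function.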

My other preference is to do as much of the processing as possible in XSLT, and an XSLT step that normalizes the namespace is pretty easy to write, especially if you use XSLT 3.0.

Please don't follow the advice to process XML using regular expressions. It can only lead to tears. For example, if someone posts a nuspec document containing an older namespace commented out, it's very likely to throw your processing off completely.
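To make the commented-out case concrete, here is a small illustration. The document and the 2011/08 URI in the comment are made up for the example; the point is only that a pattern run over the raw text sees the comment, while a parser does not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: an older declaration left behind in a comment
DOC = """<!-- <package xmlns="http://schemas.microsoft.com/packaging/2011/08/nuspec.xsd"> -->
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata><id>SharpZipLib</id></metadata>
</package>"""

# Regex over the raw text: the first match is inside the comment
from_regex = re.search(r'xmlns="([^"]+)"', DOC).group(1)

# Parser: the comment is skipped, so the root tag carries the namespace in effect
root = ET.fromstring(DOC)
from_parser = root.tag[1:].partition("}")[0]

print(from_regex)   # http://schemas.microsoft.com/packaging/2011/08/nuspec.xsd
print(from_parser)  # http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd
```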

Upvotes: 1

LMC

Reputation: 12777

If using lxml is an option, it can list the namespaces declared on the root element like this:

from lxml import etree
doc = etree.parse("tmp.xml")
# get namespaces excluding the default 'xml'
ns = { ('nuspec' if t[0] is None else t[0]): t[1] for t in doc.xpath('/*/namespace::*[name()!="xml"]')}
print(ns)
# {'nuspec': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}

Using both lxml and xml.etree.ElementTree could mean that the document is parsed twice, so if possible use only lxml, which has a more complete XML and XPath implementation.
If that's not possible, ET could be used on the result of the lxml parse:

>>> tree = ET.ElementTree(doc)
>>> tree.find('./nuspec:metadata/nuspec:licenseUrl', ns)
<Element {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}licenseUrl at 0x7fe019ea1cc8>

xml.etree.ElementTree implementation lacks namespace axis support.

Upvotes: 2

Hermann12

Reputation: 3501

Don’t hardcode the namespace. With regex you can find it with:

import xml.etree.ElementTree as ET
import re

xml = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>"""

root = ET.fromstring(xml)

# Find namespace with regex
ns = re.match(r'{.*}', root.tag).group(0)
print("Namespace: ", ns)

licenseUrl = root.find(f".//{ns}licenseUrl").text
print("LicenseUrl: ", licenseUrl)

Output:

Namespace:  {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
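On Python 3.8 or later, ElementTree's path syntax also accepts {*} as a namespace wildcard, which avoids extracting the namespace at all. The following is a minimal sketch applied to the lookup from the question; extract_license is a hypothetical helper name, the XML is trimmed for brevity, and unlike the question's code an empty license element falls through to licenseUrl.

```python
import xml.etree.ElementTree as ET

xml = """<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
</metadata>
</package>"""


def extract_license(root):
    """Prefer <license>, fall back to <licenseUrl>, in any namespace."""
    for name in ("license", "licenseUrl"):
        # {*} matches the tag in any namespace, or none (Python 3.8+)
        node = root.find(f"./{{*}}metadata/{{*}}{name}")
        if node is not None and node.text:
            return {"license": node.text}
    return {"license": "Not Found"}


print(extract_license(ET.fromstring(xml)))
```

Because the wildcard matches every namespace, this is best kept to documents that are already known to be nuspec files.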

Upvotes: 1
