Reputation: 1748
I am making a request to a URL, and in the XML response the namespace in the xmlns attribute changes from time to time. As a result, finding an element returns None when I hardcode the namespace.
For instance I get the following XML:
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>
Note the xmlns attribute. Most of the URI stays the same, but the date part (here '2012/06') differs between responses. I have the following Python script; see the line ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}. I can't hardcode the namespace like that. Are there any alternatives, such as using regular expressions, to map the namespace? Only the date part changes: it is 2013/05 in some responses, 2012/04 in others, and so on.
def fetch_nuget_spec(self, versioned_package):
    name = versioned_package.package.name.lower()
    version = versioned_package.version.lower()
    url = f'https://api.nuget.org/v3-flatcontainer/{name}/{version}/{name}.nuspec'
    response = requests.get(url)
    metadata = ET.fromstring(response.content)
    ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
    license = metadata.find('./nuspec:metadata/nuspec:license', ns)
    if license is None:
        license_url = metadata.find('./nuspec:metadata/nuspec:licenseUrl', ns)
        if license_url is None:
            return {'license': 'Not Found'}
        return {'license': license_url.text}
    else:
        if len(license.text) == 0:
            print('SHIT')
        return {'license': license.text}
Upvotes: 0
Views: 92
Reputation: 3501
Without another module, using only xml.etree.ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('xml_str.xml')
root = tree.getroot()
ns = dict([node for _, node in ET.iterparse('xml_str.xml', events=['start-ns'])])
print(ns)
licenseUrl = root.find(".//licenseUrl", ns).text
print("LicenseUrl: ", licenseUrl)
Output:
{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
LicenseUrl: https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
Option 2, if parsing time is important:
import xml.etree.ElementTree as ET
nsmap = {}
for event, node in ET.iterparse('xml_str.xml', events=['start-ns', 'end']):
    if event == 'start-ns':
        ns, url = node
        nsmap[ns] = url
        print(nsmap)
    if event == 'end' and node.tag == f"{{{url}}}licenseUrl":
        print(node.text)
Output:
{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
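Since the question fetches the XML over HTTP rather than reading a file, the same start-ns idea can also be applied to bytes held in memory. The following is only a minimal sketch: the helper name parse_with_namespaces and the shortened sample document are made up for illustration and do not come from the answer above.

```python
import io
import xml.etree.ElementTree as ET


def parse_with_namespaces(xml_bytes):
    """Parse XML bytes in one pass; return (root, {prefix: uri})."""
    nsmap = {}
    root = None
    events = ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns", "start"])
    for event, payload in events:
        if event == "start-ns":
            prefix, uri = payload
            # Keep the first declaration seen for each prefix
            nsmap.setdefault(prefix, uri)
        elif root is None:
            # The first "start" event belongs to the document root
            root = payload
    return root, nsmap


# Hypothetical, shortened stand-in for response.content
sample = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata>
    <licenseUrl>https://example.com/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""

root, nsmap = parse_with_namespaces(sample)
# The default namespace is reported under the empty prefix;
# map it to an explicit "nuspec" prefix for use in find()
ns = {"nuspec": nsmap[""]}
license_url = root.find("./nuspec:metadata/nuspec:licenseUrl", ns)
print(license_url.text)
```

Mapping the discovered URI to an explicit prefix keeps the original find() paths from the question unchanged, whatever date the response happens to carry.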
Upvotes: 1
Reputation: 163468
You need to be aware that the reason they put the date in the namespace URI is that the format of the XML can change from one version to another, so if you're going to write code that works with any version, you need to make sure it is tested properly against multiple versions. (Generally people advise against versioning namespace URIs, for exactly the reasons you are seeing, but not everyone follows that advice, and that appears to include Microsoft).
My own preference when trying to handle multiple versions of an input document format is to insert a normalisation step into your processing pipeline: this should transform the incoming documents into a common format so that the rest of your processing doesn't need to worry about the variations. As well as changing the namespaces, this phase could handle any other differences you encounter in the formats.
My other preference is to do as much of the processing as possible in XSLT, and an XSLT step that normalizes the namespace is pretty easy to write, especially if you use XSLT 3.0.
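For readers who are not using XSLT, the normalisation step could also be sketched in plain Python by rewriting every element tag into one canonical namespace right after parsing. This is an illustration under assumptions: the function name normalise_namespace, the choice of canonical URI, and the rule that only the dated middle part of the URI varies are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed canonical target and the fixed parts around the varying date
CANONICAL_NS = "http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd"
NS_PREFIX = "http://schemas.microsoft.com/packaging/"
NS_SUFFIX = "/nuspec.xsd"


def normalise_namespace(root):
    """Rewrite any dated nuspec namespace to CANONICAL_NS, in place."""
    for elem in root.iter():
        tag = elem.tag
        # Skip non-element nodes and elements without a namespace
        if not isinstance(tag, str) or not tag.startswith("{"):
            continue
        # Tags use Clark notation: {uri}localname
        uri, _, local = tag[1:].partition("}")
        if uri.startswith(NS_PREFIX) and uri.endswith(NS_SUFFIX):
            elem.tag = f"{{{CANONICAL_NS}}}{local}"
    return root


doc = ET.fromstring(
    '<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">'
    "<metadata><licenseUrl>https://example.com/LICENSE.txt</licenseUrl></metadata>"
    "</package>"
)
normalise_namespace(doc)

# Downstream code can now rely on one fixed namespace
ns = {"nuspec": CANONICAL_NS}
found = doc.find("./nuspec:metadata/nuspec:licenseUrl", ns)
print(found.text)
```

Note that this operates on the parsed tree, not on the raw text, and that it only papers over the namespace difference; any structural differences between schema versions would still need their own handling in the same step.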
Please don't follow the advice to process XML using regular expressions. It can only lead to tears. For example, if someone posts a nuspec document containing an older namespace commented out, it's very likely to throw your processing completely.
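To illustrate that failure mode, here is a small sketch using an invented document in which an older namespace sits inside a comment. A regex run over the raw text picks up the commented-out value, while the parser reports the namespace that is actually in effect on the root element.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: the first line is a comment holding an older namespace
raw = """<!-- <package xmlns="http://schemas.microsoft.com/packaging/2010/07/nuspec.xsd"> -->
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata><id>Example</id></metadata>
</package>"""

# Naive approach: the first xmlns="..." found anywhere in the text
from_regex = re.search(r'xmlns="([^"]+)"', raw).group(1)

# Parser approach: namespace of the real root element ({uri}localname)
root = ET.fromstring(raw)
from_parser = root.tag[1:].partition("}")[0]

print(from_regex)   # the commented-out 2010/07 URI
print(from_parser)  # the 2013/05 URI actually in effect
```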
Upvotes: 1
Reputation: 12777
If using lxml is an option, it can list the namespaces like this:
from lxml import etree
doc = etree.parse("tmp.xml")
# get namespaces excluding the default 'xml'
ns = { ('nuspec' if t[0] is None else t[0]): t[1] for t in doc.xpath('/*/namespace::*[name()!="xml"]')}
print(ns)
# {'nuspec': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
Using both lxml and xml.etree.ElementTree could mean that the document is parsed twice, so only lxml should be used if possible, since it has a more complete XML and XPath implementation.
If that's not possible, ET could be used on the result of the lxml parsing:
>>> tree = ET.ElementTree(doc)
>>> tree.find('./nuspec:metadata/nuspec:licenseUrl', ns)
<Element {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}licenseUrl at 0x7fe019ea1cc8>
The xml.etree.ElementTree implementation lacks support for the XPath namespace axis.
Upvotes: 2
Reputation: 3501
Don’t hardcode the namespace. With regex you can find it with:
import xml.etree.ElementTree as ET
import re
xml = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>"""
root = ET.fromstring(xml)
# Find namespace with regex
ns = re.match(r'{.*}', root.tag).group(0)
print("Namespace: ", ns)
licenseUrl = root.find(f".//{ns}licenseUrl").text
print("LicenseUrl: ", licenseUrl)
Output:
Namespace: {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}
LicenseUrl: https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
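As a variation on the same idea, and assuming Python 3.8 or later, ElementTree also accepts a {*} wildcard in place of the namespace, so the lookup can ignore the URI altogether. A minimal sketch with an invented, shortened sample document:

```python
import xml.etree.ElementTree as ET

# Invented sample; the date in the namespace could be anything
xml_text = """<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata>
    <licenseUrl>https://example.com/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""

root = ET.fromstring(xml_text)

# {*} matches the tag name in any namespace (Python 3.8+)
node = root.find("./{*}metadata/{*}licenseUrl")

# Guard against a missing element instead of calling .text on None
result = node.text if node is not None else "Not Found"
print(result)
```

The trade-off is that the wildcard matches any namespace at all, so it suits cases where the element names alone are distinctive enough.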
Upvotes: 1