Reputation: 963
I am using lxml to parse an xsd file and am looking for an easy way to remove the URL namespace attached to each element name. Here's the xsd file:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="rootelement">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element minOccurs="1" maxOccurs="1" name="element1">
          <xs:complexType>
            <xs:all>
              <xs:element name="subelement1" type="xs:string" />
              <xs:element name="subelement2" type="xs:integer" />
              <xs:element name="subelement3" type="xs:dateTime" />
            </xs:all>
            <xs:attribute name="id" type="xs:integer" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:choice>
      <xs:attribute fixed="2.0" name="version" type="xs:decimal" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
and using this code:
from lxml import etree
parser = etree.XMLParser()
data = etree.parse(open("testschema.xsd"),parser)
root = data.getroot()
rootelement = root.getchildren()[0]
rootelementattribute = rootelement.getchildren()[0].getchildren()[1]
print "root element tags"
print rootelement[0].tag
print rootelementattribute.tag
elements = rootelement.getchildren()[0].getchildren()[0].getchildren()
elements_attribute = elements[0].getchildren()[0].getchildren()[1]
print "element tags"
print elements[0].tag
print elements_attribute.tag
subelements = elements[0].getchildren()[0].getchildren()[0].getchildren()
print "subelements"
print subelements
I get the following output
root element tags
{http://www.w3.org/2001/XMLSchema}complexType
{http://www.w3.org/2001/XMLSchema}attribute
element tags
{http://www.w3.org/2001/XMLSchema}element
{http://www.w3.org/2001/XMLSchema}attribute
subelements
[<Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb16e0>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb1780>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb17d0>]
I don't want "{http://www.w3.org/2001/XMLSchema}" to appear at all when I pull the tag data (altering the xsd file is not an option). The reason I need the xsd tag info is that I am using it to validate column names from a series of flat files. At the "element" level I am pulling multiple elements, as well as their subelements, and I am using a dictionary to validate the columns. Also, any suggestions on improving the code above would be greatly appreciated, such as a way to use fewer "getchildren" calls, or just to make it more organized.
Upvotes: 2
Views: 5169
Reputation: 2314
etree.XMLParser(ns_clean=True) did not work for me (it only cleans up redundant namespace declarations; it does not remove the namespace from tag names). So I took the namespace URI from root.nsmap, wrapped it in braces, and replaced that with an empty string:
print rootelement[0].tag.replace('{%s}' %root.nsmap['xs'], '')
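If the prefix is unwanted everywhere rather than on a single tag, the same idea can be applied once to the whole tree right after parsing. The following is only a sketch under a few assumptions: it uses Python 3 syntax, the helper name strip_namespaces and the trimmed-down schema are made up for illustration, and the stdlib fallback import is there only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib API is compatible for this snippet
    import xml.etree.ElementTree as etree

# Trimmed-down stand-in for testschema.xsd (illustrative only).
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="rootelement">
    <xs:complexType />
  </xs:element>
</xs:schema>"""


def strip_namespaces(root):
    """Rewrite every element tag in place, dropping any '{uri}' prefix."""
    for el in root.iter():
        # In lxml, comments and processing instructions have a non-string tag.
        if isinstance(el.tag, str):
            el.tag = el.tag.rpartition("}")[2]
    return root


root = strip_namespaces(etree.fromstring(XSD))
tags = [el.tag for el in root.iter()]
print(tags)
```

This only changes the parsed tree in memory, so the xsd file itself stays untouched. Attribute values such as type="xs:string" are plain strings and keep their "xs:" prefix.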
Upvotes: 1
Reputation: 28666
I'd use:
print elem.tag.split('}')[-1]
But you could also use the XPath function local-name():
print elem.xpath('local-name()')
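To avoid repeating the split at every call site, it can be wrapped in a small helper. This is a minimal sketch in Python 3 syntax; the name local_name is hypothetical, and it works on the tag string only, so it also copes with tags that carry no namespace.

```python
def local_name(tag):
    """Return the part of a tag after the '{namespace}' prefix, if any."""
    # rpartition returns ('', '', tag) when there is no '}', so plain tags pass through.
    return tag.rpartition("}")[2]


print(local_name("{http://www.w3.org/2001/XMLSchema}complexType"))
print(local_name("attribute"))
```

With an element in hand, the call would be local_name(elem.tag).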
As for fewer getchildren() calls: just leave them out. getchildren() is a deprecated way of making a list of the direct children (you should just use list(elem) instead if you actually want this).
You can iterate over an element, or index it, directly. For example, rootelement[0] will give you the first child element of rootelement (and more efficiently than rootelement.getchildren()[0], because that acts like list(rootelement) and creates a new list first).
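For the column dictionary mentioned in the question, the chained child lookups can probably be replaced by path queries with a prefix map. This is a rough sketch, not a drop-in replacement: it assumes Python 3, uses a shortened copy of the schema, and the names NS and columns are made up for illustration. The stdlib fallback import is there only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: the stdlib API is compatible for this snippet
    import xml.etree.ElementTree as etree

# Prefix map for the queries; the prefix only has to match the paths below.
NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Shortened copy of the schema from the question (illustrative only).
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="rootelement">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="element1">
          <xs:complexType>
            <xs:all>
              <xs:element name="subelement1" type="xs:string" />
              <xs:element name="subelement2" type="xs:integer" />
            </xs:all>
            <xs:attribute name="id" type="xs:integer" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

root = etree.fromstring(XSD)

# Map each "element"-level name to the names of its subelements.
columns = {}
for element in root.findall("xs:element/xs:complexType/xs:choice/xs:element", NS):
    subelements = element.findall("xs:complexType/xs:all/xs:element", NS)
    columns[element.get("name")] = [sub.get("name") for sub in subelements]

print(columns)
```

Because the lookups read the name attributes rather than the tags, the namespace never shows up in the resulting dictionary.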
Upvotes: 3
Reputation:
If the URI might change in the future (for some unknown reason or you're truly paranoid), consider the following:
print "root element tags"
tag, nsmap, prefix = rootelement[0].tag, rootelement[0].nsmap, rootelement[0].prefix
tag = tag[len(nsmap[prefix]) + 2:]
print tag
This is a very unlikely case, but who knows?
Upvotes: 0
Reputation: 53819
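The easiest thing to do is to just use string slicing to remove the namespace prefix (34 is the length of "{http://www.w3.org/2001/XMLSchema}", so this breaks if the URI ever changes):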
>>> print rootelement[0].tag[34:]
complexType
Upvotes: 0