Reputation: 963

lxml parse xsd file without Schema URL

I am using lxml to parse an xsd file and am looking for an easy way to remove the URL namespace attached to each element name. Here's the xsd file:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="rootelement">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element minOccurs="1" maxOccurs="1" name="element1">
          <xs:complexType>
            <xs:all>
              <xs:element name="subelement1" type="xs:string" />
              <xs:element name="subelement2" type="xs:integer" />
              <xs:element name="subelement3" type="xs:dateTime" />
            </xs:all>
            <xs:attribute name="id" type="xs:integer" use="required" />
          </xs:complexType>
        </xs:element>
       </xs:choice>
      <xs:attribute fixed="2.0" name="version" type="xs:decimal" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>

and using this code:

from lxml import etree

parser = etree.XMLParser()
data = etree.parse(open("testschema.xsd"),parser)
root = data.getroot()
rootelement = root.getchildren()[0]
rootelementattribute = rootelement.getchildren()[0].getchildren()[1]
print "root element tags"
print rootelement[0].tag
print rootelementattribute.tag
elements = rootelement.getchildren()[0].getchildren()[0].getchildren()
elements_attribute = elements[0].getchildren()[0].getchildren()[1]
print "element tags"
print elements[0].tag
print elements_attribute.tag
subelements = elements[0].getchildren()[0].getchildren()[0].getchildren()
print "subelements"
print subelements

I get the following output

root element tags
{http://www.w3.org/2001/XMLSchema}complexType
{http://www.w3.org/2001/XMLSchema}attribute
element tags
{http://www.w3.org/2001/XMLSchema}element
{http://www.w3.org/2001/XMLSchema}attribute
subelements
[<Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb16e0>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb1780>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb17d0>]

I don't want "{http://www.w3.org/2001/XMLSchema}" to appear at all when I pull the tag data (altering the xsd file is not an option). The reason I need the xsd tag info is that I am using this to validate column names from a series of flat files. On the "element" level there are multiple elements that I'm pulling, as well as subelements, which I am using a dictionary to validate columns. Also, any suggestions on improving the code above would be greatly, such as a way to use fewer "getchildren" calls, or just make it more organized.

Upvotes: 2

Answers (4)

mdob

Reputation: 2314

I wonder why etree.XMLParser(ns_clean=True) doesn't work. It had not worked for me so did it getting namespace from root.nsmap between brackets and replacing it with empty string

print rootelement[0].tag.replace('{%s}' %root.nsmap['xs'], '')

Upvotes: 1

Steven

Reputation: 28666

I'd use:

print elem.tag.split('}')[-1]

But you could also use the xpath function local-name():

print elem.xpath('local-name()')

As for fewer getchildren() calls: just leave them out. getchildren() is a deprecated way of making a list of the direct children (you should just use list(elem) instead if you actually want this).

You can iterate over, or use an index on an element directly. For example: rootelement[0] will give you the first child element of rootelement (but more efficient than if you were use rootelement.getchildren()[0], because this would act like list(rootelement) and create a new list first)

Upvotes: 3

user539810

Reputation:

If the URI might change in the future (for some unknown reason or you're truly paranoid), consider the following:

print "root element tags"
tag, nsmap, prefix = rootelement[0].tag, rootelement[0].nsmap, rootelement[0].prefix
tag = tag[len(nsmap[prefix]) + 2:]
print tag

This is a very unlikely case, but who knows?

Upvotes: 0

Zach Kelling

Reputation: 53819

The easiest thing to do is to just use string slicing to remove namespace prefix:

>>> print rootelement[0].tag[34:]
complexType

Upvotes: 0

lxml parse xsd file without Schema URL

Answers (4)

Related Questions