flonk

Reputation: 43

How to validate XML against an XSD schema with lxml, but ignore elements that match a given pattern?

One can use lxml to validate XML files against a given XSD schema.

Is there a way to apply this validation in a less strict sense, ignoring all elements which contain special expressions?

Consider the following example: Say, I have an xml_file:

<foo>99</foo>
<foo>{{var1}}</foo>
<foo>{{var2}}</foo>
<foo>999</foo>

Now, I run a program on this file, which replaces the {{...}}-expressions and produces a new file:

xml_file_new:

<foo>99</foo>
<foo>23</foo>
<foo>42</foo>
<foo>999</foo>

So far, I can use lxml to validate the new XML file as follows:

from lxml import etree
xml_root = etree.parse(xml_file_new)
xsd_root = etree.parse(xsd_file)
schema = etree.XMLSchema(xsd_root)
schema.validate(xml_root)
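
For reference, here is a self-contained sketch of the same check. The inline schema and document are stand-ins for xsd_file and xml_file, and the <root> wrapper element is an assumption made only so the example is well-formed. validate() just returns a boolean; the details of each failure, including the line number, are collected in schema.error_log:

```python
from lxml import etree

# Hypothetical stand-in for xsd_file: <foo> contents are restricted to integers.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical stand-in for the old xml_file, wrapped in an assumed <root>.
XML = b"""<root>
  <foo>99</foo>
  <foo>{{var1}}</foo>
  <foo>999</foo>
</root>"""

schema = etree.XMLSchema(etree.fromstring(XSD))
is_valid = schema.validate(etree.fromstring(XML))

print("Valid:", is_valid)
for error in schema.error_log:
    # Each log entry carries the source line and a message describing the failure.
    print(error.line, error.message)
```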

The relevant point in my example is that the schema restricts the <foo> contents to integers.

The schema cannot be applied to the old xml_file as is, because the {{...}}-expressions are not integers. However, since my program also performs other expensive tasks, I would like to validate the old file in advance, ignoring all lines that contain {{...}}-expressions:

<foo>99</foo>       <!-- should be checked -->
<foo>{{var1}}</foo> <!-- should be ignored -->
<foo>{{var2}}</foo> <!-- should be ignored -->
<foo>999</foo>      <!-- should be checked -->

EDIT: Possible solution approach: One idea would be to define two schemas, a strict one and a relaxed one.

However, to avoid the redundant task of keeping two schemas synchronized, one would need a way to generate the relaxed schema from the strict one automatically. This sounds quite promising, as both schemas have the same structure and differ only in the restrictions on certain element contents. Is there a simple concept in XSD that lets one just "inherit" from a schema and then "attach" additional relaxations to individual elements?

Upvotes: 4

Views: 8487

Answers (2)

Meyer

Reputation: 1712

To answer the edited question, it is possible to compose schemas with the xs:include (and xs:import) mechanism. This way, you can declare common parts in a common schema for reuse, and use dedicated schemas for specialized type definitions, like so:

The common schema that describes the structure. Note that it uses FooType, but does not define it:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Example structure -->
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>

The relaxed schema to validate before the replacement. It includes the components from the common schema, and defines a relaxed FooType:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:include schemaLocation="common.xsd"/>

  <xs:simpleType name="FooType">
    <xs:union memberTypes="xs:integer">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="\{\{.*\}\}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

</xs:schema>

The strict schema to validate after the replacement. It defines the strict version of FooType:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:include schemaLocation="common.xsd"/>

  <xs:simpleType name="FooType">
    <xs:restriction base="xs:integer"/>
  </xs:simpleType>

</xs:schema>

For completeness' sake, there are also alternative ways to do this, for example with xs:redefine (XSD 1.0) or xs:override (XSD 1.1). But these have more complex semantics, and personally I try to avoid them.
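
As a rough sketch of how the three schema documents above could be used from lxml: the example writes them to a temporary directory (the file names relaxed.xsd and strict.xsd are assumptions; common.xsd is the name used in the xs:include) and validates a template document and a filled-in document against both variants. The two sample documents are made up for illustration.

```python
import tempfile
from pathlib import Path

from lxml import etree

XS_OPEN = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'

# The three schema documents; relaxed.xsd and strict.xsd are assumed file names.
SCHEMAS = {
    "common.xsd": XS_OPEN + """
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>""",
    "relaxed.xsd": XS_OPEN + r"""
  <xs:include schemaLocation="common.xsd"/>
  <xs:simpleType name="FooType">
    <xs:union memberTypes="xs:integer">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:pattern value="\{\{.*\}\}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>
</xs:schema>""",
    "strict.xsd": XS_OPEN + """
  <xs:include schemaLocation="common.xsd"/>
  <xs:simpleType name="FooType">
    <xs:restriction base="xs:integer"/>
  </xs:simpleType>
</xs:schema>""",
}


def load_schemas(directory):
    """Write the schema documents to `directory` and compile both variants."""
    directory = Path(directory)
    for name, text in SCHEMAS.items():
        (directory / name).write_text(text, encoding="utf-8")
    # Parsing from a file path gives lxml a base URL, so the relative
    # schemaLocation="common.xsd" can be resolved.
    relaxed = etree.XMLSchema(etree.parse(str(directory / "relaxed.xsd")))
    strict = etree.XMLSchema(etree.parse(str(directory / "strict.xsd")))
    return relaxed, strict


with tempfile.TemporaryDirectory() as tmp:
    relaxed_schema, strict_schema = load_schemas(tmp)

    # Made-up sample documents: before and after the replacement step.
    template = etree.fromstring("<root><foo>99</foo><foo>{{var1}}</foo></root>")
    filled = etree.fromstring("<root><foo>99</foo><foo>23</foo></root>")

    results = {
        "template/relaxed": relaxed_schema.validate(template),
        "template/strict": strict_schema.validate(template),
        "filled/relaxed": relaxed_schema.validate(filled),
        "filled/strict": strict_schema.validate(filled),
    }

print(results)
```

Only the template checked against the strict schema should be rejected; everything else is expected to pass.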

Upvotes: 3

Meyer

Reputation: 1712

Just with plain XSD, I do not know of any way to avoid a redundant declaration of the integer type. However, as an alternative, you could adjust the schema within Python.

A simple way is this, using only one schema document (relaxed by default):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="FooType">
    <xs:union memberTypes="xs:integer">
      <xs:simpleType id="RELAXED">
        <xs:restriction base="xs:string">
          <xs:pattern value="\{\{.*\}\}"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

</xs:schema>

In Python, you can then simply remove the element with id="RELAXED" to create the strict schema:

from lxml import etree

xsd_tree = etree.parse("relaxed.xsd")
xml_tree = etree.parse("test.xml")

# Create default relaxed schema
relaxed_schema = etree.XMLSchema(xsd_tree)

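# Remove the union member marked id="RELAXED" to create the strict schema.
# This modifies xsd_tree in place; relaxed_schema was already compiled above.
relaxed_member = xsd_tree.find(".//*[@id='RELAXED']")
relaxed_member.getparent().remove(relaxed_member)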
strict_schema = etree.XMLSchema(xsd_tree)

print("Relaxed:", relaxed_schema.validate(xml_tree))
print("Strict:", strict_schema.validate(xml_tree))

Of course, with Python you could do this in many different ways. For example, you could also dynamically generate an xs:union element and insert it into a strict version of the schema. But that will get a lot more complex.
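
A minimal sketch of that reverse direction, assuming the strict schema defines FooType as a single xs:restriction: the hypothetical helper relax_type() copies the schema tree, moves the existing restriction into an xs:union, and adds the {{...}} pattern as a second member. The function name, the inline schema, and the sample documents are all made up for illustration.

```python
import copy

from lxml import etree

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumed strict schema: FooType is a plain restriction of xs:integer.
STRICT_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="foo" type="FooType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="FooType">
    <xs:restriction base="xs:integer"/>
  </xs:simpleType>
</xs:schema>"""


def relax_type(xsd_tree, type_name, pattern):
    """Return a copy of xsd_tree in which the named simple type also accepts
    strings matching `pattern` (hypothetical helper, not part of lxml)."""
    relaxed = copy.deepcopy(xsd_tree)
    named = relaxed.find(f".//{XS}simpleType[@name='{type_name}']")
    restriction = named.find(f"{XS}restriction")

    # Move the original restriction into the first member of a new union.
    named.remove(restriction)
    union = etree.SubElement(named, f"{XS}union")
    strict_member = etree.SubElement(union, f"{XS}simpleType")
    strict_member.append(restriction)

    # The second member accepts the placeholder pattern.
    relaxed_member = etree.SubElement(union, f"{XS}simpleType")
    string_restriction = etree.SubElement(
        relaxed_member, f"{XS}restriction", base="xs:string"
    )
    etree.SubElement(string_restriction, f"{XS}pattern", value=pattern)
    return relaxed


strict_tree = etree.fromstring(STRICT_XSD).getroottree()
relaxed_tree = relax_type(strict_tree, "FooType", r"\{\{.*\}\}")

strict_schema = etree.XMLSchema(strict_tree)
relaxed_schema = etree.XMLSchema(relaxed_tree)

# Made-up sample documents.
template = etree.fromstring("<root><foo>99</foo><foo>{{var1}}</foo></root>")
broken = etree.fromstring("<root><foo>abc</foo></root>")

print("Template, relaxed:", relaxed_schema.validate(template))
print("Template, strict:", strict_schema.validate(template))
print("Broken, relaxed:", relaxed_schema.validate(broken))
```

Because the original restriction is reused as the first union member, the integer rule is declared only once, in the strict schema.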

Upvotes: 0
