radai

Reputation: 24192

how to "canonicalize" arbitrary xml (by reordering all attributes and elements)

I have some code that generates an *.xsd file from a set of JAXB-annotated classes:

JAXBContext context = JAXBContext.newInstance(/* the set of JAXB-annotated classes */);
final DOMResult result = new DOMResult(); //will hold xsd output
context.generateSchema(new SchemaOutputResolver() {
    @Override
    public Result createOutput(String namespaceUri, String suggestedFileName) throws IOException {
        return result;
    }
});
Document doc = (Document) result.getNode(); // getNode() returns Node, so a cast is needed
OutputFormat format = new OutputFormat(doc);
format.setIndenting(true);
StringWriter writer = new StringWriter();
XMLSerializer serializer = new XMLSerializer(writer, format);
serializer.serialize(doc);
String xsd = writer.toString();

The problem is that the xsd produced (stored in xsd) comes out in a nondeterministic order: two runs with the same input might generate logically identical xsds with a different element order, which plays havoc with diff tools when it's written out to a file.

How do I "canonicalize" the xml inside xsd?

I've seen some references to XSLT in related questions, but everything I saw required listing the elements in advance. I'm looking for something that works on any xml input.

Upvotes: 4

Views: 539

Answers (1)

C. M. Sperberg-McQueen

Reputation: 25034

There is no public spec I'm aware of that attempts to specify a canonical form for XSD schema documents. So there won't be off-the-shelf tools; you must either roll your own or decide (as Mathias Müller suggests) that diff is not your friend here.

Note that off-the-shelf canonicalization tools may normalize the order of attribute-value specifications in the input document, but they will never attempt to normalize the sequence of elements, since in the general case sequence of elements is significant in XML.

When I've been in this situation, I've specified a 'canonical' form that would minimize headaches for me (list all top-level elements in alpha order, then all top-level complex types in alpha order, then all top-level simple types in alpha order, ...) and written an XSLT stylesheet to sort the elements appropriately.

If that suffices for your purposes (that is, if it's the sequence of top-level constructs that's causing your problems), it's easy enough to do (assuming you have enough knowledge of XSLT to write a near-identity transform that sorts the top-level declarations, or can write an equivalent transformation in some other technology).
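A minimal sketch of such a near-identity transform, driven from Java through the standard javax.xml.transform API. The class and method names (SchemaSorter, sortTopLevel) are made up for illustration, and the ordering (includes/imports first, then elements, complex types, simple types, then everything else) is just one possible 'canonical' form, not an established one. It assumes every top-level declaration has a name attribute, and it drops top-level comments and whitespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: sorts the top-level declarations of a schema document.
final class SchemaSorter {
    private static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        // Identity rule: copy every node and attribute unchanged.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // Override for the schema root: emit its children grouped by kind, sorted by @name.
        + "<xsl:template match='/xs:schema'>"
        + "<xsl:copy>"
        + "<xsl:apply-templates select='@*'/>"
        // These must stay at the top of a schema document; keep their document order.
        + "<xsl:apply-templates select='xs:include|xs:import|xs:redefine|xs:annotation'/>"
        + "<xsl:apply-templates select='xs:element'><xsl:sort select='@name'/></xsl:apply-templates>"
        + "<xsl:apply-templates select='xs:complexType'><xsl:sort select='@name'/></xsl:apply-templates>"
        + "<xsl:apply-templates select='xs:simpleType'><xsl:sort select='@name'/></xsl:apply-templates>"
        // Remaining kinds: sort by element kind first, then by @name.
        + "<xsl:apply-templates select='xs:group|xs:attributeGroup|xs:attribute|xs:notation'>"
        + "<xsl:sort select='local-name()'/><xsl:sort select='@name'/>"
        + "</xsl:apply-templates>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a schema document given as a string.
    static String sortTopLevel(String xsd) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xsd)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("schema sort failed", e);
        }
    }
}
```

If the schema is already in a DOM, as in the question, a javax.xml.transform.dom.DOMSource wrapping the Document can be passed to transform() in place of the StreamSource.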

If the schema generation is also inconsistent regarding the structure of the individual declarations, then you may also need to normalize declaration structure (sort the children of xsd:choice alphabetically, sort attribute references and declarations alphabetically or by type or however you like, normalize model group structures, ...). Depending on how exuberantly your schema generator varies its output, this can in theory become arbitrarily complicated. But in practice, I expect that the problem will be tractable for anyone with adequate knowledge of XSD and XSLT (or some other XML processing technology).
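As a rough illustration of normalizing inside declarations, the sketch below sorts the children of xs:choice and the attribute declarations of a complex type, leaving everything else in document order. The names (DeclarationNormalizer, normalize) and the chosen sort keys are assumptions for the example; a real generator may need more rules, or fewer. Note that the children of xs:sequence are deliberately left alone, since their order is significant.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: normalizes the order of order-insensitive parts of declarations.
final class DeclarationNormalizer {
    private static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        // Identity rule: copy every node and attribute unchanged.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // xs:choice: keep any annotation first, then sort alternatives by kind and name/ref.
        + "<xsl:template match='xs:choice'>"
        + "<xsl:copy>"
        + "<xsl:apply-templates select='@*'/>"
        + "<xsl:apply-templates select='xs:annotation'/>"
        + "<xsl:apply-templates select='*[not(self::xs:annotation)]'>"
        + "<xsl:sort select='local-name()'/><xsl:sort select='@name|@ref'/>"
        + "</xsl:apply-templates>"
        + "</xsl:copy>"
        + "</xsl:template>"
        // Attribute-bearing constructs: content model first (unchanged), then sorted
        // attributes and attribute groups, then xs:anyAttribute, which must come last.
        + "<xsl:template match='xs:complexType|xs:extension|xs:restriction'>"
        + "<xsl:copy>"
        + "<xsl:apply-templates select='@*'/>"
        + "<xsl:apply-templates select='*[not(self::xs:attribute or"
        + " self::xs:attributeGroup or self::xs:anyAttribute)]'/>"
        + "<xsl:apply-templates select='xs:attribute|xs:attributeGroup'>"
        + "<xsl:sort select='local-name()'/><xsl:sort select='@name|@ref'/>"
        + "</xsl:apply-templates>"
        + "<xsl:apply-templates select='xs:anyAttribute'/>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a schema document given as a string.
    static String normalize(String xsd) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xsd)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("declaration normalization failed", e);
        }
    }
}
```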

You will also, of course, have to take steps to get the line breaks and whitespace in the schema documents under control; the XSLT serialization controls for indenting output should help you here.
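One way to do that, sketched under the assumption that the JDK's built-in XSLT processor is used: strip whitespace-only text nodes on input with xsl:strip-space and let xsl:output indent='yes' re-indent on output, so the result no longer depends on how the input happened to be formatted. The class and method names (StableIndenter, reindent) are illustrative only; the exact indentation width is processor-dependent, and strip-space will also remove whitespace-only text inside xs:documentation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: re-serializes XML with whitespace under the stylesheet's control.
final class StableIndenter {
    private static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        // Drop whitespace-only text nodes from the input before processing.
        + "<xsl:strip-space elements='*'/>"
        // Ask the serializer to add its own line breaks and indentation.
        + "<xsl:output method='xml' indent='yes'/>"
        // Identity rule: copy everything else unchanged.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Returns the document re-indented by the XSLT serializer.
    static String reindent(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("re-indenting failed", e);
        }
    }
}
```

The same xsl:strip-space and xsl:output declarations could instead be added to whatever sorting stylesheet is in use, so that sorting and whitespace control happen in a single pass.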

Upvotes: 2
