Reputation: 51
I can receive one of 82 XML structures. Each contains a root which is not in a namespace, and also contains several xmlns attributes: the first defines a URN for the schema for the object, and the rest (which define namespaces) contain the URNs for the common objects.
Schema-aware parsing in Java assumes you know the schema before you start parsing, but I do not know it until I have either loaded the XML without validation and extracted the root, at which point I can load it again with the right schema, or found some way to get to the xmlns attributes in the root and select the right schema. (I know how to map the URN to the correct schema, and all the schemas are held as resources on my classpath.)
It seems a shame to load the XML twice. Is there a way to do this in a single pass?
As an example I have a possible document which looks like:-
<?xml version="1.0" encoding="UTF-8"?>
<BusinessCard xmlns="urn:oasis:names:specification:ubl:schema:xsd:BusinessCard-2"
xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
</BusinessCard>
(There is obviously content inside the BusinessCard object, but I left it out as it is of no relevance here.)
The schema for this is in the resource "xsd/main/UBL-BusinessCard-2.2.xsd".
I have tried using an EntityResolver, but it does not get called before the parser complains that it cannot find the declaration of BusinessCard.
Upvotes: 0
Views: 145
Reputation: 163262
I'm not sure why you say the root isn't in a namespace, when the xmlns="urn:oasis:names:..." declaration makes it clear that it is.
One way to do this is to load a single composite schema that contains all the different component schemas, and validate against that. If the union of the schemas is a valid schema (i.e. no conflicting type definitions) then this might be the best approach, especially if you are validating thousands of documents and most of the component schemas are going to be used in each run.
On the other hand, if you're only using a small number of the component schemas in a given run, then this would be expensive.
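A minimal sketch of the composite-schema idea, assuming the standard javax.xml.validation API. The class name CompositeSchemaLoader and both method names are hypothetical, and the resource paths would be whatever your URN-to-schema mapping already uses. Passing each schema by URL rather than as a bare stream is meant to let relative import locations resolve; that is an assumption about how your schemas reference each other.

```java
import java.net.URL;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper: compiles several schema documents into one Schema.
class CompositeSchemaLoader {

    // Compile all given schema sources together as a single composite schema.
    static Schema fromSources(Source... sources) throws SAXException {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return factory.newSchema(sources);
    }

    // Look each schema up on the classpath and hand its URL to the factory,
    // so that the system ID is available for resolving relative imports.
    static Schema fromResources(String... resourcePaths) throws SAXException {
        ClassLoader loader = CompositeSchemaLoader.class.getClassLoader();
        Source[] sources = new Source[resourcePaths.length];
        for (int i = 0; i < resourcePaths.length; i++) {
            URL url = loader.getResource(resourcePaths[i]);
            if (url == null) {
                throw new IllegalArgumentException(
                        "Schema resource not found: " + resourcePaths[i]);
            }
            sources[i] = new StreamSource(url.toExternalForm());
        }
        return fromSources(sources);
    }
}
```

The resulting Schema is immutable and thread-safe, so it can be built once (for example with CompositeSchemaLoader.fromResources("xsd/main/UBL-BusinessCard-2.2.xsd") plus the other root schemas) and reused for every document.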
Another approach would be to detect the namespace using an abortive parse of the document. Write a SAX filter that captures the first namespace declaration and then aborts the parse by throwing an exception. You could also do this with a streaming XSLT 3.0 transformation.
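The abortive parse might look roughly like the sketch below. It records the namespace URI of the root element rather than the first xmlns declaration as such; for documents like the example these are the same URN, but that is an assumption. The class name RootNamespaceSniffer and the method rootNamespace are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reads just far enough to learn the root's namespace.
class RootNamespaceSniffer {

    static String rootNamespace(InputSource input) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for startElement to report URIs
        final String[] namespace = new String[1];
        try {
            factory.newSAXParser().parse(input, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts)
                        throws SAXException {
                    // The first startElement is the root: record and bail out.
                    namespace[0] = uri;
                    throw new SAXException("root element seen; parse aborted on purpose");
                }
            });
        } catch (SAXException e) {
            // If nothing was captured, this was a real parse error, so rethrow.
            if (namespace[0] == null) {
                throw e;
            }
        }
        return namespace[0];
    }
}
```

Only the start of the document is read, but the input stream is consumed, so the document has to be reopened (or buffered) for the validating parse that follows.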
Even smarter would be to write a little SAX pipeline that does some buffering. Capture the first startElement event, extract the namespace, load the schema, create a validator, feed it the SAX events that you've already consumed (the first startElement), then feed the rest of the SAX events from your preprocessor straight through to the validator.
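One way to sketch that buffering pipeline is a ContentHandler that holds back the few events preceding the root element, picks a schema once the root's namespace is known, and then forwards everything to a javax.xml.validation.ValidatorHandler. The class name LateBindingValidator and the schemaForNamespace lookup are hypothetical; the lookup stands in for your own URN-to-schema mapping. Processing instructions that appear before the root are simply dropped here, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import javax.xml.validation.Schema;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: chooses the schema when the root element arrives.
class LateBindingValidator extends DefaultHandler {
    private final Function<String, Schema> schemaForNamespace; // assumed lookup
    private final List<String[]> pendingPrefixes = new ArrayList<>();
    private Locator locator;
    private ValidatorHandler validator; // null until the root has been seen

    LateBindingValidator(Function<String, Schema> schemaForNamespace) {
        this.schemaForNamespace = schemaForNamespace;
    }

    @Override public void setDocumentLocator(Locator l) { locator = l; }

    @Override public void startPrefixMapping(String prefix, String uri) throws SAXException {
        if (validator == null) pendingPrefixes.add(new String[] {prefix, uri}); // buffer
        else validator.startPrefixMapping(prefix, uri);
    }

    @Override public void startElement(String uri, String local, String qName,
                                       Attributes atts) throws SAXException {
        if (validator == null) {
            Schema schema = schemaForNamespace.apply(uri);
            if (schema == null) throw new SAXException("No schema for namespace " + uri);
            validator = schema.newValidatorHandler();
            // Replay the events consumed before the schema was known.
            if (locator != null) validator.setDocumentLocator(locator);
            validator.startDocument();
            for (String[] p : pendingPrefixes) validator.startPrefixMapping(p[0], p[1]);
            pendingPrefixes.clear();
        }
        validator.startElement(uri, local, qName, atts);
    }

    // From here on, everything is passed straight through to the validator.
    @Override public void endElement(String uri, String local, String qName) throws SAXException {
        validator.endElement(uri, local, qName);
    }
    @Override public void endPrefixMapping(String prefix) throws SAXException {
        if (validator != null) validator.endPrefixMapping(prefix);
    }
    @Override public void characters(char[] ch, int start, int len) throws SAXException {
        if (validator != null) validator.characters(ch, start, len);
    }
    @Override public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException {
        if (validator != null) validator.ignorableWhitespace(ch, start, len);
    }
    @Override public void processingInstruction(String target, String data) throws SAXException {
        if (validator != null) validator.processingInstruction(target, data);
    }
    @Override public void endDocument() throws SAXException {
        if (validator != null) validator.endDocument();
    }
}
```

To use it, parse the document once with a namespace-aware SAXParser and pass a LateBindingValidator as the handler. With no ErrorHandler set, a ValidatorHandler throws on validation errors, so an invalid document surfaces as a SAXException from parse(). If you also need the validated content, ValidatorHandler.setContentHandler can pass the events on to your own handler.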
Upvotes: 1