Reputation: 3102
I have an XML file and its XSD schema. I am able to validate the XML file and implement a custom org.xml.sax.ErrorHandler like the following:
class MyErrorHandler implements ErrorHandler{
...
@Override
public void warning(SAXParseException exception) throws SAXException {
System.out.println("Line " + exception.getLineNumber() + ": " + exception.getMessage());
warnings++;
}
...
}
Is it possible to actually manipulate the element causing the exception, for example by removing it from the XML file?
Two notes:
Even just a suggestion on which direction to follow to solve the problem would be appreciated. Thanks!
Upvotes: 0
Views: 2255
Reputation: 111706
Automatic repair of an XML document is not possible in the general case.
Only in very limited contexts would the repair necessary to make an XML document valid be automatically discernible from any given validation error. There is not a one-to-one mapping between validation errors and ways of remedying them.
Consider element r with a through e children:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="r">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="a"/>
<xsd:element name="b"/>
<xsd:element name="c"/>
<xsd:element name="d"/>
<xsd:element name="e"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
An XML document such as this one,
<r>
<a/>
<x/>
<b/>
<c/>
<d/>
<e/>
</r>
would yield a validation message such as the following by Xerces-J:
[Error] try.xml:5:7: cvc-complex-type.2.4.a: Invalid content was found starting with element 'x'. One of '{b}' is expected.
You might here automatically remove x, and all would be fine. (Or, you might insert a b, which would not be fine.)
However, for the same XSD, consider that this XML document,
<r>
<a/>
<c/>
<d/>
<e/>
</r>
would yield a validation message such as the following by Xerces-J:
[Error] try.xml:5:7: cvc-complex-type.2.4.a: Invalid content was found starting with element 'c'. One of '{b}' is expected.
If you automatically removed c, your document would still be invalid, and you'd receive a similar message about d being unexpected. This would continue until your document looked like this,
<r>
<a/>
</r>
at which point the validator would report a different error that still asks for the same missing b,
[Error] try.xml:5:5: cvc-complex-type.2.4.b: The content of element 'r' is not complete. One of '{b}' is expected.
As you can see, there's simply not enough information available in a given validation error to know how to repair the XML document in general.
You could do better by consulting the XSD, but this is extremely complex and still not guaranteed to uniquely determine the exact mistake made by the authoring person or system. Automatic repair of an XML document, even given an XSD, is not possible in the general case.
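To see the ambiguity concretely, here is a minimal sketch using the JDK's javax.xml.validation API, with the schema above inlined as a string. The class and method names (AmbiguousErrors, collectErrors) are made up for illustration, and the exact message text may differ between JDK and Xerces versions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: collects validation messages instead of printing them.
final class AmbiguousErrors {
    // The schema from this answer: r must contain a, b, c, d, e in that order.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='r'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='a'/><xsd:element name='b'/><xsd:element name='c'/>"
      + "<xsd:element name='d'/><xsd:element name='e'/>"
      + "</xsd:sequence></xsd:complexType></xsd:element></xsd:schema>";

    static List<String> collectErrors(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        List<String> messages = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            // Record recoverable problems and keep validating.
            @Override public void warning(SAXParseException e) { messages.add(e.getMessage()); }
            @Override public void error(SAXParseException e) { messages.add(e.getMessage()); }
            // Well-formedness problems cannot be recovered from, so rethrow.
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        validator.validate(new StreamSource(new StringReader(xml)));
        return messages;
    }

    public static void main(String[] args) throws Exception {
        // Extra element x: deleting the reported element would repair the document.
        System.out.println(collectErrors("<r><a/><x/><b/><c/><d/><e/></r>"));
        // Missing element b: deleting the reported element c would make things worse.
        System.out.println(collectErrors("<r><a/><c/><d/><e/></r>"));
    }
}
```

Both calls report the same kind of error against whichever element follows a, which is all an ErrorHandler gets to see. Nothing in the exception says whether the right repair is a deletion or an insertion.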
Upvotes: 4
Reputation: 163539
Everything kjhughes says is correct.
However, if there are particular patterns of validation errors in your input, then it's possible to create rules that fix those.
In many cases it's probably simplest to do this by writing XSLT code that detects the incorrect pattern and fixes it without even applying schema validation. For example, if you have a perennial problem with EEE elements where the child XXX element is supposed to precede child YYY but they are often in the wrong order, then you can repair that with a template rule
<xsl:template match="EEE[XXX >> YYY]">
<xsl:copy>
<xsl:copy-of select="YYY/preceding-sibling::*, XXX, YYY, XXX/following-sibling::*"/>
</xsl:copy>
</xsl:template>
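If you would rather stay in Java than bring in an XSLT 2.0 processor, roughly the same targeted repair can be sketched with the JDK's DOM API. This swaps in DOM manipulation for the XSLT rule, so treat it as an illustration only: EEE, XXX and YYY are the placeholder names used above, the class name ReorderRepair is made up, and the code assumes at most one XXX and one YYY child per EEE.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: moves XXX in front of YYY wherever they are out of order.
final class ReorderRepair {
    // Returns the first child element of parent with the given name, or null.
    static Element firstChild(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    static String repair(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList eees = doc.getElementsByTagName("EEE");
        for (int i = 0; i < eees.getLength(); i++) {
            Element eee = (Element) eees.item(i);
            Element xxx = firstChild(eee, "XXX");
            Element yyy = firstChild(eee, "YYY");
            // Only act on the known bad pattern: XXX appearing after YYY.
            if (xxx != null && yyy != null
                    && (yyy.compareDocumentPosition(xxx) & Node.DOCUMENT_POSITION_FOLLOWING) != 0) {
                eee.insertBefore(xxx, yyy); // insertBefore moves the existing node
            }
        }
        // Serialize the repaired tree back to a string.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(repair("<EEE><A/><YYY/><XXX/><B/></EEE>"));
    }
}
```

As with the template rule, this fixes one known pattern without running schema validation at all. Documents that are already in the right order pass through unchanged.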
The theory in XML Schema is that when you validate a document, the output is not just a yes/no answer, nor even a set of error messages, but rather a document in which individual nodes are marked as valid or invalid, and if invalid, with the error conditions that cause them to be considered invalid. The theory is that you can then explore this document, find the invalidities, and handle them in the appropriate way. However, I don't think there are many tools that implement this, at least not in full.
Recent releases of Saxon's schema processor introduce the InvalidityHandler interface, which is called with complete information about each validity error, along with an implementation of that interface that produces a report of validation errors in XML format. This is designed to enable the construction of tools that do more with the error information than simply putting it in front of the user to ponder. There is certainly a class of validation errors where it would be possible to take the error report and generate XSLT code to correct the error. For example, if the input is a set of transactions to be processed, you could create a transaction file that omits those transactions that failed validation.
(Having said that, for this particular use case it might be better to write an XSLT or XQuery application that validates transactions one by one, and uses try/catch to copy only the valid transactions.)
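The same "validate one by one and keep only what passes" idea can also be sketched in plain Java with javax.xml.validation, for readers not using XSLT or XQuery. Everything specific here is an assumption made for illustration: the transactions/transaction/amount vocabulary, the tiny inline schema, and the class name TransactionFilter. It also assumes the schema declares transaction as a global element, so that each one can be validated on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical filter: drops every child of the root that fails validation.
final class TransactionFilter {
    // Illustrative schema: a transaction holds exactly one integer amount.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='transaction'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='amount' type='xsd:integer'/>"
      + "</xsd:sequence></xsd:complexType></xsd:element></xsd:schema>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for schema validation of a DOM tree
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Removes invalid child elements in place and returns how many were removed.
    static int removeInvalid(Document doc) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        List<Element> invalid = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) continue; // skip whitespace and comments
            Validator validator = schema.newValidator();
            try {
                // With no ErrorHandler set, the first validity error is thrown.
                validator.validate(new DOMSource(n));
            } catch (SAXException e) {
                invalid.add((Element) n);
            }
        }
        // Remove after the loop so the sibling traversal is not disturbed.
        for (Element e : invalid) e.getParentNode().removeChild(e);
        return invalid.size();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<transactions>"
            + "<transaction><amount>10</amount></transaction>"
            + "<transaction><amount>oops</amount></transaction>"
            + "</transactions>");
        System.out.println("removed: " + removeInvalid(doc));
    }
}
```

This avoids the repair problem altogether, because nothing is guessed: a unit of input either validates as a whole or is left out. It only makes sense when the document splits naturally into independent units like this.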
Upvotes: 0