Reputation: 1054
When I try to convert a CAS to XMI, I get a UIMARuntimeException
caused by the character reference "�" (an invalid XML character). Thanks in advance.
Exception:
Caused by: org.xml.sax.SAXParseException; lineNumber: 190920; columnNumber: 36557; Character reference "�" is an invalid XML character.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.uima.util.XmlCasDeserializer.deserializeR(XmlCasDeserializer.java:111)
at org.apache.uima.util.CasIOUtils.load(CasIOUtils.java:366)
Code:
private static void serialize(CAS cas, File file) throws SAXException, IOException {
    Watch casToXmi = new Watch(Path.getFileName() + "Cas to Xmi Conversion - " + file.getName());
    casToXmi.start();
    OutputStream outputStream = null;
    try {
        outputStream = new BufferedOutputStream(new FileOutputStream(file));
        XmiCasSerializer xmiSerializer = new XmiCasSerializer(cas.getTypeSystem());
        XMLSerializer xmlSerializer = new XMLSerializer(outputStream, true);
        xmiSerializer.serialize(cas, xmlSerializer.getContentHandler());
    } catch (FileNotFoundException fnfe) {
        throw new FileNotFoundException(fnfe.getMessage());
    } catch (SAXException saxe) {
        throw new SAXException(saxe.getMessage());
    } finally {
        try {
            outputStream.close();
        } catch (IOException ioe) {
            throw new IOException(ioe.getMessage());
        }
    }
    casToXmi.stop();
}
Upvotes: 1
Views: 565
Reputation: 1054
I switched to SerialFormat.BINARY, which writes the CAS in UIMA's plain custom binary format, without the type system and without filtering.
private static void serialize(CAS cas, File file) throws SAXException, IOException {
    Watch casToXmi = new Watch(Path.getFileName() + "Cas to Xmi Conversion - " + file.getName());
    casToXmi.start();
    // try-with-resources closes the stream even if save() throws, and avoids
    // a NullPointerException in a finally block when the file cannot be opened.
    try (OutputStream outputStream = new FileOutputStream(file)) {
        CasIOUtils.save(cas, outputStream, SerialFormat.BINARY);
    }
    casToXmi.stop();
}
Upvotes: 0
Reputation: 10915
By default, the XMI is serialized as XML 1.0, which can represent only a restricted range of characters.
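To make that restriction concrete, here is a minimal sketch of the Char production from the XML 1.0 specification. The class and method names (Xml10Chars, isValid, firstInvalidIndex) are made up for illustration and are not part of UIMA. Something along these lines could help locate the offending character in the document text before serializing.

```java
public class Xml10Chars {
    /** True if the code point matches the XML 1.0 "Char" production. */
    public static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first char that XML 1.0 cannot represent, or -1 if there is none. */
    public static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValid(cp)) {
                return i;
            }
            // Step over both halves of a surrogate pair when needed.
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("plain text"));    // prints -1
        System.out.println(firstInvalidIndex("bad\u0001char")); // prints 3
    }
}
```

Most control characters below U+0020 fall outside these ranges, which is typically where this kind of exception comes from.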
UIMA's CasIOUtils makes it easy to write the CAS out as XML 1.1 instead:
out = new FileOutputStream(this.outputFile);
CasIOUtils.save(cas, out, SerialFormat.XMI_1_1);
Alternatively, you can configure the serializer in your code to produce XML 1.1 instead, which might resolve your issue:
XMLSerializer sax2xml = new XMLSerializer(docOS, prettyPrint);
sax2xml.setOutputProperty(OutputKeys.VERSION, "1.1");
These lines were taken from the XmiWriter of DKPro Core.
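If you want to see the difference between the two XML versions outside of UIMA, the following sketch parses the same content with each version declaration using the JDK's built-in parser. The class name XmlVersionDemo and the helper parses are hypothetical, and the result assumes the default JDK parser is the one picked up by DocumentBuilderFactory.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlVersionDemo {
    /** Returns true if the parser accepts the document as well-formed. */
    public static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed for the declared XML version
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A reference to the control character U+0001.
        String body = "<text>&#1;</text>";
        System.out.println("XML 1.0: " + parses("<?xml version=\"1.0\"?>" + body));
        System.out.println("XML 1.1: " + parses("<?xml version=\"1.1\"?>" + body));
    }
}
```

The 1.0 document should be rejected and the 1.1 document accepted. Note that XML 1.1 still does not allow the NUL character (U+0000), so switching the version will not help if that is the character in your text.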
Note: I see your code includes a Watch. If speed is your concern, there are other supported formats that save and load considerably faster than XMI, e.g. the binary format SerialFormat.COMPRESSED_FILTERED_TSI. Unlike XMI, this format also supports any characters in the text.
Disclaimer: I am part of the Apache UIMA project and the maintainer of DKPro Core.
Upvotes: 1