Sugunalakshmi Pagemajik

Reputation: 1054

CAS to XMI - UIMA

When I try to convert a CAS to XMI, I get a UIMARuntimeException caused by the character reference "&#55349" (an invalid XML character). Thanks in advance.

Exception:

Caused by: org.xml.sax.SAXParseException; lineNumber: 190920; columnNumber: 36557; Character reference "&#55349" is an invalid XML character.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.uima.util.XmlCasDeserializer.deserializeR(XmlCasDeserializer.java:111)
at org.apache.uima.util.CasIOUtils.load(CasIOUtils.java:366)

Code:

    private static void serialize(CAS cas, File file) throws SAXException, IOException {
      Watch casToXmi = new Watch(Path.getFileName() + "Cas to Xmi Conversion - " + file.getName());
      casToXmi.start();
      // try-with-resources closes the stream even on failure and avoids a
      // NullPointerException in finally when the file cannot be opened.
      try (OutputStream outputStream = new BufferedOutputStream(new FileOutputStream(file))) {
        XmiCasSerializer xmiSerializer = new XmiCasSerializer(cas.getTypeSystem());
        XMLSerializer xmlSerializer = new XMLSerializer(outputStream, true);
        xmiSerializer.serialize(cas, xmlSerializer.getContentHandler());
      }
      casToXmi.stop();
    }

Upvotes: 1

Views: 565

Answers (2)

Sugunalakshmi Pagemajik

Reputation: 1054

I switched to SerialFormat.BINARY, which writes the CAS in UIMA's plain custom binary format, without the type system and without filtering.

    private static void serialize(CAS cas, File file) throws SAXException, IOException {
      Watch casToXmi = new Watch(Path.getFileName() + "Cas to Xmi Conversion - " + file.getName());
      casToXmi.start();
      // try-with-resources closes the stream even if save() throws.
      try (OutputStream outputStream = new FileOutputStream(file)) {
        CasIOUtils.save(cas, outputStream, SerialFormat.BINARY);
      }
      casToXmi.stop();
    }

Upvotes: 0

rec

Reputation: 10915

By default, XMI is serialized as XML 1.0, which can represent only a restricted range of characters.
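
To make that restriction concrete, here is a small JDK-only sketch (not UIMA code) of the Char production from the XML 1.0 specification. The class and method names are made up for illustration. It suggests why the parser rejects the reference: 55349 is 0xD835, a UTF-16 high surrogate, which is not a character on its own.

```java
public class Xml10Chars {

    // Char production from the XML 1.0 specification:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        int reported = 55349; // the number from the exception message

        System.out.println(Integer.toHexString(reported));              // d835
        System.out.println(isXml10Char(reported));                      // false
        System.out.println(Character.isHighSurrogate((char) reported)); // true

        // Paired with a low surrogate, the same code unit forms a legal character.
        int full = Character.toCodePoint((char) 0xD835, (char) 0xDC00); // U+1D400
        System.out.println(isXml10Char(full));                          // true
    }
}
```

In UTF-16, 0xD835 is the first half of characters such as U+1D400 (mathematical bold capital A), which is a legal XML character when written as a single code point. So the document text probably contains characters outside the Basic Multilingual Plane, though that is only a guess based on the number in the error message.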

UIMA's CasIOUtils makes it easy to write the CAS out as XML 1.1 instead:

  out = new FileOutputStream(this.outputFile);
  CasIOUtils.save(cas, out, SerialFormat.XMI_1_1);

Alternatively, you can configure the serializer in your code to produce XML 1.1 instead, which might resolve your issue:

  XMLSerializer sax2xml = new XMLSerializer(docOS, prettyPrint);
  sax2xml.setOutputProperty(OutputKeys.VERSION, "1.1");

These lines were taken from the XmiWriter of DKPro Core.


Note: I see your code includes a Watch. If speed is your concern, there are other supported formats that save and load considerably faster than XMI, e.g. the binary format SerialFormat.COMPRESSED_FILTERED_TSI. Unlike XMI, this format also supports any characters in the text.

Disclaimer: I am part of the Apache UIMA project and the maintainer of DKPro Core.

Upvotes: 1
