Martin Schröder

Reputation: 4591

Java: How to properly convert Windows console output to XML?

On Windows 7, I'm trying to capture the console output of a jar (written with System.out) and write it to an XML file. This works, but I'm having encoding problems (e.g. with an "ë").

I have this code for reading the console output:

final LinkedList<String> texOutput = new LinkedList<String>();
final Process p = Runtime.getRuntime().exec("java -jar " + absoluteNameOfJar, null, tmpDir);
String line;
final BufferedReader output = new BufferedReader(new InputStreamReader(p.getInputStream(), "Cp1252"));
while ( (line = output.readLine()) != null) {
    texOutput.add(line);
}

And here's the code for writing the LinkedList to XML (using JDOM):

if (texOutput.size() > 0) {
    final Element xmlTeXOutput = new Element(XML_ELEMENT_KEY_TEX_OUTPUT);
    for (String line : texOutput) {
        final Element xmlLine = new Element(XML_ELEMENT_KEY_LINE);
        xmlLine.setText(line);
        xmlTeXOutput.addContent(xmlLine);
    }
    genOut.addContent(xmlTeXOutput);
}

With this I get encoding errors in the XML (from the wrongly converted "ë"): "Invalid byte 2 of 3-byte UTF-8 sequence".

I found these questions: "How to get console charset?" and "Java : How to determine the correct charset encoding of a stream". Neither gives me any hope. It seems I have to set the correct encoding for the InputStreamReader, but there seems to be no portable method to find the encoding actually used. Is there a way to fix this?

Oh, and if possible a portable solution should work on MacOS too. And I don't want to set the encoding of the XML to ISO-8859-1 (which seems to be the common work-around according to Google): UTF-8 should work.

EDIT: I write the XML file thusly:

final XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
final String targetXMLFileName = FilenameUtils.concat(targetDirName, xmlID.getText() + "-out.xml");
final File targetXMLFile = new File(targetXMLFileName);
final FileWriter targetXMLFileWriter = new FileWriter(targetXMLFile);
xmlOutputter.output(xmlOutput, targetXMLFileWriter);
targetXMLFileWriter.close();

Upvotes: 1

Views: 950

Answers (1)

McDowell

Reputation: 108959

There are a number of potential problems here:

  • "Cp1252" may not be the default system encoding that the other application is using with stdout
  • the default encoding may not be a Unicode encoding (which can cause data loss)
  • there may be a transcoding error when serializing your DOM to the XML file

Verify that data is being read correctly from the other process. If the default encoding is causing an issue, you may want to write a wrapper app with a main method that sets stdout to a Unicode-encoding stream and then invoke the other main. Then decode within the above code using the same encoding.
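The wrapper idea above could be sketched roughly as follows. This is only an illustration: the class name Utf8StdoutWrapper and the convention of passing the wrapped main class as the first argument are assumptions, not something taken from the question's code.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical wrapper: forces UTF-8 on stdout, then delegates to the real main class.
class Utf8StdoutWrapper {

    // Wraps a raw byte stream in a PrintStream that always encodes text as UTF-8.
    static PrintStream utf8(OutputStream raw) throws UnsupportedEncodingException {
        return new PrintStream(raw, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Replace System.out before the wrapped application prints anything.
        System.setOut(utf8(new FileOutputStream(FileDescriptor.out)));

        // Assumption: args[0] is the fully qualified name of the wrapped main class;
        // the remaining arguments are passed through unchanged.
        Method target = Class.forName(args[0]).getMethod("main", String[].class);
        target.invoke(null, (Object) Arrays.copyOfRange(args, 1, args.length));
    }
}
```

The calling side would then start the wrapper with the jar on the classpath instead of using "java -jar", and decode with the matching charset: new InputStreamReader(p.getInputStream(), "UTF-8").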

There is also a hack involving the file.encoding system property, but this may cause unintended side effects.

If the problem is with serializing the XML, it is likely that the data is being written with the wrong encoding even though the declaration says UTF-8. This commonly happens when serializing to a Writer, because the serializer does not control the output encoding as it would with an OutputStream.
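To see why such a mismatch would produce the "Invalid byte 2 of 3-byte UTF-8 sequence" error, it may help to compare the bytes for "ë" in both encodings. The sketch below assumes the platform default on the affected machine is windows-1252 (Cp1252); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows the bytes produced for "ë" under two encodings.
class EncodingMismatchDemo {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u00eb"; // "ë"

        // Assumed default on a Western European Windows system.
        System.out.println(hex(text.getBytes(Charset.forName("windows-1252")))); // EB

        // What a parser expects when the declaration says UTF-8.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8))); // C3 AB
    }
}
```

In UTF-8, the single byte EB is the lead byte of a three-byte sequence, so a parser that trusts the UTF-8 declaration treats the bytes that follow as continuation bytes and rejects them.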


EDIT

The problem is here:

new FileWriter(targetXMLFile);

From the documentation:

Convenience class for writing character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable.

If you always want UTF-8, construct a stream that writes UTF-8.
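A minimal sketch of such a writer, using only the standard library; the class name Utf8WriterSketch, the helper openUtf8Writer, and the temporary file are placeholders for illustration.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Illustrative only: opens a Writer with an explicit UTF-8 encoding instead of FileWriter.
class Utf8WriterSketch {

    // Returns a Writer that encodes as UTF-8 regardless of the platform default.
    static Writer openUtf8Writer(File file) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        File target = File.createTempFile("utf8-sketch", ".xml"); // placeholder target file
        target.deleteOnExit();

        Writer writer = openUtf8Writer(target);
        try {
            writer.write("<line>\u00eb</line>");
        } finally {
            writer.close();
        }

        // Read the bytes back as UTF-8 to confirm the round trip.
        byte[] bytes = Files.readAllBytes(target.toPath());
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

In the question's code, the Writer returned by a helper like this would take the place of the FileWriter passed to xmlOutputter.output. Alternatively, XMLOutputter has an output overload that accepts an OutputStream, in which case the outputter applies the encoding configured on its Format itself.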

Upvotes: 1
