rds

Reputation: 26994

Which encoding does Process.getInputStream() use?

In a Java program, I spawn a new Process via ProcessBuilder.

args[0] = directory.getAbsolutePath() + File.separator + program;
ProcessBuilder pb = new ProcessBuilder(args);
pb.directory(directory);
final Process process = pb.start();

Then, I read the process's standard output in a new Thread:

new Thread() {
    public void run() {
        // try-with-resources closes the reader once the process output ends
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}.start();

However, when the process outputs non-ASCII characters (such as 'é'), the line has character '\uFFFD' instead.

What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?

How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?

Edit: I tried new InputStreamReader(...,"UTF-8"): é becomes \uFFFD

Upvotes: 8

Views: 19701

Answers (8)

thoni56

Reputation: 3335

If you, like me, know which encoding you want to use for all input/output, you can specify it in the Java API calls that create readers (some of them accept an encoding, not all), as other answers have pointed out.

But this hard-codes it in the source, which might or might not be OK.

I found a better way after reading this answer, which reveals that you can set the encoding you need when the JVM starts up:

java -Dfile.encoding=ISO-8859-1 ...
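
To check what the flag actually did, it can help to print the charset the JVM ended up with. The sketch below is only an illustration: the class name DefaultCharsetReport is made up, and the native.encoding property is assumed to exist only on recent JDKs (it is reported as null otherwise). Note that on JDK 18 and later the default charset is UTF-8 unless file.encoding is set on the command line.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports which charsets the running JVM picked up.
class DefaultCharsetReport {
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
            // Value given with -Dfile.encoding=..., or the JVM's own default
            + ", file.encoding=" + System.getProperty("file.encoding")
            // Assumption: only set on recent JDKs; null when absent
            + ", native.encoding=" + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without -Dfile.encoding=ISO-8859-1 shows whether the setting reached the JVM.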

Upvotes: 1

Use the commons-lang JAR for this: StringEscapeUtils.escapeHtml

BufferedReader br = new BufferedReader(
    new InputStreamReader(conn.getInputStream()));
// escapeHtml takes a String, so escape each line after reading it
String escaped = StringEscapeUtils.escapeHtml(br.readLine());

Upvotes: 0

Cris

Reputation: 5007

I posted this as a comment, but I see an answer was added afterwards, so it might be redundant now :)

BufferedReader br = new BufferedReader(
    new InputStreamReader(conn.getInputStream(), "UTF-8"));

Upvotes: 0

Grim

Reputation: 1996

Scientific

On Windows this works perfectly:

private static final Charset CONSOLE_ENCODING;
static {
    Charset enc = Charset.defaultCharset();
    try {
        String example = "äöüßДŹす";
        String command = File.separatorChar == '/' ? "echo " + example : "cmd.exe /c echo " + example;
        Process exec = Runtime.getRuntime().exec(command);
        InputStream inputStream = exec.getInputStream();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        while (exec.isAlive()) {
            Thread.sleep(100);
        }
        byte[] buff = new byte[inputStream.available()];
        if (buff.length > 0) {
            int count = inputStream.read(buff);
            baos.write(buff, 0, count);
        }

        byte[] array = baos.toByteArray();
        for (Charset charset : Charset.availableCharsets().values()) {
            String s = new String(array, charset);
            if (s.equals(example)) {
                enc = charset;
                break;
            }
        }
    } catch (InterruptedException e) {
        throw new Error("Could not determine console charset.", e);
    } catch (IOException e) {
        throw new Error("Could not determine console charset.", e);
    }
    CONSOLE_ENCODING = enc;
}

The specification gives no hint about whether the JVM's encoding can change at runtime, so we cannot be sure that the encoding does NOT change while running, or that the detected charset is still correct after such a change.

Upvotes: 2

jan.supol

Reputation: 2805

Interestingly enough, when running on Windows:

ProcessBuilder pb = new ProcessBuilder("cmd", "/c dir");
Process process = pb.start();

Then the CP437 code page works quite well for:

new InputStreamReader(process.getInputStream(), "CP437");
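
One way to see why an OEM code page such as CP437 fits console output is to decode the same byte under several candidate charsets. This is a rough sketch, not a detection method: the class CodePageProbe and the list of candidates are assumptions for illustration, and the console code page in use depends on the Windows locale (chcp prints it).

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: decodes the same bytes under several candidate charsets.
class CodePageProbe {
    static Map<String, String> decodeWith(byte[] bytes, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            // Skip charsets this JVM does not ship (e.g. a trimmed runtime image)
            if (Charset.isSupported(name)) {
                results.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        byte[] eAcute = { (byte) 0x82 }; // 'é' in the OEM code pages 437 and 850
        decodeWith(eAcute, "IBM437", "IBM850", "windows-1252", "UTF-8")
            .forEach((name, text) ->
                System.out.printf("%-12s U+%04X%n", name, (int) text.charAt(0)));
    }
}
```

Only the charsets that map the byte to U+00E9 are plausible for a program that printed 'é' as 0x82; a lone 0x82 is not valid UTF-8 and decodes to U+FFFD.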

Upvotes: 9

kan

Reputation: 28961

As I understand it, operating system streams are byte streams; there are no characters in them. The InputStreamReader constructor uses the JVM default character set, java.nio.charset.Charset#defaultCharset(); you can use another constructor to specify a character set explicitly.
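
A minimal sketch of that other constructor, assuming the bytes are ISO-8859-1. The class ExplicitCharsetReader and its readLines method are made-up names, and a ByteArrayInputStream stands in for the process stream so the example runs without spawning anything.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads all lines from a byte stream with an explicit charset.
class ExplicitCharsetReader {
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        // The two-argument constructor bypasses the JVM default charset
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "café" followed by a newline, encoded as ISO-8859-1 (0xE9 is 'é')
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9, '\n' };
        System.out.println(readLines(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));
    }
}
```

With a real child process, process.getInputStream() would take the place of the ByteArrayInputStream, and the charset would be whatever the child program writes.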

Upvotes: 4

AlexR

Reputation: 115338

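According to http://www.fileformat.info/info/unicode/char/e9/index.htm, the Unicode code point for 'é' is '\u00E9'; '\uFFFD' is the Unicode replacement character, which a decoder substitutes for bytes it cannot map. If '\u00E9' reaches your program and only the printed output looks wrong, you are reading the stream correctly and your problem is in writing.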

The Windows console does not support Unicode by default. So, if you want to test your code, open a file and write your stream there. But do not forget to set the encoding to UTF-8.

Upvotes: 2

Thilo

Reputation: 262534

An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).

If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.

If you know what encoding to use (and you really have to know):

new InputStreamReader(process.getInputStream(), "UTF-8") // for example
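
When the guess is wrong, InputStreamReader does not fail; it silently substitutes '\uFFFD', which matches what the question reports for "UTF-8". A strict decoder can make the mismatch visible instead. This is a sketch under that assumption: StrictDecode and tryDecode are hypothetical names, and it works on bytes already captured from the process.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: decodes bytes strictly, reporting a mismatch instead of U+FFFD.
class StrictDecode {
    static Optional<String> tryDecode(byte[] bytes, Charset charset) {
        try {
            return Optional.of(charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty(); // the bytes are not valid in this charset
        }
    }

    public static void main(String[] args) {
        byte[] latin1EAcute = { (byte) 0xE9 }; // 'é' encoded as ISO-8859-1
        System.out.println(tryDecode(latin1EAcute, StandardCharsets.ISO_8859_1).isPresent());
        System.out.println(tryDecode(latin1EAcute, StandardCharsets.UTF_8).isPresent());
    }
}
```

An empty result rules a charset out, but a successful decode does not prove the charset is right: single-byte encodings such as ISO-8859-1 accept every byte, so you still have to know what the called program produces.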

Upvotes: 9
