Reputation: 143

After changing a file's encoding, Windows gets it wrong

I want to convert a file from one encoding to another (it doesn't matter which). But when I open the resulting file (w.txt), its contents are garbled: Windows does not interpret it correctly.

Which target encoding should I pass (args[1]) so that Windows Notepad interprets the file correctly?

import java.io.*;
import java.nio.charset.Charset;

public class Kodowanie {

    public static void main(String[] args) throws IOException {
        // Hard-coded for testing: input file name and target encoding.
        args = new String[2];
        args[0] = "plik.txt";
        args[1] = "ISO8859_2";
        String linia;
        File f = new File(args[0]), f1 = new File("w.txt");

        // Read the input file as UTF-8.
        FileInputStream fis = new FileInputStream(f);
        InputStreamReader isr = new InputStreamReader(fis,
                Charset.forName("UTF-8"));
        BufferedReader in = new BufferedReader(isr);

        // Write the output file in the target encoding.
        FileOutputStream fos = new FileOutputStream(f1);
        OutputStreamWriter osw = new OutputStreamWriter(fos,
                Charset.forName(args[1]));
        BufferedWriter out = new BufferedWriter(osw);

        // Copy line by line, using the platform line separator.
        while ((linia = in.readLine()) != null) {
            out.write(linia);
            out.newLine();
        }
        out.close();
        in.close();

    }

}

input:

Ala
ma 
Kota

output:

?Ala
ma 
Kota

Why is there a '?'?

Upvotes: 1

Answers (2)

Edwin Dalorzo

Reputation: 78579

US-ASCII is a subset of Unicode (a pretty small one, by the way). You are reading a file in UTF-8 and then writing it back in US-ASCII. The encoder therefore has to make a decision whenever a character cannot be expressed in the reduced 7-bit US-ASCII set. Classically, such a character is replaced by a default character, like ?.

Keep in mind that characters in UTF-8 are multibyte in many cases, whereas US-ASCII is only 7 bits wide. This means that no Unicode character above code point 127 can be expressed in US-ASCII. That could explain the question marks you see once the file has been converted.
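The replacement behaviour described above can be reproduced without any files. The following is only a minimal sketch: the class and method names are made up for illustration, and the last case assumes that plik.txt starts with a byte order mark (U+FEFF), which the question does not confirm. It relies on `String.getBytes(Charset)`, which substitutes the charset's default replacement byte for anything it cannot encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableCharDemo {

    // Hypothetical helper: encode the text into the target charset and
    // decode it again, so any replacement characters become visible.
    static String roundTrip(String text, Charset target) {
        byte[] encoded = text.getBytes(target); // unmappable chars become '?'
        return new String(encoded, target);
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");

        // Plain ASCII text survives unchanged.
        System.out.println(roundTrip("Ala ma kota", StandardCharsets.US_ASCII));

        // "\u017C\u00F3\u0142w" is the Polish word for turtle; its first
        // three letters are outside US-ASCII, so this prints ???w
        System.out.println(roundTrip("\u017C\u00F3\u0142w", StandardCharsets.US_ASCII));

        // Assumption: the input starts with a BOM (U+FEFF). ISO-8859-2
        // cannot represent it, so this prints ?Ala
        System.out.println(roundTrip("\uFEFFAla", latin2));
    }
}
```

If the last case matches what you see, the leading '?' comes from a single unmappable character at the start of the file rather than from the letters themselves.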

I answered a similar question, "Reading Strange Unicode Characters in Java". Perhaps it helps.

I also recommend reading "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Upvotes: 1

thedayofcondor

Reputation: 3876

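The default (ANSI) encoding on Windows is Cp1252 on Western European and US English systems. Other locales use a different code page, so the right target depends on the system's settings.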
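To see why the target encoding has to match the one the reader assumes, here is a small sketch that encodes a string with one charset and decodes the same bytes with another. The class and method names are hypothetical, and the charset names are only examples for Polish text (windows-1250 is the Central European code page); which code page Notepad actually assumes depends on the machine.

```java
import java.nio.charset.Charset;

public class CodePageMismatchDemo {

    // Hypothetical helper: encode with one charset, decode with another,
    // mimicking a file written in one encoding and opened in a different one.
    static String writeThenRead(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        String word = "m\u0105ka"; // Polish word containing a-ogonek (U+0105)

        // ISO-8859-2 stores U+0105 as byte 0xB1, which both windows-1250
        // and windows-1252 display as the plus-minus sign (U+00B1).
        String as1250 = writeThenRead(word, "ISO-8859-2", "windows-1250");
        String as1252 = writeThenRead(word, "ISO-8859-2", "windows-1252");
        System.out.println(as1250.equals("m\u00B1ka")); // true
        System.out.println(as1252.equals("m\u00B1ka")); // true

        // Writing and reading with the same code page preserves the text.
        String matched = writeThenRead(word, "windows-1250", "windows-1250");
        System.out.println(matched.equals(word)); // true
    }
}
```

Passing the matching name as args[1] (for example "windows-1250" on a Polish system, "windows-1252" on a Western European one) is therefore the value to try first.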

Upvotes: 1
