dynamic

Reputation: 48091

Reading UTF-8 file and writing plain ANSI?

I have a UTF-8 file (a CSV).
I need to read this file line by line, do some replacements, and then write each line to another file.

    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(fileFix), "ASCII")
    );
    bw.write("");   //clean current file


    BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(file),"UTF-8")
    );

    String line;
    while ((line = br.readLine()) != null) {
        line = line.replace(";", ",");
        bw.append(line + "\n");
    }

Simple as that.
The problem is that the output file (fileFix) is UTF-8, and I think it contains the BOM character.

How can I write the file as plain ANSI without the BOM?

This is the error I get when reading my file with another program (Weka):

[Image: screenshot of the Weka error, not preserved]

The first line of this file:

[Image: screenshot of the file's first line, not preserved]

Note that Notepad++ tells me the charset is UTF-8. If I convert this file to plain ASCII (with Windows Notepad), those characters disappear.

Solution

When you are on the first line, run:

    line = line.substring(1);

to remove the BOM character.
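
Note that `substring(1)` drops the first character unconditionally, so it would eat real data if the input has no BOM. A minimal sketch of a guarded version follows; the class and method names (`BomStrip`, `stripLeadingBom`) are made up for illustration and are not from the original post.

```java
final class BomStrip {
    // U+FEFF is the character that the UTF-8 byte sequence EF BB BF decodes to.
    static final char BOM = '\uFEFF';

    /** Returns the line without one leading BOM; lines without a BOM are returned unchanged. */
    static String stripLeadingBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }
}
```

It would be called only for the first line read, for example `line = BomStrip.stripLeadingBom(line);`.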

Upvotes: 2

Views: 7530

Answers (2)

Konstantin

Reputation: 3696

Look at http://en.wikipedia.org/wiki/Byte_order_mark for the pattern to replace; it looks like EF BB BF rather than FE FF.

This solution is wrong; check Jon's answer instead.

Upvotes: 1

Jon Skeet

Reputation: 1499790

It sounds like this is a BOM issue rather than an encoding issue as such.

You can just remove any BOM characters as you write the file, with:

    line = line.replace("\ufeff", "");
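
To see why a single-character replace is enough, here is a small sketch (the class and method names are hypothetical). It assumes the input starts with the UTF-8 BOM bytes EF BB BF. Java's UTF-8 decoder does not discard those bytes; it turns them into the one character U+FEFF, which then travels with the first line.

```java
import java.nio.charset.StandardCharsets;

final class BomDecodeDemo {
    /** Sample input: the UTF-8 BOM (EF BB BF) followed by the ASCII text "a;b". */
    static byte[] sampleBytes() {
        return new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ';', 'b'};
    }

    /** Decodes raw bytes as UTF-8; the three BOM bytes become the single char U+FEFF. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Removes every U+FEFF from the line, as in the replace call above. */
    static String removeBom(String line) {
        return line.replace("\uFEFF", "");
    }
}
```

Decoding the sample gives a four-character string whose first character is U+FEFF, and `removeBom` reduces it to `a;b`.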

That leaves the question of whether you're reading the data accurately in the first place... I'd strongly advise you not to use FileWriter and FileReader at all - instead, use InputStreamReader and OutputStreamWriter, specifying the encoding explicitly for both of them. Set the reader encoding to UTF-8 (assuming the input file really is UTF-8), and set the writer encoding to whatever you want... but I'd recommend sticking with UTF-8, to be honest.

Also note that you should be closing your reader/writer in finally blocks, or using the try-with-resources statement if you're using Java 7 or later.
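
Putting those pieces together, a minimal sketch might look like the following. It assumes Java 7 or later, that the input really is UTF-8, and that UTF-8 output is acceptable; the class and method names (`CsvFix`, `convert`) are invented for the example, while `file` and `fileFix` mirror the variables in the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

final class CsvFix {
    /** Copies file to fileFix, dropping any BOM and replacing ';' with ','. */
    static void convert(File file, File fileFix) throws IOException {
        // try-with-resources closes both streams, even if an exception is thrown.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream(file), StandardCharsets.UTF_8));
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(fileFix), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // The BOM decodes to U+FEFF; remove it before writing.
                line = line.replace("\uFEFF", "").replace(";", ",");
                bw.write(line);
                bw.write("\n");
            }
        }
    }
}
```

FileOutputStream truncates an existing file when it is opened this way, so no initial empty write is needed. The writer does not emit a BOM of its own, so the output is BOM-free as long as the character is removed from the input lines.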

Upvotes: 5
