Mike

Reputation: 755

icu4j: read and write files with different charsets

I am using Java to parse a folder and read the files in it. The folder contains only text files, but in different charsets: some are in ISO-8859-1 and some are in windows-1252.

I need to read each file and combine them all into one single file, so I append the content. Here is my code:

File fiout = new File("single_" + System.currentTimeMillis() + ".csv");
// Note: no charset is given here, so the writer uses the platform default
PrintWriter writer = new PrintWriter(fiout);
for (int x = 0; x < all_zipEntries.size(); x++) {
    File fi = (File) all_zipEntries.get(x);
    String zipfilename = fi.getName();

    // getCharset(...) returns the detected charset name for this file
    String charset = getCharset(fi);
    Charset inputCharset = Charset.forName(charset);

    log.println("Read " + zipfilename + " ... (Charset " + charset + " ... " + inputCharset.toString() + ")");

    // Open the file itself rather than only its name, so the full path is used
    FileInputStream fis = new FileInputStream(fi);
    InputStreamReader isr = new InputStreamReader(fis, inputCharset);
    BufferedReader in = new BufferedReader(isr);
    // readLine() returns null at end of file
    String row;
    while ((row = in.readLine()) != null) {
        writer.println(row);
    }
    in.close();
    isr.close();
    fis.close();
}
writer.close();

This is my log:

Read 01.csv ... (Charset ISO-8859-1 ... ISO-8859-1)
Read 02.csv ... (Charset ISO-8859-1 ... ISO-8859-1)
Read 03.csv ... (Charset windows-1252 ... windows-1252)
Read 04.csv ... (Charset windows-1252 ... windows-1252)
Read 05.csv ... (Charset windows-1252 ... windows-1252)
Read 06.csv ... (Charset windows-1252 ... windows-1252)
Read 07.csv ... (Charset windows-1252 ... windows-1252)
Read 08.csv ... (Charset windows-1252 ... windows-1252)
Read 09.csv ... (Charset windows-1252 ... windows-1252)

As you can see, the first 2 files are ISO-8859-1 encoded and the rest are windows-1252.

My default charset is ISO-8859-1. In the result file that was created by the code above, I have some lines with

Äpfel
Äpfel
Äpfel

and I have lines like

?pfel
?pfel

The latter come from files 3 to 9. It seems to me that the conversion from windows-1252 to ISO-8859-1 did not work correctly, even though I set the charset when reading!
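
For comparison, here is a minimal sketch of the same merge loop with the output charset set explicitly instead of relying on the platform default. It is not a confirmed fix for the `?` characters, only a way to rule out the writer side. The class name `MergeWithCharsets`, the `merge` method, and the choice of UTF-8 as output charset are assumptions for illustration; the map of file to charset stands in for whatever `getCharset(fi)` returns.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper, not part of the original code.
final class MergeWithCharsets {

    // Appends every input file to `target`. Each file is decoded with its own
    // charset; all output is encoded with the single charset passed in.
    static void merge(Map<Path, Charset> inputs, Path target, Charset outputCharset)
            throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(target), outputCharset))) {
            for (Map.Entry<Path, Charset> entry : inputs.entrySet()) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(Files.newInputStream(entry.getKey()),
                                entry.getValue()))) {
                    String row;
                    // readLine() returns null once the end of the file is reached
                    while ((row = in.readLine()) != null) {
                        writer.write(row);
                        writer.newLine();
                    }
                }
            }
        }
    }
}
```

Called with a `LinkedHashMap` (to keep the file order) and `StandardCharsets.UTF_8`, this writes every character that was decoded on the reading side, because UTF-8 can encode all of them. If the `?` still appears in the output, the problem is more likely in the detected input charset than in the writing step.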

Upvotes: 0

Views: 92

Answers (0)
