nullvoid

Reputation: 55

UTF-8 is not working for Java ZipOutputStream

I am generating a zip file containing a CSV file using ZipOutputStream. I pass UTF-8 as the encoding, but German umlauts do not survive the round trip: after the file is unzipped, they are displayed incorrectly.

I am not sure whether the problem lies in the compression or in the decompression.

The existing topics on this issue are mainly about special characters in the filename, but in my case the problem appears in the data.

    val zos = ZipOutputStream(outputStream, StandardCharsets.UTF_8)
    val entry = ZipEntry("file1.csv")
    zos.putNextEntry(entry)

    val writer = CsvWriter(zos)

    for (entr in data)
        writer.appendRow { entr.forEach { write(it) } }
    zos.closeEntry()
    zos.close()

Upvotes: 1

Views: 1621

Answers (2)

Willis Blackburn

Reputation: 8204

I don't think that your example is correct, because you're passing a ZipOutputStream directly to CsvWriter. Assuming you are using OpenCSV, the CsvWriter constructor needs a Writer, not an OutputStream.

In Java, I/O streams are either byte streams, which carry raw data, or character streams, which consist of Unicode characters. To convert from one to the other, you must supply a character encoding, which tells Java how to convert characters to and from bytes. (If you don't provide one, Java uses the default character encoding, which is UTF-8 since Java 18 and platform-dependent in earlier versions.) InputStream and OutputStream are byte streams, while the corresponding character streams are called Reader and Writer.
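To make the byte/character distinction concrete, here is a small Kotlin sketch. The function name `decodeUtf8BytesAsLatin1` is made up for this illustration; it encodes text as UTF-8 and then decodes the same bytes with the wrong charset, which is one common way umlauts end up garbled.

```kotlin
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration: encode as UTF-8, then decode
// the resulting bytes as ISO-8859-1 (a mismatched charset).
fun decodeUtf8BytesAsLatin1(text: String): String {
    // "ä" becomes two bytes in UTF-8 (0xC3 0xA4) ...
    val utf8Bytes = text.toByteArray(StandardCharsets.UTF_8)
    // ... and each of those bytes is read as a separate character here.
    return String(utf8Bytes, StandardCharsets.ISO_8859_1)
}

// Hypothetical helper: number of bytes a string needs in a given charset.
fun encodedSize(text: String, charset: java.nio.charset.Charset): Int =
    text.toByteArray(charset).size
```

With these helpers, `encodedSize("ä", StandardCharsets.UTF_8)` is 2 while `encodedSize("ä", StandardCharsets.ISO_8859_1)` is 1, and `decodeUtf8BytesAsLatin1("ä")` yields `Ã¤`. If the unzipped file shows that kind of pattern, the bytes may well be correct UTF-8 and only the program displaying them is assuming a different encoding.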

You have a ZipOutputStream, which is a byte stream. The OpenCSV CsvWriter constructor requires a Writer, a character stream, which makes sense because CSV is a text format. (I imagine this would be true of other CsvWriter implementations as well.) You should wrap your ZipOutputStream in an instance of OutputStreamWriter, which will convert the CSV characters into bytes. You can specify the character encoding in the OutputStreamWriter constructor.
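As a minimal sketch of that wrapping, the following Kotlin code writes rows through an OutputStreamWriter with an explicit UTF-8 charset. It does not use any CSV library, because it is not clear which one the question relies on; the helpers `zipCsv` and `unzipFirstEntry` are hypothetical, and the comma join is a simplification that does no quoting or escaping.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.OutputStreamWriter
import java.nio.charset.StandardCharsets
import java.util.zip.ZipEntry
import java.util.zip.ZipInputStream
import java.util.zip.ZipOutputStream

// Hypothetical helper: writes the rows as "file1.csv" inside an in-memory zip.
fun zipCsv(rows: List<List<String>>): ByteArray {
    val bytes = ByteArrayOutputStream()
    ZipOutputStream(bytes).use { zos ->
        zos.putNextEntry(ZipEntry("file1.csv"))
        // The writer, not the ZipOutputStream, decides how characters become bytes.
        val writer = OutputStreamWriter(zos, StandardCharsets.UTF_8)
        for (row in rows) {
            // Simplified CSV: plain comma join, no quoting or escaping.
            writer.write(row.joinToString(","))
            writer.write("\n")
        }
        // Flush before closing the entry so buffered characters are not lost.
        writer.flush()
        zos.closeEntry()
    }
    return bytes.toByteArray()
}

// Hypothetical helper: reads the first entry back and decodes it as UTF-8.
fun unzipFirstEntry(zip: ByteArray): String =
    ZipInputStream(ByteArrayInputStream(zip)).use { zis ->
        zis.nextEntry // position the stream at the first entry
        zis.readBytes().toString(StandardCharsets.UTF_8)
    }
```

Calling `unzipFirstEntry(zipCsv(listOf(listOf("Größe", "Müller"))))` should return the original text with the umlauts intact. A CSV library that accepts a Writer could presumably be handed the same `OutputStreamWriter` in place of the manual `write` calls.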

Upvotes: 2

Henry

Reputation: 43738

From the documentation:

charset - the charset to be used to encode the entry names and comment

So setting UTF-8 there has no effect on the content, which already has to be a stream of bytes by the time it reaches the ZipOutputStream.

The problem must occur in CsvWriter.

Upvotes: 2
