splitting file based on 5000 bytes

Question

I am splitting a file using the code below:

    int sizeOfFiles = 1024 * 3; // 3 KB (3072 bytes)
    byte[] buffer = new byte[sizeOfFiles];

    // String fileName = f.getName();

    //try-with-resources to ensure closing stream
    try (ByteArrayInputStream fis = new ByteArrayInputStream(f);) {

        int bytesAmount = 0;
        int i=0;
        while ((bytesAmount = fis.read(buffer)) > 0) {

            String result="";
            for (byte b : buffer) {
                result+=(char)b;
            }

            System.out.println(result);

            System.out.print("--------------------------------------------------------");
        }
    }
}

But when I copy the first 3072-byte chunk from the output and paste it into Notepad++, it shows that the same data is larger than 3072 bytes. Can you please help me with this issue?

Note: I am using Windows Server and Eclipse, and the file (or string) is encoded in UTF-8.

Stephen C · Accepted Answer

The first problem is that there is a bug in this line:

    for (byte b : buffer) {

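You are assuming that all of the byte positions in buffer contain data. But what if the read call returned fewer than sizeOfFiles bytes? Then the tail of buffer still holds bytes left over from the previous read (or zeros on the first read), and they get appended to result as well.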
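A minimal sketch of one way to fix this, assuming the input is already a byte array as in the question. The class name ChunkReader and the method split are hypothetical, introduced only for illustration. The point is that after each read only the first bytesAmount bytes of buffer are valid, so only that range is copied.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original question.
final class ChunkReader {
    // Splits data into chunks of at most chunkSize bytes each.
    static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        byte[] buffer = new byte[chunkSize];
        try (ByteArrayInputStream in = new ByteArrayInputStream(data)) {
            int bytesAmount;
            while ((bytesAmount = in.read(buffer)) > 0) {
                // Only buffer[0 .. bytesAmount) was filled by this read;
                // copy exactly that range and ignore the stale tail.
                chunks.add(Arrays.copyOf(buffer, bytesAmount));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 7000 bytes in 3072-byte chunks: the last chunk is shorter.
        for (byte[] chunk : split(new byte[7000], 3072)) {
            System.out.println(chunk.length); // prints 3072, 3072, 856
        }
    }
}
```

With the original loop, the third iteration would have processed all 3072 buffer positions even though only 856 of them were new.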

The second problem is that this line is liable to mangle the data.

    result += (char) b;

You are taking each byte of input and casting it to a character. But if the input file is binary, those bytes don't represent characters. Alternatively, if the input is text, a single character in the input may be encoded as 2 or more bytes, as happens in UTF-8. Either way, when you cast from a byte to a char you are not getting proper Unicode code units to append to the string.

(The only case where what you are doing would "work" is if the input file is pure ASCII text. Even LATIN-1 encoded text would need (char) (b & 0xFF), because Java's byte type is signed.)

This mangling may well be increasing the number of bytes relative to the input stream, especially if you are outputting in UTF-8. Java's byte is signed, so any input byte in the range 0x80 to 0xFF is sign-extended by the cast to a char in the range U+FF80 to U+FFFF, and each such char takes 3 bytes when encoded in UTF-8.
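The growth can be checked with a small experiment. The sketch below is illustrative only (the class and method names are hypothetical). It compares casting each byte to a char, as the question does, with decoding the bytes using the charset they were written in. Note that a Java byte is signed, so the cast sign-extends any byte with the high bit set.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
final class CastVersusDecode {
    // Rebuilds a string the way the question does: one char per byte.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b); // sign-extends bytes 0x80..0xFF to U+FF80..U+FFFF
        }
        return sb.toString();
    }

    // Decodes the bytes using the charset they were actually encoded in.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] input = "\u00e9".getBytes(StandardCharsets.UTF_8);

        int viaCast = castEachByte(input).getBytes(StandardCharsets.UTF_8).length;
        int viaDecode = decodeUtf8(input).getBytes(StandardCharsets.UTF_8).length;

        System.out.println(input.length); // prints 2
        System.out.println(viaCast);      // prints 6
        System.out.println(viaDecode);    // prints 2
    }
}
```

Decoding with the right charset round-trips to the original byte count, while the per-byte cast triples the size of every non-ASCII byte once the string is written back out as UTF-8.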

Finally, when you use println to output the string you are adding an extra line separator after each buffer-full of data.
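If the goal is to emit each chunk byte-for-byte, one option is to skip the String conversion entirely and write the bytes directly. The following is a rough sketch under that assumption. The class and method names are hypothetical, and an in-memory stream stands in for System.out so the byte counts can be measured.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
final class RawChunkOutput {
    // Writes the first bytesAmount bytes of buffer unchanged and
    // returns how many bytes reached the underlying stream.
    static int writeRaw(byte[] buffer, int bytesAmount) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true);
        out.write(buffer, 0, bytesAmount); // no char conversion, no separator
        out.flush();
        return sink.size();
    }

    // Prints text with println and returns how many bytes reached the stream.
    static int writeWithPrintln(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true);
        out.println(text); // appends the platform line separator
        out.flush();
        return sink.size();
    }

    public static void main(String[] args) {
        byte[] chunk = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(writeRaw(chunk, chunk.length)); // prints 3
        // Assuming an ASCII-compatible default charset, this is 3 plus the
        // line separator length (2 extra bytes for "\r\n" on Windows).
        System.out.println(writeWithPrintln("abc"));
    }
}
```

In the question's loop, the equivalent call would be System.out.write(buffer, 0, bytesAmount), since System.out is itself a PrintStream.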
