ModdyFire

Reputation: 736

How to convert chunks of UTF-8 bytes to characters?

I have a large UTF-8 input that is divided into 1 kB chunks. I need to process it using a method that accepts a String. Something like:

for (File file : inputs) {
    byte[] b = FileUtils.readFileToByteArray(file);
    // Each chunk is decoded on its own, so a character split across two files is lost
    String str = new String(b, "UTF-8");
    processor.process(str);
}

My problem is that nothing guarantees a multi-byte UTF-8 character is not split between two chunks. When I run my code, some chunks end with '?', which corrupts my input.

What would be a good approach to solve this?

Upvotes: 5

Views: 699

Answers (1)

erickson

Reputation: 269647

If I understand correctly, you had a large text, which was encoded with UTF-8, then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across file boundaries, and cause a UTF-8 decoding error.

The API is a bit dusty, but there is a SequenceInputStream that will create what appears to be a single InputStream from a series of sub-streams. Create one of these with a collection of FileInputStream instances, in the original chunk order, then create an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application. Because the decoder sees one continuous byte stream, a character whose bytes straddle a file boundary is decoded normally.
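Here is a minimal sketch of that idea, not a definitive implementation. The class name ChunkedUtf8 and the helpers openChunks and readAll are made up for illustration, and the chunks are in-memory ByteArrayInputStream instances so the example runs on its own; with real files you would pass FileInputStream instances in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ChunkedUtf8 {

    // Hypothetical helper: presents the chunk streams, in list order,
    // as a single UTF-8 character stream.
    static Reader openChunks(List<? extends InputStream> chunks) {
        InputStream joined = new SequenceInputStream(Collections.enumeration(chunks));
        return new InputStreamReader(joined, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: drains a Reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "h\u00e9llo" is "hello" with an e-acute, which takes two bytes in UTF-8.
        byte[] all = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        // Split between the two bytes of the accented character.
        byte[] first = Arrays.copyOfRange(all, 0, 2);
        byte[] second = Arrays.copyOfRange(all, 2, all.length);

        List<InputStream> chunks = Arrays.asList(
                new ByteArrayInputStream(first),
                new ByteArrayInputStream(second));

        // Closing the Reader also closes the underlying streams.
        try (Reader reader = openChunks(chunks)) {
            String text = readAll(reader);
            System.out.println(text.equals("h\u00e9llo")); // prints true
        }
    }
}
```

If the whole text is too large to hold as one String, you could wrap the Reader in a BufferedReader and hand the processor one line at a time instead. Also note that building the list up front opens every file at once; with many chunks, a custom Enumeration that opens each FileInputStream only when it is requested may be a better fit.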

Upvotes: 3
