Reputation: 736
I have a large UTF-8 input that is divided into 1 kB chunks. I need to process it using a method that accepts a String. Something like:
for (File file : inputs) {
    byte[] b = FileUtils.readFileToByteArray(file);
    String str = new String(b, "UTF-8");
    processor.process(str);
}
My problem is that I have no guarantee that a multi-byte UTF-8 character is not split between two chunks. When that happens, the partial character cannot be decoded, so some lines end with '?', which corrupts my input.
What would be a good approach to solve this?
Upvotes: 5
Views: 699
Reputation: 269647
If I understand correctly, you have a large text that was encoded as UTF-8 and then split into 1-kilobyte files. Now you want to read the text back, but you are concerned that an encoded character might be split across a file boundary and cause a UTF-8 decoding error.
The API is a bit dusty, but there is a SequenceInputStream that will create what appears to be a single InputStream from a series of sub-streams. Create one of these with a collection of FileInputStream instances, then create an InputStreamReader that decodes the stream of UTF-8 bytes to text for your application.
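A minimal sketch of that idea follows. The class name ChunkedUtf8 and the helper names openAsSingleReader and readAll are made up for illustration, and in-memory ByteArrayInputStream chunks stand in for your files so the example runs on its own. The demo splits the two-byte encoding of 'é' across two chunks to show that the joined stream decodes it correctly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkedUtf8 {

    // Hypothetical helper: presents the chunks, in list order, as one
    // continuous byte stream and decodes that stream as UTF-8.
    public static Reader openAsSingleReader(List<InputStream> chunks) {
        InputStream joined = new SequenceInputStream(Collections.enumeration(chunks));
        return new InputStreamReader(joined, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: drains a Reader into a String (demo only; a large
    // input would be consumed piece by piece instead).
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "h\u00e9llo";                        // 'é' is two bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Split in the middle of the two-byte sequence for 'é'.
        List<InputStream> chunks = new ArrayList<>();
        chunks.add(new ByteArrayInputStream(utf8, 0, 2));
        chunks.add(new ByteArrayInputStream(utf8, 2, utf8.length - 2));

        try (Reader reader = openAsSingleReader(chunks)) {
            String decoded = readAll(reader);
            System.out.println(decoded.equals(original));
        }
    }
}
```

With real files you would put FileInputStream instances in the list instead. Note that building the list up front opens every file at once; if there are many chunks, a custom Enumeration that opens each file only when nextElement() is called would avoid holding that many file handles. Since your processor takes a String, you would still read from the Reader in pieces (for example line by line through a BufferedReader, assuming your data is line-oriented) and pass each piece on.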
Upvotes: 3