Reputation: 311
How do I split a string containing non-ASCII characters based on a byte-size limit? I want to split the string below into chunks and add them to a List, where each chunk is limited to a given size in bytes (e.g. 3 bytes).
The problem is that each of these non-ASCII characters takes 2 bytes in UTF-8, so a split can land in the middle of a character and the data becomes junk, as shown in the actual output.
What I want is the expected output given below. It's OK to write only 2 bytes when the next character is non-ASCII and would not fit. Please let me know how to resolve it. Problem:
String words = "Hello woræd æåéøòôóâ";
List<String> payloads = new ArrayList<>();
try (ByteArrayOutputStream outStream = new ByteArrayOutputStream()) {
    byte[] chars = words.getBytes(StandardCharsets.UTF_8);
    for (byte ch : chars) {
        outStream.write(ch);
        if (outStream.size() >= 3) {
            String s = outStream.toString("UTF-8");
            payloads.add(s);
            outStream.flush();
            outStream.reset();
        }
    }
    payloads.add(outStream.toString("UTF-8"));
    outStream.flush();
    System.out.println(payloads);
} catch (IOException e) {
    e.printStackTrace();
}
Actual Output: [Hel, lo , wor, æd, �, �å, é�, �ò, ô�, �â, ]
Expected output: [Hel, lo , wor, æd,  æ, å, é, ø, ò, ô, ó, â]
Upvotes: 0
Views: 518
Reputation: 1100
It's UTF-8, and UTF-8 is designed so that you can easily detect character boundaries.
So: convert the String to UTF-8 bytes.
Then, for each chunk, start at the size limit and backtrack until the first excluded byte is a legitimate 'first byte', i.e. not of the form 10xxxxxx (a continuation byte). You are now positioned at a character boundary.
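The backtracking idea could be sketched roughly as follows. This is only an illustration, not a tested library routine: the class name `Utf8Splitter` and the method `split` are made up for this example, and it assumes the limit is large enough to hold every single character in the input (a character outside the BMP needs 4 bytes, so with a limit of 3 the sketch throws instead of looping forever).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption made for this example.
final class Utf8Splitter {

    // Splits text into chunks whose UTF-8 encoding is at most maxBytes long,
    // never cutting through the middle of a multi-byte character.
    static List<String> split(String text, int maxBytes) {
        if (maxBytes < 1) {
            throw new IllegalArgumentException("maxBytes must be at least 1");
        }
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < bytes.length) {
            // Tentative end of the chunk (exclusive), capped at the end of the data.
            int end = Math.min(start + maxBytes, bytes.length);
            // Backtrack while the first excluded byte is a continuation byte
            // (bit pattern 10xxxxxx), i.e. while we are inside a character.
            while (end < bytes.length && end > start && (bytes[end] & 0xC0) == 0x80) {
                end--;
            }
            if (end == start) {
                // A single character is longer than maxBytes; no valid split exists.
                throw new IllegalArgumentException(
                        "maxBytes is too small for the character at byte offset " + start);
            }
            chunks.add(new String(bytes, start, end - start, StandardCharsets.UTF_8));
            start = end;
        }
        return chunks;
    }
}
```

With the string from the question, `Utf8Splitter.split(words, 3)` should give `[Hel, lo , wor, æd,  æ, å, é, ø, ò, ô, ó, â]`. Unlike the code in the question, an empty trailing chunk is never added, because the loop stops once all bytes are consumed.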
Upvotes: 1