Raj

Reputation: 311

How to split a string containing non ascii characters based on the byte size limit?

I want to split the string below into chunks and add them to a List, where each chunk is limited to a given byte size (e.g. 3 bytes).

The problem is that each non-ASCII character here takes 2 bytes in UTF-8, so a split can land in the middle of a character and the data becomes junk, as shown in the actual output.

What I want is the expected output given below. It is OK to write only 2 bytes to a chunk when the next character is non-ASCII and would not fit. Please let me know how to resolve this.

Problem:

        String words = "Hello woræd  æåéøòôóâ";
        List<String> payloads = new ArrayList<>();
        try (ByteArrayOutputStream outStream = new ByteArrayOutputStream()) {
            byte[] chars = words.getBytes(StandardCharsets.UTF_8);
            for (byte ch : chars) {
                outStream.write(ch);
                if (outStream.size() >= 3) {
                    String s = outStream.toString("UTF-8");
                    payloads.add(s);
                    outStream.flush();
                    outStream.reset();
                }
            }
            payloads.add(outStream.toString("UTF-8"));
            outStream.flush();
            System.out.println(payloads);
        } catch (IOException e) {
            e.printStackTrace();
        }

Actual Output: [Hel, lo , wor, æd, �, �å, é�, �ò, ô�, �â, ]

Expected output: [Hel, lo , wor, æd, ,æ, å, é, ø, ò, ô, ó, â]

Upvotes: 0

Views: 518

Answers (1)

user16632363

Reputation: 1100

It's UTF-8. UTF-8 is designed so that you can easily detect character boundaries.

So: convert the String to UTF-8 bytes and take as many bytes as the limit allows.

Then backtrack until the first excluded byte is a legitimate 'first byte', i.e. not a continuation byte of the form 10xxxxxx. You are now positioned at a character boundary.
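
A minimal sketch of the backtracking described above, in Java to match the question. The class name Utf8Splitter and the method split are hypothetical names chosen for illustration, and the sketch assumes the input is a well-formed string and that the limit is at least as large as the widest character in it (otherwise it throws rather than looping forever). The sample string in main is the one from the question, written with Unicode escapes so the source file encoding does not matter.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class Utf8Splitter {

    // Splits text into chunks of at most maxBytes UTF-8 bytes each,
    // never cutting inside a multi-byte character.
    static List<String> split(String text, int maxBytes) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < bytes.length) {
            // Tentative cut point: as many bytes as the limit allows.
            int end = Math.min(start + maxBytes, bytes.length);
            // Backtrack while the first excluded byte is a continuation
            // byte (bit pattern 10xxxxxx), i.e. we are mid-character.
            while (end > start && end < bytes.length
                    && (bytes[end] & 0xC0) == 0x80) {
                end--;
            }
            if (end == start) {
                // Assumption: a character wider than the limit is an error.
                throw new IllegalArgumentException(
                        "maxBytes too small for character at byte " + start);
            }
            chunks.add(new String(bytes, start, end - start, StandardCharsets.UTF_8));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // "Hello woræd  æåéøòôóâ" from the question, with escapes.
        String words = "Hello wor\u00e6d  \u00e6\u00e5\u00e9\u00f8\u00f2\u00f4\u00f3\u00e2";
        System.out.println(split(words, 3));
    }
}
```

With a limit of 3, the ASCII runs come out as full 3-byte chunks ("Hel", "lo ", "wor"), "æd" fills 3 bytes exactly, the two spaces form a 2-byte chunk because the following æ would not fit, and each remaining accented character lands in its own 2-byte chunk.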

Upvotes: 1
