Pankaj Singhal

Reputation: 16053

Stream of Char to Stream of Byte/Byte Array

The following code takes a String s, converts it into a char array, keeps only the digits, converts the result to a String, and then converts that into a byte array.

char charArray[] = s.toCharArray();
StringBuffer sb = new StringBuffer(charArray.length);
for(int i=0; i<=charArray.length-1; i++) {
    if (Character.isDigit(charArray[i]))
        sb.append(charArray[i]);
}
byte[] bytes = sb.toString().getBytes(Charset.forName("UTF-8")); 

I'm trying to rewrite the above code using streams. The following works:

s.chars()
.sequential()
.mapToObj(ch -> (char) ch)
.filter(Character::isDigit)
.collect(StringBuilder::new,
        StringBuilder::append, StringBuilder::append)
.toString()
.getBytes(Charset.forName("UTF-8"));

I think there could be a better way to do it.

Can we convert a Stream<Character> directly to byte[] and skip the intermediate conversion to String?

Upvotes: 1

Views: 412

Answers (1)

Holger

Reputation: 298123

First, both of your variants have the problem of not handling characters outside the BMP correctly.

To support these characters, there is codePoints() as an alternative to chars(). You can use appendCodePoint on the target StringBuilder to consistently use codepoints throughout the entire operation. For this, you have to remove the unnecessary .mapToObj(ch -> (char) ch) step, whose removal also eliminates the overhead of creating a Stream<Character>.
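To make the difference concrete, here is a minimal sketch comparing both approaches on a string that contains a digit outside the BMP. The class and method names (CodePointDigits, digitsByChar, digitsByCodePoint) are made up for this illustration.

```java
class CodePointDigits {
    // Tests each UTF-16 code unit separately, as the chars()-based variants do.
    static String digitsByChar(String s) {
        StringBuilder sb = s.chars()
            .filter(c -> Character.isDigit((char) c))
            .collect(StringBuilder::new, (b, c) -> b.append((char) c), StringBuilder::append);
        return sb.toString();
    }

    // Tests whole code points, so digits encoded as surrogate pairs are kept.
    static String digitsByCodePoint(String s) {
        StringBuilder sb = s.codePoints()
            .filter(Character::isDigit)
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D7DD (MATHEMATICAL DOUBLE-STRUCK DIGIT FIVE) needs two chars in UTF-16.
        String s = "a1" + new String(Character.toChars(0x1D7DD));
        System.out.println(digitsByChar(s));      // only the ASCII digit survives
        System.out.println(digitsByCodePoint(s)); // both digits survive
    }
}
```

In the chars() variant, each half of the surrogate pair is tested on its own, and neither half counts as a digit, so the character is silently dropped.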

Then, you can avoid the conversion to a String in both cases, by encoding the StringBuilder using the Charset directly. In the case of the stream variant:

StringBuilder sb = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
             StringBuilder::appendCodePoint, StringBuilder::append);

ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
byte[] utf8Bytes = new byte[bb.remaining()];
bb.get(utf8Bytes);

Performing the conversion directly with the stream of codepoints is not easy. Not only is there no such support in the Charset API, there is also no straightforward way to collect a Stream into a byte[] array.

One possibility is

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .flatMap(c -> c<128? IntStream.of(c):
        c<0x800? IntStream.of((c>>>6)|0xC0, c&0x3f|0x80):
        c<0x10000? IntStream.of((c>>>12)|0xE0, (c>>>6)&0x3f|0x80, c&0x3f|0x80):
        IntStream.of((c>>>18)|0xF0, (c>>>12)&0x3f|0x80, (c>>>6)&0x3f|0x80, c&0x3f|0x80))
    .collect(
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
        },
        (b,i) -> {
            if(b.array.length == b.size) b.array=Arrays.copyOf(b.array, b.size*2);
            b.array[b.size++] = (byte)i;
        },
        (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
        }).result();

The flatMap step converts the stream of codepoints to a stream of UTF-8 code units (compare with the UTF-8 description on Wikipedia). The collect step collects the int values into a byte[] array.
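If you want to convince yourself that the bit arithmetic is right, a small sketch like the following applies the same shifts and masks to a single codepoint and compares the result with what the JDK encoder produces. The class name Utf8ByHand and the method encode are hypothetical helpers, not part of any API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes one codepoint using the same shifts and masks as the flatMap step.
    static byte[] encode(int c) {
        if (c < 0x80) {
            return new byte[] { (byte) c };
        }
        if (c < 0x800) {
            return new byte[] { (byte) ((c >>> 6) | 0xC0), (byte) (c & 0x3F | 0x80) };
        }
        if (c < 0x10000) {
            return new byte[] { (byte) ((c >>> 12) | 0xE0),
                                (byte) ((c >>> 6) & 0x3F | 0x80),
                                (byte) (c & 0x3F | 0x80) };
        }
        return new byte[] { (byte) ((c >>> 18) | 0xF0),
                            (byte) ((c >>> 12) & 0x3F | 0x80),
                            (byte) ((c >>> 6) & 0x3F | 0x80),
                            (byte) (c & 0x3F | 0x80) };
    }

    public static void main(String[] args) {
        // One sample per encoded length: 1, 2, 3 and 4 bytes.
        int[] samples = { '7', 0x0663, 0xFF13, 0x1D7DD };
        for (int c : samples) {
            byte[] byHand = encode(c);
            byte[] byLibrary = new String(Character.toChars(c)).getBytes(StandardCharsets.UTF_8);
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()
                + " " + Arrays.toString(byHand)
                + " matches library: " + Arrays.equals(byHand, byLibrary));
        }
    }
}
```

Note that this sketch, like the stream code, assumes valid codepoints as input; it does no checking for unpaired surrogates.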

It’s possible to eliminate the flatMap step by creating a dedicated collector which collects a stream of codepoints directly into a byte[] array:

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
            void put(int c) {
                if(array.length == size) array=Arrays.copyOf(array, size*2);
                array[size++] = (byte)c;
            }
        },
        (b,c) -> {
            if(c < 128) b.put(c);
            else {
                if(c<0x800) b.put((c>>>6)|0xC0);
                else {
                    if(c<0x10000) b.put((c>>>12)|0xE0);
                    else {
                        b.put((c>>>18)|0xF0);
                        b.put((c>>>12)&0x3f|0x80);
                    }
                    b.put((c>>>6)&0x3f|0x80);
                }
                b.put(c&0x3f|0x80);
            }
       },
       (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
       }).result();

but it doesn’t add to readability.
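For comparison, here is a sketch that swaps the hand-written growing array for a java.io.ByteArrayOutputStream. This is a substitution of mine for illustration, not one of the variants above; the names DigitsToUtf8, utf8Units and digitsAsUtf8 are made up. It is shorter, but the result is still copied once by toByteArray().

```java
import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

class DigitsToUtf8 {
    // Same codepoint-to-UTF-8 mapping as the flatMap step shown earlier.
    static IntStream utf8Units(int c) {
        return c < 128 ? IntStream.of(c)
            : c < 0x800 ? IntStream.of((c >>> 6) | 0xC0, c & 0x3f | 0x80)
            : c < 0x10000 ? IntStream.of((c >>> 12) | 0xE0, (c >>> 6) & 0x3f | 0x80, c & 0x3f | 0x80)
            : IntStream.of((c >>> 18) | 0xF0, (c >>> 12) & 0x3f | 0x80,
                           (c >>> 6) & 0x3f | 0x80, c & 0x3f | 0x80);
    }

    static byte[] digitsAsUtf8(String s) {
        ByteArrayOutputStream out = s.codePoints()
            .filter(Character::isDigit)
            .flatMap(DigitsToUtf8::utf8Units)
            .collect(ByteArrayOutputStream::new,
                     (o, b) -> o.write(b),                             // write(int) stores the low 8 bits
                     (a, b) -> a.write(b.toByteArray(), 0, b.size())); // merge for parallel streams
        return out.toByteArray();
    }
}
```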

You can test the solutions using a String like

String s = "some test text 1234 ✔ ３ 𝟝";

and printing the result as

System.out.println(Arrays.toString(utf8Bytes));
System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));

which should produce

[49, 50, 51, 52, -17, -68, -109, -16, -99, -97, -99]
1234３𝟝

It should be obvious that the first variant is the simplest, and it will have reasonable performance, even if it doesn’t create a byte[] array directly. Further, it’s the only variant which can be adapted for getting other result charsets.
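As a rough sketch of that adaptability, the StringBuilder-based variant can take the target Charset as a parameter. The class name DigitsInCharset and the method digits are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DigitsInCharset {
    // Keeps only the digits of s and encodes them in the given charset,
    // without creating an intermediate String.
    static byte[] digits(String s, Charset cs) {
        StringBuilder sb = s.codePoints()
            .filter(Character::isDigit)
            .collect(StringBuilder::new,
                     StringBuilder::appendCodePoint, StringBuilder::append);
        ByteBuffer bb = cs.encode(CharBuffer.wrap(sb));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(digits("ab12", StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(digits("ab12", StandardCharsets.UTF_16BE)));
    }
}
```

Keep in mind that Charset.encode replaces characters the target charset cannot represent, so encoding non-ASCII digits as, say, US-ASCII yields replacement bytes rather than an exception.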

But even the

byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
             StringBuilder::appendCodePoint, StringBuilder::append)
    .toString().getBytes(StandardCharsets.UTF_8);

is not so bad, regardless of whether the toString() operation involves copying the characters.

Upvotes: 4
