Reputation: 16053
The following code takes a String s, converts it into a char array, filters the digits from it, converts the result to a String, and then converts that into a byte array.
char charArray[] = s.toCharArray();
StringBuffer sb = new StringBuffer(charArray.length);
for(int i=0; i<=charArray.length-1; i++) {
    if (Character.isDigit(charArray[i]))
        sb.append(charArray[i]);
}
byte[] bytes = sb.toString().getBytes(Charset.forName("UTF-8"));
I'm trying to convert the above code to a stream-based approach. The following works:
s.chars()
    .sequential()
    .mapToObj(ch -> (char) ch)
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
        StringBuilder::append, StringBuilder::append)
    .toString()
    .getBytes(Charset.forName("UTF-8"));
I think there could be a better way to do it.
Can we directly convert a Stream<Character> to a byte[] and skip the conversion to String in between?
Upvotes: 1
Views: 412
Reputation: 298123
First, both of your variants have the problem of not handling characters outside the BMP correctly.
To support these characters, there is codePoints() as an alternative to chars(). You can use appendCodePoint on the target StringBuilder to consistently use codepoints throughout the entire operation. For this, you have to remove the unnecessary .mapToObj(ch -> (char) ch) step, whose removal also eliminates the overhead of creating a Stream<Character>.
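To make the BMP issue concrete, here is a minimal sketch comparing the two approaches on a string containing a supplementary digit. The class and method names are hypothetical, and the chars() variant is written with an explicit cast rather than mapToObj, which is assumed to behave the same for this purpose.

```java
public class CodePointDemo {
    // Hypothetical helper: tests each UTF-16 code unit on its own, like the chars() variant.
    static String digitsViaChars(String s) {
        return s.chars()
                .filter(ch -> Character.isDigit((char) ch))
                .collect(StringBuilder::new,
                         (sb, ch) -> sb.append((char) ch),
                         StringBuilder::append)
                .toString();
    }

    // Hypothetical helper: tests whole code points, so surrogate pairs stay together.
    static String digitsViaCodePoints(String s) {
        return s.codePoints()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // "\uD835\uDFDD" is U+1D7DD (mathematical double-struck digit five), outside the BMP.
        String s = "a1\uD835\uDFDD";
        System.out.println(digitsViaChars(s).length());      // counts UTF-16 units kept
        System.out.println(digitsViaCodePoints(s).length());
    }
}
```

With this input, the chars() variant should keep only the ASCII digit, because neither surrogate half is a digit on its own, while the codePoints() variant keeps both digits.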
Then, you can avoid the conversion to a String in both cases by encoding the StringBuilder using the Charset directly. In the case of the stream variant:
StringBuilder sb = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
        StringBuilder::appendCodePoint, StringBuilder::append);
ByteBuffer bb = StandardCharsets.UTF_8.encode(CharBuffer.wrap(sb));
byte[] utf8Bytes = new byte[bb.remaining()];
bb.get(utf8Bytes);
Performing the conversion directly with the stream of codepoints is not easy. Not only is there no such support in the Charset API, there is also no straightforward way to collect a Stream into a byte[] array.
One possibility is
byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .flatMap(c -> c<128? IntStream.of(c):
        c<0x800? IntStream.of((c>>>6)|0xC0, c&0x3f|0x80):
        c<0x10000? IntStream.of((c>>>12)|0xE0, (c>>>6)&0x3f|0x80, c&0x3f|0x80):
        IntStream.of((c>>>18)|0xF0, (c>>>12)&0x3f|0x80, (c>>>6)&0x3f|0x80, c&0x3f|0x80))
    .collect(
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
        },
        (b,i) -> {
            if(b.array.length == b.size) b.array=Arrays.copyOf(b.array, b.size*2);
            b.array[b.size++] = (byte)i;
        },
        (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
        }).result();
The flatMap step converts the stream of codepoints to a stream of UTF-8 code units (compare with the UTF-8 description on Wikipedia). The collect step collects the int values into a byte[] array.
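As a sanity check on the bit arithmetic, the following sketch pulls the same branching into a standalone method and compares its output against the JDK's own UTF-8 encoder for one sample digit per encoded length. The class name, method names, and sample code points are illustrative assumptions, not part of the solution above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.IntStream;

public class Utf8UnitsDemo {
    // Same branching as the flatMap step: 1, 2, 3 or 4 UTF-8 code units per code point.
    static IntStream utf8Units(int c) {
        return c < 0x80    ? IntStream.of(c)
             : c < 0x800   ? IntStream.of((c >>> 6) | 0xC0, (c & 0x3F) | 0x80)
             : c < 0x10000 ? IntStream.of((c >>> 12) | 0xE0, ((c >>> 6) & 0x3F) | 0x80,
                                          (c & 0x3F) | 0x80)
             : IntStream.of((c >>> 18) | 0xF0, ((c >>> 12) & 0x3F) | 0x80,
                            ((c >>> 6) & 0x3F) | 0x80, (c & 0x3F) | 0x80);
    }

    // Narrows the int code units to bytes.
    static byte[] manual(int c) {
        int[] units = utf8Units(c).toArray();
        byte[] result = new byte[units.length];
        for (int i = 0; i < units.length; i++) result[i] = (byte) units[i];
        return result;
    }

    // Reference encoding produced by the JDK.
    static byte[] reference(int c) {
        return new String(Character.toChars(c)).getBytes(StandardCharsets.UTF_8);
    }

    static boolean matches(int c) {
        return Arrays.equals(manual(c), reference(c));
    }

    public static void main(String[] args) {
        // One digit per encoded length: ASCII, Arabic-Indic, fullwidth, supplementary.
        int[] samples = {'1', 0x663, 0xFF13, 0x1D7DD};
        for (int c : samples) {
            System.out.printf("U+%04X -> %s (matches JDK: %b)%n",
                    c, Arrays.toString(manual(c)), matches(c));
        }
    }
}
```

Note that the manual encoding does not special-case unpaired surrogates, which the JDK encoder would replace, so the comparison is only meaningful for valid code points. That is not a concern here, since surrogates never pass the isDigit filter.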
It’s possible to eliminate the flatMap step by creating a dedicated collector which collects a stream of codepoints directly into a byte[] array:
byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(
        () -> new Object() { byte[] array = new byte[8]; int size;
            byte[] result(){ return array.length==size? array: Arrays.copyOf(array,size); }
            void put(int c) {
                if(array.length == size) array=Arrays.copyOf(array, size*2);
                array[size++] = (byte)c;
            }
        },
        (b,c) -> {
            if(c < 128) b.put(c);
            else {
                if(c<0x800) b.put((c>>>6)|0xC0);
                else {
                    if(c<0x10000) b.put((c>>>12)|0xE0);
                    else {
                        b.put((c>>>18)|0xF0);
                        b.put((c>>>12)&0x3f|0x80);
                    }
                    b.put((c>>>6)&0x3f|0x80);
                }
                b.put(c&0x3f|0x80);
            }
        },
        (a,b) -> {
            if(a.array.length<a.size+b.size) a.array=Arrays.copyOf(a.array,a.size+b.size);
            System.arraycopy(b.array, 0, a.array, a.size, b.size);
            a.size+=b.size;
        }).result();
but it doesn’t add to readability.
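If the hand-written growable array is the main readability cost, one possible alternative (a swapped-in technique, not one of the variants above) is to use a ByteArrayOutputStream as the mutable container. This is only a sketch; the class and method names are hypothetical, and ByteArrayOutputStream's synchronized methods may cost some performance.

```java
import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

public class BaosCollectDemo {
    // Same UTF-8 unit expansion as in the flatMap variant.
    static IntStream utf8Units(int c) {
        return c < 0x80    ? IntStream.of(c)
             : c < 0x800   ? IntStream.of((c >>> 6) | 0xC0, (c & 0x3F) | 0x80)
             : c < 0x10000 ? IntStream.of((c >>> 12) | 0xE0, ((c >>> 6) & 0x3F) | 0x80,
                                          (c & 0x3F) | 0x80)
             : IntStream.of((c >>> 18) | 0xF0, ((c >>> 12) & 0x3F) | 0x80,
                            ((c >>> 6) & 0x3F) | 0x80, (c & 0x3F) | 0x80);
    }

    // Hypothetical helper: collects the UTF-8 units of all digits into a byte[].
    static byte[] digitsToUtf8(String s) {
        return s.codePoints()
                .filter(Character::isDigit)
                .flatMap(BaosCollectDemo::utf8Units)
                .collect(ByteArrayOutputStream::new,
                         (out, unit) -> out.write(unit),   // writes the low 8 bits
                         (a, b) -> a.write(b.toByteArray(), 0, b.size()))
                .toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = digitsToUtf8("room 42, floor 7");
        System.out.println(java.util.Arrays.toString(bytes));
    }
}
```

The write(int) and write(byte[], int, int) overloads on ByteArrayOutputStream declare no checked exceptions, which is what allows them to be used inside the lambdas.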
You can test the solutions using a String like
String s = "some test text 1234 ✔ ３ 𝟝";
(where ３ is the fullwidth digit three, U+FF13, and 𝟝 is the supplementary character U+1D7DD) and printing the result as
System.out.println(Arrays.toString(utf8Bytes));
System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8));
which should produce
[49, 50, 51, 52, -17, -68, -109, -16, -99, -97, -99]
1234３𝟝
It should be obvious that the first variant is the simplest, and it will have reasonable performance, even if it doesn’t create a byte[] array directly. Further, it’s the only variant which can be adapted to produce other target charsets.
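To illustrate that adaptability, here is a minimal sketch that wraps the first variant in a method taking the target Charset as a parameter. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    // Hypothetical helper: filters digits by code point, then encodes with any Charset.
    static byte[] digitsToBytes(String s, Charset charset) {
        StringBuilder sb = s.codePoints()
                .filter(Character::isDigit)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append);
        // Encode the StringBuilder directly, without an intermediate String.
        ByteBuffer bb = charset.encode(CharBuffer.wrap(sb));
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(digitsToBytes("a1b2", StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(digitsToBytes("a1b2", StandardCharsets.UTF_16BE)));
    }
}
```

Keep in mind that Charset.encode substitutes the charset's replacement bytes for characters it cannot map, so a non-ASCII digit encoded with, say, ISO-8859-1 would not survive.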
But even the
byte[] utf8Bytes = s.codePoints()
    .filter(Character::isDigit)
    .collect(StringBuilder::new,
        StringBuilder::appendCodePoint, StringBuilder::append)
    .toString().getBytes(StandardCharsets.UTF_8);
variant is not so bad, regardless of whether the toString() operation involves a copying operation.
Upvotes: 4