Reputation: 105053
I need to convert a stream of bytes to a line of UTF-8 characters. The only character in that line that matters to me is the last one. The conversion happens in a loop, so performance is very important. A simple and inefficient approach would be:
public class Foo {
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    void next(byte input) {
        this.buffer.write(input);
        String text = this.buffer.toString("UTF-8"); // this is time consuming
        if (text.charAt(text.length() - 1) == THE_CHAR_WE_ARE_WAITING_FOR) {
            System.out.println("hurray!");
            this.buffer.reset();
        }
    }
}
The byte array is converted to a string on every input byte, which is, as I understand it, very inefficient. Is there some other way to do this that preserves the results of the bytes-to-text conversion from the previous cycle?
Upvotes: 6
Views: 12757
Reputation: 180887
You can use a simple class to keep track of the bytes and only convert once you have a full UTF-8 sequence. Here's a sample (with no error checking, which you may want to add):
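import java.io.UnsupportedEncodingException;

class UTF8Processor {
    // Sized for the legacy 6-byte forms; modern UTF-8 sequences are at most 4 bytes.
    private byte[] buffer = new byte[6];
    private int count = 0;

    // Returns the decoded character once its last byte arrives, otherwise null.
    public String processByte(byte nextByte) throws UnsupportedEncodingException {
        buffer[count++] = nextByte;
        if (count == expectedBytes()) {
            String result = new String(buffer, 0, count, "UTF-8");
            count = 0;
            return result;
        }
        return null;
    }

    // Sequence length implied by the lead byte. No validation is done: a stray
    // continuation byte (0x80-0xBF) in lead position is treated as a 2-byte lead.
    private int expectedBytes() {
        int num = buffer[0] & 255;
        if (num < 0x80) return 1;
        if (num < 0xe0) return 2;
        if (num < 0xf0) return 3;
        if (num < 0xf8) return 4;
        return 5; // 0xF8 and above are not valid lead bytes in modern UTF-8
    }
}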
class Bop
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // Create test data.
        String str = "Hejsan åäö/漢ya";
        byte[] bytes = str.getBytes("UTF-8");
        String ch;

        // Process byte by byte; a valid UTF-8 character is returned
        // as soon as a complete one is available.
        UTF8Processor processor = new UTF8Processor();
        for (int i = 0; i < bytes.length; i++)
        {
            if ((ch = processor.processByte(bytes[i])) != null)
                System.out.println(ch);
        }
    }
}
Upvotes: 6
Reputation: 5537
Based on the comment:
It's line feed (0x0A)
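Your next method can just check:
if ((char) input == THE_CHAR_WE_ARE_WAITING_FOR) {
    // whatever your logic is.
}
You don't need any conversion for characters below 128. In UTF-8 they are encoded as a single byte, and every byte of a multi-byte sequence has its high bit set, so a byte equal to 0x0A is always a line feed.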
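A minimal sketch of that idea, assuming the character being waited for is the line feed mentioned in the comment. The class name LineFeedWatcher and its linesSeen counter are made up for illustration and are not from the question:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: watches raw UTF-8 bytes for a line feed without decoding.
class LineFeedWatcher {
    private static final byte LINE_FEED = 0x0A;
    private int linesSeen = 0;

    // Returns true when the incoming byte is a line feed.
    // Lead and continuation bytes of multi-byte sequences are all >= 0x80,
    // so they can never be mistaken for 0x0A.
    boolean next(byte input) {
        if (input == LINE_FEED) {
            linesSeen++;
            return true;
        }
        return false;
    }

    int linesSeen() {
        return linesSeen;
    }

    public static void main(String[] args) {
        // "å漢", a line feed, "ab", a line feed
        byte[] data = "\u00e5\u6f22\nab\n".getBytes(StandardCharsets.UTF_8);
        LineFeedWatcher watcher = new LineFeedWatcher();
        for (byte b : data) {
            if (watcher.next(b)) {
                System.out.println("hurray!");
            }
        }
    }
}
```

Comparing bytes directly avoids both the buffer and the String allocation, but it only works while the target character is below 128.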
Upvotes: 2
Reputation: 66243
You have two options:
If the code point you are interested in is a simple one in UTF-8 terms, that is, a code point below 128, then a simple cast from byte to char is possible. Look up the encoding rules on Wikipedia: UTF-8 to see why this works.
If that is not possible, take a look at the Charset class, which is the root of Java's encoding/decoding library. There you will find CharsetDecoder, which you can feed N bytes and get back M characters; in general N != M. You will, however, have to deal with ByteBuffer and CharBuffer.
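As a rough sketch of the second option, the class below feeds a CharsetDecoder one byte at a time and keeps incomplete sequences in the ByteBuffer between calls. The class name IncrementalUtf8Decoder, the feed method, and the buffer sizes are assumptions made for this example; replacing malformed input with U+FFFD is likewise just one possible policy:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around CharsetDecoder for byte-at-a-time input.
class IncrementalUtf8Decoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    // Small buffers are enough: a UTF-8 sequence is at most 4 bytes.
    private final ByteBuffer in = ByteBuffer.allocate(8);
    private final CharBuffer out = CharBuffer.allocate(8);

    // Feeds one byte and returns whatever characters became complete,
    // or an empty string while a multi-byte sequence is still pending.
    String feed(byte b) {
        in.put(b);
        in.flip();                        // switch the buffer to reading
        out.clear();
        decoder.decode(in, out, false);   // false: more input may follow
        in.compact();                     // keep undecoded bytes for next call
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "a\u6f22\n".getBytes(StandardCharsets.UTF_8);
        IncrementalUtf8Decoder d = new IncrementalUtf8Decoder();
        for (byte b : data) {
            String s = d.feed(b);
            if (!s.isEmpty() && s.charAt(s.length() - 1) == '\n') {
                System.out.println("hurray!");
            }
        }
    }
}
```

The decoder only converts the bytes that arrived since the last call, so nothing is decoded twice.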
Upvotes: 1
Reputation: 7549
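Wrap your byte-getting code in an InputStream and pass that to an InputStreamReader. Be aware that the reader may read ahead more bytes from the underlying stream than it needs for the current character.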
InputStreamReader isr = new InputStreamReader(new InputStream() {
    @Override
    public int read() throws IOException {
        return xx(); // wherever you get your data from; return -1 at end of stream
    }
}, "UTF-8");
try {
    int c;
    // read() returns one decoded char, or -1 once the stream is exhausted
    while ((c = isr.read()) != -1) {
        if (c == THE_CHAR_WE_ARE_WAITING_FOR)
            System.out.println("hurray!");
    }
} catch (IOException e) {
    e.printStackTrace();
}
Upvotes: 0