yegor256

Reputation: 105053

How to convert stream of bytes to UTF-8 characters?

I need to convert a stream of bytes to a line of UTF-8 characters. The only character in that line that matters to me is the last one. The conversion happens in a loop, so performance is very important. A simple and inefficient approach would be:

public class Foo {
  private ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  void next(byte input) {
    this.buffer.write(input);
    String text = this.buffer.toString("UTF-8"); // this is time consuming
    if (text.charAt(text.length() - 1) == THE_CHAR_WE_ARE_WAITING_FOR) {
      System.out.println("hurray!");
      this.buffer.reset();
    }   
  }
}

The byte array is converted to a string on every input byte, which is, as far as I understand, very inefficient. Is there a way to preserve the result of the bytes-to-text conversion from the previous iteration?

Upvotes: 6

Views: 12757

Answers (4)

Joachim Isaksson

Reputation: 180887

You can use a simple class to keep track of the bytes, and only convert once you have a full UTF-8 sequence. Here's a sample (with no error checking, which you may want to add):

class UTF8Processor {
    private byte[] buffer = new byte[6];
    private int count = 0;

    public String processByte(byte nextByte) throws UnsupportedEncodingException {
        buffer[count++] = nextByte;
        if(count == expectedBytes())
        {
            String result = new String(buffer, 0, count, "UTF-8");
            count = 0;
            return result;
        }
        return null;
    }

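    // Sequence length implied by the lead byte (the first byte in the buffer).
    private int expectedBytes() {
        int num = buffer[0] & 255;
        if(num < 0x80) return 1; // 0xxxxxxx: ASCII
        if(num < 0xe0) return 2; // 110xxxxx (a stray continuation byte 10xxxxxx also lands here)
        if(num < 0xf0) return 3; // 1110xxxx
        if(num < 0xf8) return 4; // 11110xxx
        return 5;                // not valid in modern UTF-8; left unchecked
    }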
}

class Bop
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // Create test data.
        String str = "Hejsan åäö/漢ya";
        byte[] bytes = str.getBytes("UTF-8");

        String ch;

        // Process byte by byte; processByte returns a complete UTF-8
        // character once one is available, and null otherwise.

        UTF8Processor processor = new UTF8Processor();

        for(int i=0; i<bytes.length; i++)
        {
            if((ch = processor.processByte(bytes[i])) != null)
                System.out.println(ch);
        }
    }
}

Upvotes: 6

Aurand

Reputation: 5537

Based on the comment:

It's line feed (0x0A)

Your next method can just check:

if ((char)input == THE_CHAR_WE_ARE_WAITING_FOR) {
    //whatever your logic is.
}

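You don't have to do any conversion for characters below 128: in UTF-8 they are encoded as a single byte with the same value as in ASCII, and bytes below 0x80 never occur inside a multi-byte sequence.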
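To make that concrete, here is a minimal sketch of the question's `next` method under the assumption that the delimiter is the line feed from the comment. The class name `LineSplitter` and the decision to return the finished line are mine, not from the question; the point is only that the raw byte is compared directly and the buffer is decoded once per line rather than once per byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the question): buffers raw bytes and
// decodes them only when the delimiter byte arrives.
class LineSplitter {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    /** Returns the completed line (without the LF) when input is 0x0A, otherwise null. */
    String next(byte input) {
        if (input == 0x0A) {
            // Safe to compare the raw byte: 0x0A never appears inside a
            // multi-byte UTF-8 sequence.
            String line = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            buffer.reset();
            return line;
        }
        buffer.write(input);
        return null;
    }
}
```

This only works because the delimiter is below 128; for a delimiter outside ASCII you would need to compare a byte sequence or actually decode.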

Upvotes: 2

A.H.

Reputation: 66243

You have two options:

  • If the code point you are interested in is simple in UTF-8 terms, that is, a code point below 128, then a simple cast from byte to char is possible. Look up the encoding rules on Wikipedia (UTF-8) to see why this works.

  • If this is not possible, take a look at the Charset class, which is the root of Java's encoding/decoding library. There you will find CharsetDecoder, which you can feed N bytes and get back M characters; in general N != M. However, you will have to deal with ByteBuffer and CharBuffer.
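A rough sketch of the second option, in case it helps. The class name `IncrementalDecoder`, the buffer sizes and the choice to replace malformed input rather than report it are my assumptions; the `CharsetDecoder`, `ByteBuffer` and `CharBuffer` calls are the standard `java.nio` ones. The decoder keeps the bytes of an incomplete sequence in the input buffer, so nothing is re-decoded from scratch on each call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper (not from the question): feeds one byte at a time
// to a CharsetDecoder and returns whatever characters became complete.
class IncrementalDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)       // assumption: replace bad bytes
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private final ByteBuffer in = ByteBuffer.allocate(8);   // holds a pending partial sequence
    private final CharBuffer out = CharBuffer.allocate(4);  // room for a surrogate pair

    /** Returns the characters completed by this byte, or "" if the sequence is still incomplete. */
    String next(byte input) {
        in.put(input);
        in.flip();                        // switch the buffer to reading mode
        out.clear();
        decoder.decode(in, out, false);   // false: more input may follow
        in.compact();                     // keep undecoded bytes for the next call
        out.flip();
        return out.toString();
    }
}
```

Each call returns an empty string until the last byte of a sequence arrives, so the caller can test the returned character against the one it is waiting for.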

Upvotes: 1

Clyde

Reputation: 7549

Wrap your byte-getting code in an InputStream and pass that to an InputStreamReader.

    InputStreamReader isr = new InputStreamReader(new InputStream() {
        @Override
        public int read() throws IOException {
            return xx();// wherever you get your data from.
        }
    }, "UTF-8");
    try {
        int c;
        while((c = isr.read()) != -1) { // read() returns -1 at end of stream
            if(c == THE_CHAR_WE_ARE_WAITING_FOR)
                System.out.println("hurray!");
        }
    } catch(IOException e) {
        e.printStackTrace();
    }

Upvotes: 0
