Gili

Reputation: 90023

What Charset does ByteBuffer.asCharBuffer() use?

What Charset does ByteBuffer.asCharBuffer() use? It seems to convert 3 bytes to one character on my system.

On a related note, how does CharsetDecoder relate to ByteBuffer.asCharBuffer()?

UPDATE: With respect to what implementation of ByteBuffer I am using, I am invoking ByteBuffer.allocate(1024).asCharBuffer(). I can't comment on what implementation gets used under the hood.

Upvotes: 6

Views: 2102

Answers (4)

RajV

Reputation: 7170

I wanted to expand on the answer by @Petteri H. It is true that asCharBuffer() expects the ByteBuffer to be already UTF-16 encoded. No further encoding conversion is performed. You can run an experiment using the code below.

First, create a plain text file called test.txt with a few lines.

Hello World
Hi Moon
Howdy Jupiter

Most editors will save this file as UTF-8 by default. We expect this to be a problem, since asCharBuffer() combines every two consecutive bytes into one char and will give you garbage values. Later, we will fix the issue.

The following code simply dumps each character from the file. Note: it treats each pair of bytes as one character.

import java.io.RandomAccessFile;
import java.nio.*;
import java.nio.channels.FileChannel;

public class Main {
    public static void main(String[] args) {
        try (var file = new RandomAccessFile("test.txt", "r")) {
            // Map the whole file into memory as raw bytes.
            var mappedMemory = file.getChannel()
                    .map(FileChannel.MapMode.READ_ONLY, 0, file.length());
            // View the bytes as chars: every two bytes become one char, no decoding.
            var buff = mappedMemory.asCharBuffer();

            for (int i = 0; i < buff.length(); ++i) {
                var ch = buff.get(i);

                System.out.print(ch);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

When you run the code you will see unexpected characters:

䡥汬漠坯牬搊䡩⁍潯渊䡯睤礠䩵灩瑥爊

Now, let's encode the same file using UTF-16.

iconv -f utf-8 -t utf-16 test.txt > test-fixed.txt

Change the Java code to read test-fixed.txt, then run it again.

Now, you will see the right output.

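Note that test-fixed.txt will usually start with a BOM. asCharBuffer() does not strip it: the BOM is read as the character U+FEFF, which most terminals render as nothing, so it only looks skipped. The output is also only correct when the file's byte order matches the buffer's byte order, which is big-endian by default. If iconv wrote little-endian UTF-16 on your platform, call order(ByteOrder.LITTLE_ENDIAN) on the mapped buffer before asCharBuffer().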

Upvotes: 0

Gili

Reputation: 90023

Looking at the jdk7 sources under jdk/src/share/classes/java/nio:

  1. X-Buffer.java.template maps ByteBuffer.allocate() to Heap-X-Buffer.java.template
  2. Heap-X-Buffer.java.template maps ByteBuffer.asCharBuffer() to ByteBufferAs-X-Buffer.java.template
  3. ByteBuffer.asCharBuffer().toString() invokes CharBuffer.put(CharBuffer) but I can't figure out where this leads

Eventually this probably leads to Bits.makeChar() which is defined as:

static private char makeChar(byte b1, byte b0) {
    return (char)((b1 << 8) | (b0 & 0xff));
}

but I can't figure out how.
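One way to check that guess without tracing the templates is to compare what asCharBuffer() returns against the same arithmetic. This is only a sketch: the class name MakeCharDemo and the helper firstChar are made up for illustration, and the local makeChar is a copy of the method quoted above, not a call into the JDK's private Bits class.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class MakeCharDemo {
    // Copy of the arithmetic quoted above: b1 is the high byte, b0 the low byte.
    static char makeChar(byte b1, byte b0) {
        return (char) ((b1 << 8) | (b0 & 0xff));
    }

    // Hypothetical helper: reads the first char through asCharBuffer() in the given byte order.
    static char firstChar(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).asCharBuffer().get(0);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x00, 0x41};

        // Big-endian (the default order): the first byte is the high byte.
        System.out.println(firstChar(bytes, ByteOrder.BIG_ENDIAN) == makeChar(bytes[0], bytes[1]));    // true

        // Little-endian: the same arithmetic with the two bytes swapped.
        System.out.println(firstChar(bytes, ByteOrder.LITTLE_ENDIAN) == makeChar(bytes[1], bytes[0])); // true
    }
}
```

If both comparisons print true, the view is doing nothing more than joining byte pairs in the buffer's byte order, which is consistent with no Charset being involved.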

Upvotes: 0

Voo

Reputation: 30216

As I understand it, it doesn't use any Charset. It just assumes the bytes already hold Java's own char representation, which means UTF-16. This can be shown by looking at the source for HeapByteBuffer, where the returned CharBuffer finally calls (little endian version):

static private char makeChar(byte b1, byte b0) {
    return (char)((b1 << 8) | (b0 & 0xff));
}

So the only thing that is handled here is the endianness; for the rest you're responsible. That also means it's usually much more useful to use a CharsetDecoder, where you can specify the encoding.
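To make the difference concrete, here is a minimal sketch that runs the same UTF-8 bytes through a CharsetDecoder and through asCharBuffer(). The class name DecodeDemo and the helpers decode and reinterpret are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decodes the bytes with an explicit Charset; malformed input is reported as an exception.
    static String decode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("bytes are not valid " + charset, e);
        }
    }

    // Reinterprets the bytes as chars, two bytes per char, with no decoding at all.
    static String reinterpret(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "Hi Moon!".getBytes(StandardCharsets.UTF_8); // 8 bytes

        System.out.println(decode(utf8, StandardCharsets.UTF_8)); // Hi Moon!
        System.out.println(reinterpret(utf8).length());           // 4
    }
}
```

The decoder recovers the original text, while the char view just halves the byte count and produces four unrelated characters.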

Upvotes: 2

Petteri H

Reputation: 12222

For the first question: I believe it uses the native character encoding of Java (UTF-16).
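A small round trip supports this, at least for a heap buffer in its default big-endian order. The sketch below is an illustration, not a statement about every ByteBuffer implementation; the class name Utf16RoundTrip and both helper methods are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16RoundTrip {
    // Encodes the text as UTF-16BE (no BOM) and reads it back through the char view.
    static String viaCharBuffer(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    // Writes the text through the char view and returns the underlying bytes.
    static byte[] writeThroughView(String text) {
        ByteBuffer bytes = ByteBuffer.allocate(text.length() * 2);
        bytes.asCharBuffer().put(text);
        return bytes.array();
    }

    public static void main(String[] args) {
        System.out.println(viaCharBuffer("Hello World").equals("Hello World")); // true

        byte[] expected = "abc".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.equals(writeThroughView("abc"), expected));   // true
    }
}
```

Both directions agree with UTF-16BE, which matches the view being a plain reinterpretation of byte pairs as Java chars.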

Upvotes: 4
