Reputation: 423
Some analysis of a Java application showed that it's spending a lot of time decoding UTF-8 byte arrays into String objects. The stream of UTF-8 bytes comes from an LMDB database whose values are Protobuf messages, which is why there is so much UTF-8 decoding. Another problem this causes is that Strings take up a large chunk of memory, because every read copies data out of the memory map into a new String object in the JVM.
I want to refactor this application so it does not allocate a new String every time it reads a message from the database. I want the underlying char array in the String object to simply point to the memory location.
package testreflect;

import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class App {
    public static void main(String[] args) throws Exception {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe UNSAFE = (Unsafe) field.get(null);

        char[] sourceChars = new char[] { 'b', 'a', 'r', 0x2018 };

        // Encoding to a byte array; asBytes would be an LMDB entry
        byte[] asBytes = new byte[sourceChars.length * 2];
        UNSAFE.copyMemory(sourceChars,
                UNSAFE.arrayBaseOffset(sourceChars.getClass()),
                asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                sourceChars.length * (long) UNSAFE.arrayIndexScale(sourceChars.getClass()));

        // Copying the byte array to the char array works, but is there a way to
        // have the char array simply point to the byte array without copying?
        char[] test = new char[sourceChars.length];
        UNSAFE.copyMemory(asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                test,
                UNSAFE.arrayBaseOffset(test.getClass()),
                asBytes.length * (long) UNSAFE.arrayIndexScale(asBytes.getClass()));

        // Allocate a String object, but set its underlying char array manually
        // to avoid the extra memory copy (relies on the pre-Java-9 String layout,
        // where the "value" field is a char[])
        long stringOffset = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
        String stringTest = (String) UNSAFE.allocateInstance(String.class);
        UNSAFE.putObject(stringTest, stringOffset, test);
        System.out.println(stringTest);
    }
}
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using sun.misc.Unsafe. This should reduce the amount of CPU time the application wastes decoding UTF-8 bytes.
However, this does not solve the memory problem. Is there a way to have a char array point directly to a memory location and avoid the allocation altogether? Skipping the copy would cut down on the unnecessary allocations the JVM makes for these strings, leaving more room for the OS to cache entries from the LMDB database.
Upvotes: 0
Views: 1361
Reputation: 719596
I think you are taking the wrong approach here.
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using sun.misc.Unsafe. This should reduce the amount of CPU time the application wastes decoding UTF-8 bytes.
Erm ... no.
Using a memory copy to copy from a byte[] to a char[] is not going to work. Each char in the destination char[] will actually contain 2 bytes from the original. If you then try to wrap that char[] as a String, you will get a weird kind of mojibake.

What a real UTF-8 to String conversion does is convert between 1 and 4 bytes (code units) representing a UTF-8 code point into 1 or 2 16-bit code units representing the same code point in UTF-16. That cannot be done with a plain memory copy.
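As a concrete illustration (my own snippet, reusing the sample text from the question's code): the 4-character string needs 6 UTF-8 bytes, so no fixed-stride copy can line the two encodings up.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Widths {
    public static void main(String[] args) {
        String text = "bar\u2018";                            // 4 UTF-16 code units
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);  // 'b','a','r' take 1 byte each,
                                                              // U+2018 takes 3 bytes
        System.out.println(text.length());                    // 4
        System.out.println(utf8.length);                      // 6
        System.out.println(Arrays.toString(utf8));            // [98, 97, 114, -30, -128, -104]
    }
}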
If you aren't familiar with it, it would be worth reading the Wikipedia article on UTF-8 so that you understand how the text is encoded.
The solution depends on what you intend to do with the text data.
If the data must really be in the form of String (or StringBuilder or char[]) objects, then you really have no choice but to do the full conversion. Try anything else and you are liable to mess up; e.g. garbled text and potential JVM crashes.
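If the per-message String churn is the main cost, one low-garbage way to do the full conversion is to reuse a per-thread CharsetDecoder and decode straight from a buffer view of the value. This is my own sketch, not something from the question; the helper name and the assumption that your LMDB binding can expose the value as a ByteBuffer are mine.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Decoding {
    // One reusable decoder per thread; CharsetDecoder is not thread-safe.
    private static final ThreadLocal<CharsetDecoder> DECODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT));

    // Decodes a value buffer (e.g. a slice over the memory-mapped LMDB value)
    // without first copying it into a byte[].
    static CharBuffer decode(ByteBuffer utf8Value) throws CharacterCodingException {
        return DECODER.get().decode(utf8Value);  // full decode; resets the decoder internally
    }

    public static void main(String[] args) throws Exception {
        ByteBuffer value = ByteBuffer.wrap("bar\u2018".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(value));       // bar‘
    }
}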
If you want something that is "string like", you could conceivably implement a custom implementation of CharSequence that wraps the bytes in the messages and decodes the UTF-8 on the fly. But doing that efficiently may be a problem, especially implementing the charAt method as an O(1) method; see the sketch below.
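Here is a minimal sketch of that idea (my own illustration; the class name Utf8CharSequence and the lazy-caching strategy are assumptions, not something from the question). It dodges the O(1) charAt problem by decoding once on first use and caching the result, so the raw bytes stay the primary representation, but character-level access still pays for one full decode.

import java.nio.charset.StandardCharsets;

final class Utf8CharSequence implements CharSequence {
    private final byte[] utf8;   // could also hold an offset/length into a larger buffer
    private String decoded;      // lazily populated cache

    Utf8CharSequence(byte[] utf8) {
        this.utf8 = utf8;
    }

    // Decode on first char-level access only, then reuse the cached String.
    private String materialize() {
        String s = decoded;
        if (s == null) {
            s = new String(utf8, StandardCharsets.UTF_8);
            decoded = s;
        }
        return s;
    }

    @Override public int length() { return materialize().length(); }
    @Override public char charAt(int index) { return materialize().charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return materialize().subSequence(start, end);
    }
    @Override public String toString() { return materialize(); }
}

A fully streaming implementation would have to maintain a byte-to-char index mapping to keep charAt cheap without caching, which is exactly where the efficiency problem mentioned above comes in.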
If you simply want to hold and/or compare the (entire) texts, this could possibly be done by representing them as byte[] objects. These operations can be performed on the UTF-8 encoded data directly; see the example below.
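For instance (my own illustration; Arrays.compareUnsigned requires Java 9+), equality and ordering can be computed on the encoded bytes without any decoding, since unsigned byte-wise comparison of UTF-8 agrees with code-point order.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Compare {
    public static void main(String[] args) {
        byte[] a = "bar\u2018".getBytes(StandardCharsets.UTF_8);
        byte[] b = "bar\u2018".getBytes(StandardCharsets.UTF_8);
        byte[] c = "baz".getBytes(StandardCharsets.UTF_8);

        // Equality needs no decoding at all.
        System.out.println(Arrays.equals(a, b));                // true

        // Unsigned byte-wise comparison of UTF-8 preserves code-point order,
        // so sorting and range scans can also stay in byte[] form.
        System.out.println(Arrays.compareUnsigned(a, c) < 0);   // true: 'r' < 'z'
    }
}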
If the input text could actually be sent in a character encoding with a fixed 8-bit character size (e.g. ASCII, Latin-1, etc.) or as UTF-16, that simplifies things.
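As a sketch of why that helps (my own example): with a fixed 8-bit encoding each byte maps to exactly one char, so decoding is just a widening copy, and UTF-16 input really could be block-copied into a char[] (byte order permitting).

import java.nio.charset.StandardCharsets;

public class FixedWidthDecode {
    public static void main(String[] args) {
        byte[] latin1 = { 'b', 'a', 'r', (byte) 0xE9 };         // "baré" in Latin-1

        // Library route: a 1:1 byte-to-char decode, no variable-length logic.
        String viaCharset = new String(latin1, StandardCharsets.ISO_8859_1);

        // Equivalent manual widening, byte -> char:
        char[] chars = new char[latin1.length];
        for (int i = 0; i < latin1.length; i++) {
            chars[i] = (char) (latin1[i] & 0xFF);
        }

        System.out.println(viaCharset + " " + new String(chars));  // baré baré
    }
}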
Upvotes: 3