BAMF4bacon

Reputation: 603

Most efficient way to typecast or convert a long or int to a 4-char String

My goal is to conserve space in my data store, which only accepts Strings.

Because a String in Java is a sequence of 16-bit chars, I figure that in theory I should be able to convert my 8-byte long into a 4-char String, as both are represented by 8 bytes. (To be clear, I am not interested in making my long integer human-readable in base 10; I want to store it in as short a String as possible.)
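At the bit level, the packing I have in mind is something like the sketch below (packLong and unpackLong are just names for illustration; whether my data store preserves arbitrary char values, including unpaired surrogates, is part of what I am unsure about):

static String packLong(long value) {
    char[] chars = new char[4];
    for (int i = 0; i < 4; i++) {
        // take bits 63..48, 47..32, 31..16, 15..0 as one 16-bit char each
        chars[i] = (char) (value >>> (48 - 16 * i));
    }
    return new String(chars);
}

static long unpackLong(String s) {
    long value = 0;
    for (int i = 0; i < 4; i++) {
        value = (value << 16) | s.charAt(i); // char is unsigned, so no sign-extension
    }
    return value;
}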

However, almost all the literature I have found on this is about converting to the 8-bit byte type, not the type char.

I could encode as UTF-8, but I am concerned this would double the length of the String, as each 8-bit byte would be stored as a 16-bit char. That would defeat my whole purpose for compacting my data into a 64-bit medium in the first place.

private static final Charset UTF8_CHARSET = Charset.forName("UTF-8");
new String(ByteBuffer.allocate(8).putLong(value).array(), UTF8_CHARSET);

Is my concern correct that I would be wasting space, and if so, is there a way to not waste space?

Upvotes: 0

Views: 186

Answers (1)

user177800

char != int

Q: Are there any byte sequences that are not generated by a UTF? How should I interpret them?

A: None of the UTFs can generate every arbitrary byte sequence. For example, in UTF-8 every byte of the form 110xxxxx₂ must be followed with a byte of the form 10xxxxxx₂. A sequence such as <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When faced with this illegal byte sequence while transforming or interpreting, a UTF-8 conformant process must treat the first byte 110xxxxx₂ as an illegal termination error: for example, either signaling an error, filtering the byte out, or representing the byte with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue processing at the second byte 0xxxxxxx₂.

A conformant process must not interpret illegal or ill-formed byte sequences as characters, however, it may take error recovery actions. No conformant process may use irregular byte sequences to encode out-of-band information.
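Here is a small sketch of what that means for the approach in the question: decoding the raw bytes of a long as UTF-8 substitutes U+FFFD for the ill-formed sequences, so the original value cannot be recovered (the class and variable names are just for the example):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        long value = 0xCAFEBABEDEADBEEFL; // every byte is >= 0x80
        byte[] raw = ByteBuffer.allocate(8).putLong(value).array();

        String s = new String(raw, StandardCharsets.UTF_8);  // ill-formed bytes -> U+FFFD
        byte[] back = s.getBytes(StandardCharsets.UTF_8);    // U+FFFD re-encodes as 3 bytes

        System.out.println(s.length());  // not 4; the string is full of '\uFFFD'
        System.out.println(back.length); // longer than the original 8 bytes
    }
}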

String != byte[] && char != int

Internally, String objects are Unicode text encoded as UTF-16, no matter what their source is.

How is text represented in the Java platform?

The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.

String is internally represented by UTF-16
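Both facts are easy to confirm (a minimal demonstration; the class name is just for the example):

public class Utf16Units {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, a single code point outside the BMP
        System.out.println(s.length());                      // 2 -- UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 -- actual characters
        System.out.println((int) Character.MAX_VALUE);       // 65535 -- char is 16 bits
    }
}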

Character encodings like UTF-8 only come into play when interpreting or converting to/from a byte[].

Even if you write a custom CharsetProvider, all that will do is encode/decode a byte[] externally; it will absolutely not change the fact that a String is internally represented by UTF-16, so what you want to do is kind of pointless.
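In other words (a minimal illustration), encoding only changes the external byte[]; the String and its length are untouched:

import java.nio.charset.StandardCharsets;

public class ExternalOnly {
    public static void main(String[] args) {
        String s = "\u00E9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 2 bytes externally
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1 byte externally
        System.out.println(s.length()); // always 1 char (one UTF-16 code unit) internally
    }
}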

Can't be done

A Unicode code point is actually a number up to 21 bits wide (commonly held in a 32-bit int); a Charset is just an encoding of that number into bytes. UTF-8 uses 1, 2, 3 or 4 bytes per code point, for example, and UTF-16 uses 2 or 4 bytes: code points above U+FFFF become a surrogate pair, two 16-bit units whose reserved values mark them as halves of a single character.
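You can check those widths yourself (a small sketch; UTF-16BE is used here only to avoid the byte-order mark the plain UTF-16 encoder prepends):

import java.nio.charset.StandardCharsets;

public class EncodingWidths {
    public static void main(String[] args) {
        // U+0041, U+00E9, U+20AC, U+1F600
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("UTF-8: %d bytes, UTF-16: %d bytes%n",
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length);
        }
        // UTF-8 widths: 1, 2, 3, 4 -- UTF-16 widths: 2, 2, 2, 4
    }
}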

Upvotes: 2
