Reputation: 2515

Encoding binary into unicode

I have a byte array that I need to store into a nvarchar DB column. A nvarchar takes 2 bytes. What is the optimal encoding?

Ideally I would store N bytes into a nvarchar of lenght N/2, but there is invalid unicode sequences that worries me.

Upvotes: 0

Answers (3)

ximo

Reputation: 390

Have a look at tables 3-6 and 3-7 in the Unicode spec (version 16):

Table 3-6. UTF-8 Bit Distribution

Scalar Value	First Byte	Second Byte	Third Byte	Fourth Byte
00000000 0xxxxxxx	0xxxxxxx
00000yyy yyxxxxxx	110yyyyy	10xxxxxx
zzzzyyyy yyxxxxxx	1110zzzz	10yyyyyy	10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx	11110uuu	10uuzzzz	10yyyyyy	10xxxxxx

Table 3-6 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of Unicode scalar values corresponding to one-, two-, three-, and four-byte sequences.

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points	First Byte	Second Byte	Third Byte	Fourth Byte
U+0000..U+007F	00..7F
U+0080..U+07FF	C2..DF	80..BF
U+0800..U+0FFF	E0	A0..BF	80..BF
U+1000..U+CFFF	E1..EC	80..BF	80..BF
U+D000..U+D7FF	ED	80..9F	80..BF
U+E000..U+FFFF	EE..EF	80..BF	80..BF
U+10000..U+3FFFF	F0	90..BF	80..BF	80..BF
U+40000..U+FFFFF	F1..F3	80..BF	80..BF	80..BF
U+100000..U+10FFFF	F4	80..8F	80..BF	80..BF

Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is well-formed in that position. Any byte value outside of the ranges listed is ill-formed.

In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.

As long as you stay within those limitations, I believe you should be fine. If you only use it to store binary data that won't be displayed or exchanged with other systems, you don't have to worry about noncharacters, control characters and just weird characters that can mess things up. They would still be valid if you happened to produce them.

Upvotes: 0

Giacomo Catenazzi

Reputation: 9523

Usually Base64 is a good way, but you may use just Unicode code points.

Unicode codepoints go from 0 to 10FFFF, but you can encode easily and efficiently 2 bytes and an half into a Unicode code point. Depending on your requirements, you may shift all codepoints by 128, so that you have ASCII for boundaries (and you do not need to worry about byte 0, and still you have enough code points for the 20bit binary data (per code point). [Or maybe just escape 0 as 0x10000]

This is generic, for Unicode (so generic Unicode). If you know the encoding (e.g. UTF-8, you may choose different encoding).

Upvotes: 1

CodeCaster

Reputation: 151594

The most optimal solution would be to store binary in a binary column. So you mean the most optimal encoding within the constraints of this suboptimal scenario?

Just go for base64, it's safe.

If you can't control the input bytes, you're bound to running into encoding problems sooner or later.

Upvotes: 2

Encoding binary into unicode

Answers (3)

Table 3-6. UTF-8 Bit Distribution

Table 3-7. Well-Formed UTF-8 Byte Sequences

Related Questions