milan
milan

Reputation: 2515

Encoding binary into unicode

I have a byte array that I need to store into a nvarchar DB column. A nvarchar takes 2 bytes. What is the optimal encoding?

Ideally I would store N bytes into a nvarchar of lenght N/2, but there is invalid unicode sequences that worries me.

Upvotes: 0

Views: 200

Answers (3)

ximo
ximo

Reputation: 390

Have a look at tables 3-6 and 3-7 in the Unicode spec (version 16):

Table 3-6. UTF-8 Bit Distribution
Scalar Value First Byte Second Byte Third Byte Fourth Byte
00000000 0xxxxxxx 0xxxxxxx
00000yyy yyxxxxxx 110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

Table 3-6 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of Unicode scalar values corresponding to one-, two-, three-, and four-byte sequences.

Table 3-7. Well-Formed UTF-8 Byte Sequences
Code Points First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF

Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is well-formed in that position. Any byte value outside of the ranges listed is ill-formed.

In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.

As long as you stay within those limitations, I believe you should be fine. If you only use it to store binary data that won't be displayed or exchanged with other systems, you don't have to worry about noncharacters, control characters and just weird characters that can mess things up. They would still be valid if you happened to produce them.

Upvotes: 0

Giacomo Catenazzi
Giacomo Catenazzi

Reputation: 9523

Usually Base64 is a good way, but you may use just Unicode code points.

Unicode codepoints go from 0 to 10FFFF, but you can encode easily and efficiently 2 bytes and an half into a Unicode code point. Depending on your requirements, you may shift all codepoints by 128, so that you have ASCII for boundaries (and you do not need to worry about byte 0, and still you have enough code points for the 20bit binary data (per code point). [Or maybe just escape 0 as 0x10000]

This is generic, for Unicode (so generic Unicode). If you know the encoding (e.g. UTF-8, you may choose different encoding).

Upvotes: 1

CodeCaster
CodeCaster

Reputation: 151594

The most optimal solution would be to store binary in a binary column. So you mean the most optimal encoding within the constraints of this suboptimal scenario?

Just go for base64, it's safe.

If you can't control the input bytes, you're bound to running into encoding problems sooner or later.

Upvotes: 2

Related Questions