Reputation: 2515
I have a byte array that I need to store in an nvarchar DB column. Each nvarchar character takes 2 bytes. What is the optimal encoding?
Ideally I would store N bytes in an nvarchar of length N/2, but there are invalid Unicode sequences that worry me.
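A small sketch of that worry (mine, in Python, not the asker's): packing arbitrary byte pairs straight into UTF-16 code units can produce an unpaired surrogate, which is not well-formed Unicode.

```python
raw = bytes([0xD8, 0x00, 0x00, 0x41])  # 4 bytes -> 2 UTF-16 code units
try:
    raw.decode("utf-16-be")
except UnicodeDecodeError as e:
    print(e)  # 0xD800 is a high surrogate with no low surrogate following it
```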
Upvotes: 0
Views: 200
Reputation: 390
Have a look at Tables 3-6 and 3-7 in the Unicode Standard (version 16):
| Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
Table 3-6 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of Unicode scalar values corresponding to one-, two-, three-, and four-byte sequences.
| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | ***A0..BF*** | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | ***80..9F*** | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | ***90..BF*** | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | ***80..8F*** | 80..BF | 80..BF |
Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is well-formed in that position. Any byte value outside of the ranges listed is ill-formed.
In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.
As long as you stay within those limitations, I believe you should be fine. If you only use the column to store binary data that won't be displayed or exchanged with other systems, you don't have to worry about noncharacters, control characters, and other odd characters that could otherwise cause trouble; they would still be valid if you happened to produce them.
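A quick sketch of that check (mine, in Python, not part of the answer): a strict UTF-8 decoder enforces exactly the Table 3-7 ranges, so round-tripping through one is an easy way to confirm a byte sequence is well-formed.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    # Python's strict "utf-8" codec rejects every sequence outside Table 3-7.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf8(b"\xC2\x80"))      # True:  C2..DF then 80..BF
print(is_well_formed_utf8(b"\xE0\x80\x80"))  # False: second byte must be A0..BF
print(is_well_formed_utf8(b"\xED\xA0\x80"))  # False: would encode a surrogate
```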
Upvotes: 0
Reputation: 9523
Base64 is usually a good way, but you can also use raw Unicode code points.
Unicode code points go from U+0000 to U+10FFFF, so you can easily and efficiently encode two and a half bytes (20 bits) into a single code point. Depending on your requirements, you may shift all values by 128, so that the ASCII range stays free for boundaries (and you don't need to worry about byte 0), while still having enough code points for the 20 bits of binary data per code point. [Or maybe just escape 0 as 0x10000.] Note that values landing in the surrogate range U+D800..U+DFFF need to be remapped, since surrogates are not valid scalar values.
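A minimal sketch of that packing scheme (my own illustration in Python; the shift value, the surrogate bump, and the padding trailer are assumptions, not from the answer):

```python
SHIFT = 0x80            # keep the low ASCII range free for boundaries
SURROGATE_START = 0xD800
SURROGATE_SIZE = 0x800  # size of the surrogate block to skip over

def _to_cp(v: int) -> int:
    # Map a 20-bit value to a code point, bumping past the surrogate block.
    cp = v + SHIFT
    return cp + SURROGATE_SIZE if cp >= SURROGATE_START else cp

def _from_cp(cp: int) -> int:
    if cp >= SURROGATE_START + SURROGATE_SIZE:
        cp -= SURROGATE_SIZE
    return cp - SHIFT

def encode20(data: bytes) -> str:
    # Pad to a multiple of 5 bytes (two 20-bit groups); record the pad count.
    pad = (-len(data)) % 5
    data += b"\x00" * pad
    out = []
    for i in range(0, len(data), 5):
        chunk = int.from_bytes(data[i:i + 5], "big")  # 40 bits
        out.append(_to_cp(chunk >> 20))
        out.append(_to_cp(chunk & 0xFFFFF))
    out.append(_to_cp(pad))  # trailer code point: pad count 0..4
    return "".join(map(chr, out))

def decode20(text: str) -> bytes:
    vals = [_from_cp(ord(c)) for c in text]
    pad = vals.pop()
    out = bytearray()
    for i in range(0, len(vals), 2):
        chunk = (vals[i] << 20) | vals[i + 1]
        out += chunk.to_bytes(5, "big")
    return bytes(out[:len(out) - pad] if pad else out)

assert decode20(encode20(b"hello world")) == b"hello world"
```

One caveat: nvarchar counts UTF-16 code units, and code points above U+FFFF occupy two of them, so the effective density in nvarchar characters is lower than 2.5 bytes per character.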
This is generic, for any Unicode storage. If you know the actual encoding used (e.g. UTF-8), you may choose a different scheme.
Upvotes: 1
Reputation: 151594
The optimal solution would be to store binary data in a binary column. So you mean the optimal encoding within the constraints of this suboptimal scenario?
Just go for Base64; it's safe.
If you can't control the input bytes, you're bound to run into encoding problems sooner or later.
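A minimal Base64 round trip (a Python sketch; the function names are mine):

```python
import base64

def to_nvarchar(data: bytes) -> str:
    # Base64 output is pure ASCII, so it is always valid in an nvarchar column.
    return base64.b64encode(data).decode("ascii")

def from_nvarchar(text: str) -> bytes:
    return base64.b64decode(text)

assert from_nvarchar(to_nvarchar(b"\x00\xd8\xff")) == b"\x00\xd8\xff"
```

The trade-off is density: N bytes take roughly 4N/3 nvarchar characters, rather than the N/2 the question hoped for.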
Upvotes: 2