Mark
Mark

Reputation: 656

base64 encode for chinese chars

I'm using both the following methods to encode in base 64 a Chinese string. Problem is that I'm having Pz8= as output, which decoded is ??

What's wrong with this and how can I fix it?

Method 1

CREATE FUNCTION [dbo].[base64Encode] (@input VARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @output NVARCHAR(MAX),
        @bits VARBINARY(3),
        @pos INT

    SET @pos = 1
    SET @output = ''

    WHILE @pos <= LEN(@input)
    BEGIN
        SET @bits = CONVERT(VARBINARY(3), SUBSTRING(@input, @pos, 3))
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 1, 1) / 4 + 1, 1)
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 1, 1) % 4 * 16 + SUBSTRING(@bits, 2, 1) / 16 + 1, 1)
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 2, 1) % 16 * 4 + SUBSTRING(@bits, 3, 1) / 64 + 1, 1)
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 3, 1) % 64 + 1, 1)
        SET @pos = @pos + 3
    END

    RETURN (LEFT(@output, LEN(@output) - 3 + LEN(@bits)) + REPLICATE('=', 3 - LEN(@bits)))
END

SELECT [dbo].[base64Encode]('你好')

Method 2

SELECT CAST('你好' as varbinary(max)) FOR XML PATH(''), BINARY BASE64

Upvotes: 0

Views: 1629

Answers (1)

Gottfried Lesigang
Gottfried Lesigang

Reputation: 67311

You are missing the N to mark a string literal as unicode:

 SELECT N'你好' AS unicode
       ,'你好' AS ASCII

Try this to get a base64 out of your chinese charcters and vice versa:

SELECT (SELECT CAST(N'你好' AS VARBINARY(MAX)) FOR XML PATH(''),TYPE).value(N'.','nvarchar(max)');

You get this base64 result: YE99WQ==

This is the way to re-convert the base64 to the original value

SELECT CAST(CAST('<x>' + 'YE99WQ==' + '</x>' AS XML).value('.','varbinary(max)') AS NVARCHAR(MAX));

UPDATE Some words about the re-encoding

base64 does not encode a string value, but the binary pattern a system uses to keep that string in memory (this is valid for any data type actually). The bit pattern of a string differs with UTF-8, UTF-16, ASCII whatever... And even worse there is BE and LE.

The steps to get base64 are:

  • Get the bit pattern of my value (a string, a date, a picture, any value actually)
  • compute the base64 for this bit pattern

The steps for the re-encoding are

  • Compute the original bit pattern which is hidden within the base64
  • Interpret the bit pattern as the original value

The very last step might bring up confusion... You have to know exactly which binary representation a system uses. You have to use exactly the same data type with exactly the same interpretation to get the values back.

With strings one has to know, that SQL-Server works with a very limited choice natively.

  • There is NVARCHAR (NCHAR), which is 2-byte encoded unicode in UCS-2 flavour (almost the same as utf-16)
  • And there is VARCHAR (CHAR), which is 1-byte encoded extended ASCII. All non-latin characters are bound to a code page within the connected collation. But this is not UTF-8!

Upvotes: 2

Related Questions