Reputation: 148
I have a varbinary(max) column with UTF-8-encoded text that has been compressed. I would like to decompress this data and work with it in T-SQL as a varchar(max) using the UTF-8 capabilities of SQL Server.
I'm looking for a way of specifying the encoding when converting from varbinary(max) to varchar(max). The only way I've managed to do that is by creating a table variable with a column with a UTF-8 collation and inserting the varbinary data into it.
DECLARE @rv TABLE(
Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO @rv
SELECT SUBSTRING(Decompressed, 4, DATALENGTH(Decompressed) - 3) WithoutBOM
FROM
(SELECT DECOMPRESS(RawResource) AS Decompressed FROM Resource) t
I'm wondering if there is a more elegant and efficient approach that does not involve inserting into a table variable.
UPDATE:
Boiling this down to a simple example that doesn't deal with byte order marks or compression:
I have the string "Hello 😊" UTF-8 encoded without a BOM stored in variable @utf8Binary
DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A
Now I try to assign that into various char-based variables and print the result:
DECLARE @brokenVarChar varchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenVarChar = ' + @brokenVarChar
DECLARE @brokenNVarChar nvarchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenNVarChar = ' + @brokenNVarChar
DECLARE @rv TABLE(
Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
INSERT INTO @rv
select @utf8Binary
DECLARE @working nvarchar(max)
Select TOP 1 @working = Res from @rv
print '@working = ' + @working
The results of this are:
@brokenVarChar = Hello ðŸ˜Å
@brokenNVarChar = Hello ðŸ˜Å
@working = Hello 😊
So I am able to get the binary result properly decoded using this indirect method, but I am wondering if there is a more straightforward (and likely efficient) approach.
Upvotes: 13
Views: 6915
Reputation: 8049
You didn't say how your data is compressed or what compression algorithm was used. But if you are using the COMPRESS
function in SQL Server 2016 or later, you can use the DECOMPRESS
function and then cast it to a VARCHAR(MAX)
. Both COMPRESS
and DECOMPRESS
use the GZip compression algorithm.
This function will decompress an input expression value, using the GZIP algorithm. DECOMPRESS
will return a byte array (VARBINARY(MAX)
type).
CAST(DECOMPRESS([compressed content here]) AS VARCHAR(MAX))
See: COMPRESS (Transact-SQL) and DECOMPRESS (Transact-SQL)
Upvotes: -2
Reputation: 6808
DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A;
DECLARE @brokenNVarChar nvarchar(max) = concat(@utf8Binary, '' COLLATE Latin1_General_100_CI_AS_SC_UTF8);
print '@brokenNVarChar = ' + @brokenNVarChar;
Upvotes: 6
Reputation: 67341
There is an undocumented hack:
DECLARE @utf8 VARBINARY(MAX)=0x48656C6C6F20F09F988A;
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utf8,']]>') AS XML)
.value('.','nvarchar(max)');
The result
Hello 😊
This works even in versions without the new UTF8 collations...
UPDATE: calling this as a function
This can easily be wrapped in a scalar function
CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
RETURN
(
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
.value('.','nvarchar(max)')
);
END
GO
Or like this as an inlined table valued function
CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS TABLE
AS
RETURN
SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
.value('.','nvarchar(max)') AS ConvertedString
GO
This can be used after FROM
or - more appropriate - with APPLY
Upvotes: 10
Reputation: 96027
I don't like this solution, but it's one I got to (I initially thought it wasn't working, due to what appears to be a bug in ADS). One method would be to create a new database in a UTF8 collation, and then pass the value to a function in that database. As the database is in a UTF8 collation, the default collation will be different to the local one, and the correct result will be returned:
CREATE DATABASE UTF8 COLLATE Latin1_General_100_CI_AS_SC_UTF8;
GO
USE UTF8;
GO
CREATE OR ALTER FUNCTION dbo.Bin2UTF8 (@utfbinary varbinary(MAX))
RETURNS varchar(MAX) AS
BEGIN
RETURN CAST(@utfbinary AS varchar(MAX));
END
GO
USE YourDatabase;
GO
SELECT UTF8.dbo.Bin2UTF8(0x48656C6C6F20F09F988A);
This, however, isn't particularly "pretty".
Upvotes: 7