Przemyslaw Remin
Przemyslaw Remin

Reputation: 6940

Convert UTF <U+something> to nvarchar

I have this strange storage of UTF string:

<U+0410><U+043B><U+044C><U+043A><U+0430>

How can I convert it back to nvarchar? The string above should be converted to Алька

Update. Here is more sample data:

+-------------------------------------------------------------------------------------+-----------------+
|                                   Column1_encoded                                   | Column1_decoded |
+-------------------------------------------------------------------------------------+-----------------+
| <U+0410><U+043B><U+044C><U+043A><U+0430>                                            | Алька           |
| ABC <U+0410><U+043B><U+044C><U+043A><U+0430> 1                                      | ABC Алька 1     |
| <U+0410><U+043B> 2 <U+044C><U+043A><U+0430>                                         | Ал 2 ька        |
| <U+0410><U+043B><U+044C><U+043A><U+0430> 3 <U+0410><U+043B><U+044C><U+043A><U+0430> | Алька 3 Алька   |
+-------------------------------------------------------------------------------------+-----------------+

I got this strange format while sending data from Power BI to SQL Server via R transform exactly this way: https://stackoverflow.com/a/51386029/1903793

The answer of Jeroen Mostert in comment seem to handle it. Thank you.

Upvotes: 0

Views: 263

Answers (1)

iamdave
iamdave

Reputation: 12243

In order to use this across multiple column values you will need to convert it to a table valued function and call it via cross apply, though I am sure you can manage that yourself. Explanation is in the comments:

declare @str nvarchar(1000) = '<U+0410><U+043B><U+044C><U+043A><U+0430> This is a string with <U+0410><U+043B><U+044C><U+043A><U+0430> not encoded as we would like <U+0410><U+043B><U+044C><U+043A><U+0430>';

-- Add an additional > character before the first < character to act as the first delimiter
-- and then insert a delimiting > character before any instances of a < chracter that follow a space to ensure the character code is properly parsed out.
select @str = replace(stuff(@str,charindex('<',@str,1),0,'>'),' <',' ><');

                -- Start tally table with 10 rows.
with n(n)   as (select n from (values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(n))
                -- Select the same number of rows as characters in @str as incremental row numbers.
                -- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
    ,t(t)   as (select top (select len(@str) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
                -- Return the position of every value that follows the specified delimiter.
    ,s(s)   as (select 1 union all select t+1 from t where substring(@str,t,1) = '>')
                -- Return the start and length of every value, to use in the SUBSTRING function.
                -- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
    ,l(s,l) as (select s,isnull(nullif(charindex('>',@str,s),0)-s,4000) from s)
    ,r      as (select rn as ItemNumber
                    ,Item
                from(select row_number() over(order by s) as rn
                            ,substring(@str,s,l) as item
                    from l
                    ) a
                where Item <> ''
                )
select cast((select case when left(Item,3) = '<U+'  -- Where required, convert the Unicode number into a character using the NCHAR function
                        then nchar(convert(nvarchar(500),convert(int,(convert(varbinary(max),replace(Item,'<U+','0x0000'),1)))))
                        else Item
                        end
            from r
            order by ItemNumber
            for xml path('')
            ) as nvarchar(max)) as String;

Output:

+----------------------------------------------------------------------+
| String                                                               |
+----------------------------------------------------------------------+
| Алька This is a string with Алька not encoded as we would like Алька |
+----------------------------------------------------------------------+

Upvotes: 1

Related Questions