Kris van der Mast
Kris van der Mast

Reputation: 16613

T-SQL Transform unicode to emoji

I have a sequence of unicode like this:

\U0001F468\U0000200D\U0001F466

How to transform that in T-SQL to the corresponding emoji šŸ‘Øā€šŸ‘¦ and insert the emoji in an nvchar type column?

Upvotes: 3

Views: 1160

Answers (1)

iamdave
iamdave

Reputation: 12243

Hooooookay so I know hardly anything about this type of thing, but this was a fun distraction from work that I hope helps you get where you need to be.

That emoji character, as you can see from your unicode references is actually three characters all mashed together. The first is for MAN, the second for ZERO WIDTH JOINER and the third is BOY. The effect of the zero width joiner is to make the other two characters act as one when moving around the page or selecting text. You can see this falling down in any text editor that isn't supportive (such as SSMS) in which your cursor will 'stall' between the MAN and BOY characters for one directional key press.

So, in order to answer your question I have assumed that either all your unicode values are sequences of three and that the middle one is a joiner or, if this is not the case, that you will be able to work things out on your own from here.

Starting at this very informative Stack Overflow answer you will see that SQL Server has a bit of an incomplete handling of supplementary characters. Consequently, you need to either change your database collation or give it a helping hand, namely in letting it know whether or not you need to break down the unicode character into two nchar characters or not. Because I am assuming your sequences are all Emoji-Joiner-Emoji this isn't too much of a problem for me, but may be for you.

Firstly, we need to split your character sequence into its constituent parts, for which I am using a tally table string splitting function based on Jeff Moden's:

create function [dbo].[StringSplit]
(
    @str nvarchar(4000) = ' '               -- String to split.
    ,@delimiter as nvarchar(1) = ','        -- Delimiting value to split on.
    ,@num as int = null                     -- Which value to return.
)
returns @results table(ItemNumber int, Item nvarchar(4000))
as
begin
    declare @return nvarchar(4000);
    -- Handle null @str values
    select @str = case when len(isnull(@str,'')) = 0 then '' else @str end;
                    -- Start tally table with 10 rows.
    with n(n)   as (select n from (values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(n))
                    -- Select the same number of rows as characters in @str as incremental row numbers.
                    -- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
        ,t(t)   as (select top (select len(@str) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
                    -- Return the position of every value that follows the specified delimiter.
        ,s(s)   as (select 1 union all select t+1 from t where substring(@str,t,1) = @delimiter)
                    -- Return the start and length of every value, to use in the SUBSTRING function.
                    -- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
        ,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,@str,s),0)-s,4000) from s)
    insert into @results
    select rn as ItemNumber
            ,Item
    from(select row_number() over(order by s) as rn
                ,substring(@str,s,l) as item
        from l
        ) a
    where rn = @num
        or @num is null;
    return;
end

Using this function, we can split the unicode sequence into 3 parts and manually pivot the data into 3 columns. Following the explanation in the SO answer linked above, because the CodePoint value of the two emojis (calculated using the convert(int,(convert(varbinary(max),replace('<Your Uxxxxxxxxx unicode value>','U','0x'),1))) part of the script below) is between 65536 and 1114111 we need to find the High Surrogate and the Low Surrogate, but as this is not necessary for the Zero Width Joiner we just need the binary representation to pass to the nchar function (Note the lack of conversion to int):

declare @s nvarchar(50) = '\U0001F468\U0000200D\U0001F466';

select nchar(55232+(i1/1024)) + nchar(56320+(i1%1024))  -- MAN emoji
      +nchar(b2)                                        -- JOINER
      +nchar(55232+(i3/1024)) + nchar(56320+(i3%1024))  -- BOY emoji
      as Emoji
from(select convert(int,(convert(varbinary(max),replace(s1.Item,'U','0x'),1))) as i1
           ,convert(varbinary(max),replace(s2.Item,'U','0x'),1) as b2
           ,convert(int,(convert(varbinary(max),replace(s3.Item,'U','0x'),1))) as i3
     from stringsplit(@s,'\',2) as s1
         ,stringsplit(@s,'\',3) as s2
         ,stringsplit(@s,'\',4) as s3
    ) as a;

By concatenating all these nchar values together, we end up with the correct character representation of your emoji:

Output

Emoji
-----
šŸ‘Øā€šŸ‘¦

Upvotes: 3

Related Questions