LiuYan 刘研

Reputation: 1624

How to decode a malformed hex string of an emoji character like "`1f1e81f1f3`"?

Suppose there is a hex string for an emoji character such as "1f1e81f1f3". It is a malformed concatenation of the hex code points that make up the emoji; it should really be two strings, 1f1e8 and 1f1f3.

I'm using org.apache.commons.codec.binary.Hex to decode the hex string, but Hex requires the length of the input string to be even, so I need to zero-pad the hex string, like "01f1e801f1f3".

Currently, I simply replace "1f" with "01f". So far so good, but an emoji glyph may consist of a sequence of Unicode characters, so this simple replacement is not reliable in general.

Background

This hex string is extracted from a "<span class="emoji emojiXXXXXXXXXX"></span>" string in a text message retrieved from a popular IM application via an unofficial HTTP API.

Upvotes: 1

Views: 452

Answers (1)

LiuYan 刘研

Reputation: 1624

I ended up writing a small function to restore the emoji characters.

Basic procedure:

  1. Set a pointer to the start of the hex string.
  2. Starting from the pointer position in the hex string:
    • If the remaining string starts with "1f", take the next 5 hex digits, prepend three zeroes ("000") to make 8 digits, store the result in a new hex string, and advance the pointer by 5. Otherwise, take the next 4 hex digits without padding, store them in a new hex string, and advance the pointer by 4.
    • Decode the new hex string to a byte array.
    • Create a new String from the byte array, using the UTF_32BE character encoding for the padded 8-digit case and UTF_16BE for the 4-digit case.
  3. Repeat step 2 until the end of the hex string is reached.
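
The steps above can also be sketched with only the JDK. This is a minimal variant, not the code from this answer: it parses each chunk as an integer code point and appends it with appendCodePoint instead of decoding bytes through a charset. The class and method names (EmojiHexSplitter, splitCodePoints, decode) are made up for illustration, and the input is assumed to contain only hex digits.

```java
import java.util.ArrayList;
import java.util.List;

class EmojiHexSplitter {
    // Splits a concatenated hex string into code points using the "1f" heuristic:
    // a chunk starting with "1f" is read as 5 hex digits, anything else as 4.
    static List<Integer> splitCodePoints(String hex) {
        List<Integer> codePoints = new ArrayList<>();
        int i = 0;
        while (i < hex.length()) {
            int width = hex.regionMatches(true, i, "1f", 0, 2) ? 5 : 4;
            int end = Math.min(i + width, hex.length()); // tolerate a short tail
            codePoints.add(Integer.parseInt(hex.substring(i, end), 16));
            i = end;
        }
        return codePoints;
    }

    // Rebuilds the emoji text from the code points found by the heuristic.
    static String decode(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int codePoint : splitCodePoints(hex)) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }
}
```

For example, EmojiHexSplitter.decode ("1f1e81f1f3") yields the two regional indicator symbols U+1F1E8 and U+1F1F3, which together render as a flag.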

It works, but it's not perfect. It could introduce a bug if both of the following hold:

  • One character of the emoji character sequence is a supplementary character (above U+FFFF), and
  • Its hex string does not start with "1f", or the length of its hex string is not 5.
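
To make the limitation concrete, here is a small JDK-only sketch of just the splitting heuristic (the class and method names are hypothetical). Whether the IM service actually emits such sequences in this concatenated form is an assumption; U+E0067 (TAG LATIN SMALL LETTER G), which appears in subdivision flag sequences, is used only as an example of a supplementary character whose hex string does not start with "1f".

```java
import java.util.ArrayList;
import java.util.List;

class HeuristicLimitDemo {
    // Same heuristic as described above: "1f" means 5 hex digits, otherwise 4.
    static List<String> splitChunks(String hex) {
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (i < hex.length()) {
            int width = hex.regionMatches(true, i, "1f", 0, 2) ? 5 : 4;
            int end = Math.min(i + width, hex.length()); // tolerate a short tail
            chunks.add(hex.substring(i, end));
            i = end;
        }
        return chunks;
    }
}
```

With this sketch, splitChunks ("1f1e81f1f3") gives [1f1e8, 1f1f3] as intended, but splitChunks ("1f3f4e0067") gives [1f3f4, e006, 7] instead of [1f3f4, e0067], so the second character is decoded incorrectly.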

Code snippet:

import java.nio.charset.*;
import java.util.regex.*;

import org.apache.commons.codec.*;
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.lang3.*;

// The enclosing class was not part of the original snippet; its name is arbitrary.
public class EmojiRestorer
{
public static final Charset UTF_32BE = Charset.forName ("UTF-32BE");
public static final String REGEXP_FindTransformedEmojiHexString = "<span class=\"emoji emoji(\\p{XDigit}+)\"></span>";
public static final Pattern PATTERN_FindTransformedEmojiHexString = Pattern.compile (REGEXP_FindTransformedEmojiHexString, Pattern.CASE_INSENSITIVE);
public static String RestoreEmojiCharacters (String sContent)
{
    boolean bMatched = false;
    Matcher matcher = PATTERN_FindTransformedEmojiHexString.matcher (sContent);
    StringBuffer sbReplace = new StringBuffer ();
    while (matcher.find ())
    {
        bMatched = true;
        String sEmojiHexString = matcher.group(1);
        // Collect every glyph of this <span> first, so appendReplacement is called once per match.
        StringBuilder sbEmoji = new StringBuilder ();
        try
        {
            for (int i=0; i<sEmojiHexString.length ();)
            {
                Charset charset = null;
                String sSingleEmojiGlyphHexString = null;
                String sStartString = StringUtils.substring (sEmojiHexString, i, i+2);
                if (StringUtils.startsWithIgnoreCase (sStartString, "1f"))
                {
                    // Supplementary character: pad 5 hex digits to 8 (4 bytes) and decode as UTF-32BE.
                    sSingleEmojiGlyphHexString = "000" + StringUtils.substring (sEmojiHexString, i, i+5);
                    i += 5;
                    charset = UTF_32BE;
                }
                else
                {
                    // BMP character: 4 hex digits (2 bytes), decode as UTF-16BE.
                    sSingleEmojiGlyphHexString = StringUtils.substring (sEmojiHexString, i, i+4);
                    i += 4;
                    charset = StandardCharsets.UTF_16BE;
                }
                byte[] arrayEmoji = Hex.decodeHex (sSingleEmojiGlyphHexString.toCharArray ());
                sbEmoji.append (new String (arrayEmoji, charset));
            }
        }
        catch (DecoderException e)
        {
            // Odd-length chunk: leave this <span> unchanged in the output.
            e.printStackTrace();
            continue;
        }
        matcher.appendReplacement (sbReplace, Matcher.quoteReplacement (sbEmoji.toString ()));
    }
    matcher.appendTail (sbReplace);

    if (bMatched)
        sContent = sbReplace.toString ();

    return sContent;
}
}

Upvotes: 0
