Reputation: 1624
Suppose there is a hex string for an emoji character such as "1f1e81f1f3". It is a malformed concatenation of the emoji's hex code points, and it should really be two strings: 1f1e8 and 1f1f3.
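To make the intended result concrete, here is a minimal sketch of what each of the two code points should turn into once they are split correctly. It uses only the JDK (`Integer.parseInt` and `Character.toChars`) rather than `Hex`, and the class and method names are hypothetical, chosen for illustration.

```java
class CodePointDemo {
    // Parse one hexadecimal code point (for example "1f1e8") into a Java String.
    // Code points above U+FFFF become a surrogate pair of two chars.
    static String fromHexCodePoint(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // The two regional indicator symbols from the example string.
        String emoji = fromHexCodePoint("1f1e8") + fromHexCodePoint("1f1f3");
        System.out.println(emoji);
    }
}
```

Parsing each code point as an integer sidesteps the even-length requirement, because no byte decoding is involved.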
I'm using org.apache.commons.codec.binary.Hex to decode the hex string, but Hex requires the input length to be even, so I need to zero-pad each code point, like "01f1e8" and "01f1f3".
Currently I simply replace "1f" with "01f". So far so good, but an emoji glyph may contain a sequence of Unicode characters, so
The hex string is extracted from a "<span class="emoji emojiXXXXXXXXXX"></span>" string in a text message retrieved from a popular IM application via an unofficial HTTP API.
Upvotes: 1
Views: 452
Reputation: 1624
I ended up writing a small function to restore the emoji characters.
Basic procedure: walk through the captured hex string with a pointer. If the substring at the current position starts with "1f", pad three zeroes before "1f", store the result in a new hex string, and advance the pointer by 5 positions. Otherwise, no zero padding is done: store the 4-character substring in a new hex string and advance the pointer by 4 positions.
It works, but it's not perfect. It could introduce bugs if a 5-digit code point does not start with "1f", or if a code point that starts with "1f" is not 5 hex digits long.
Code snippet:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;
import org.apache.commons.lang3.StringUtils;

// The members below belong inside a class of your choice.
public static final Charset UTF_32BE = Charset.forName ("UTF-32BE");
public static final String REGEXP_FindTransformedEmojiHexString = "<span class=\"emoji emoji(\\p{XDigit}+)\"></span>";
public static final Pattern PATTERN_FindTransformedEmojiHexString = Pattern.compile (REGEXP_FindTransformedEmojiHexString, Pattern.CASE_INSENSITIVE);

public static String RestoreEmojiCharacters (String sContent)
{
    boolean bMatched = false;
    Matcher matcher = PATTERN_FindTransformedEmojiHexString.matcher (sContent);
    StringBuffer sbReplace = new StringBuffer ();
    while (matcher.find ())
    {
        bMatched = true;
        String sEmojiHexString = matcher.group (1);
        // Collect every code point of this match before replacing it.
        StringBuilder sbEmoji = new StringBuilder ();
        try
        {
            for (int i=0; i<sEmojiHexString.length ();)
            {
                Charset charset;
                String sSingleEmojiGlyphHexString;
                String sStartString = StringUtils.substring (sEmojiHexString, i, i+2);
                if (StringUtils.startsWithIgnoreCase (sStartString, "1f"))
                {
                    // 5-digit code point: pad to 8 hex digits and decode as UTF-32BE
                    sSingleEmojiGlyphHexString = "000" + StringUtils.substring (sEmojiHexString, i, i+5);
                    i += 5;
                    charset = UTF_32BE;
                }
                else
                {
                    // 4-digit code point: decode as UTF-16BE
                    sSingleEmojiGlyphHexString = StringUtils.substring (sEmojiHexString, i, i+4);
                    i += 4;
                    charset = StandardCharsets.UTF_16BE;
                }
                byte[] arrayEmoji = Hex.decodeHex (sSingleEmojiGlyphHexString.toCharArray ());
                sbEmoji.append (new String (arrayEmoji, charset));
            }
            // appendReplacement may be called only once per find(); calling it per code point
            // fails for multi-code-point emoji. quoteReplacement guards '$' and '\'.
            matcher.appendReplacement (sbReplace, Matcher.quoteReplacement (sbEmoji.toString ()));
        }
        catch (DecoderException e)
        {
            // Malformed hex: the original <span> is left in place by the next append.
            e.printStackTrace();
        }
    }
    matcher.appendTail (sbReplace);
    if (bMatched)
        sContent = sbReplace.toString ();
    return sContent;
}
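For comparison, the same "1f" heuristic can be sketched with the JDK alone. This variant swaps the Hex byte decoding for `Integer.parseInt` plus `StringBuilder.appendCodePoint`, so no zero padding is needed. The class and method names are hypothetical, and it inherits the limitation described above: it assumes every 5-digit code point starts with "1f" and every other code point is exactly 4 hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmojiSpanSketch {
    // Same pattern as in the snippet above: the hex digits follow "emoji emoji".
    static final Pattern SPAN = Pattern.compile(
            "<span class=\"emoji emoji(\\p{XDigit}+)\"></span>",
            Pattern.CASE_INSENSITIVE);

    // Split the concatenated hex string into code points and build the text.
    // Heuristic (an assumption, not a general rule): a chunk starting with
    // "1f" is 5 hex digits long, any other chunk is 4 hex digits long.
    static String decode(String hex) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < hex.length()) {
            int width = hex.regionMatches(true, i, "1f", 0, 2) ? 5 : 4;
            int end = Math.min(i + width, hex.length());
            out.appendCodePoint(Integer.parseInt(hex.substring(i, end), 16));
            i = end;
        }
        return out.toString();
    }

    // Replace every emoji <span> in the message with the decoded characters.
    static String restore(String content) {
        Matcher m = SPAN.matcher(content);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // One appendReplacement per match; quoteReplacement keeps the
            // decoded text literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(decode(m.group(1))));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

For example, `EmojiSpanSketch.restore("hi <span class=\"emoji emoji1f1e81f1f3\"></span>")` yields "hi " followed by the two regional indicator symbols U+1F1E8 and U+1F1F3.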
Upvotes: 0