Reputation: 1
I have a txt file of a WhatsApp chat and I want to parse it using Java.
But every emoji used is displayed as "😂" in the txt file. I wanted to find out which emoji is actually used, so I tried this:
System.out.print( "\\u" + Integer.toHexString(line.charAt(i) | 0x10000).substring(1) );
But it displays what looks like the wrong Unicode value, such as \ud83d.
I also got this list but I don't know exactly how to use it: http://grumdrig.com/emoji-list/#
Upvotes: 0
Views: 364
Reputation: 48693
The \uD83D is the first (high) half of a surrogate pair with \uDE04; the two UTF-16 code units are decoded together to produce the single code point U+1F604.
U+1F604 (U+D83D U+DE04)
produces the SMILING FACE WITH OPEN MOUTH AND SMILING EYES emoji ->
😄
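To check this pairing from Java, here is a minimal sketch using the standard Character helpers. The class name SurrogateCheck and the method combine are made up for illustration.

```java
final class SurrogateCheck {
    // Combines a high and a low surrogate into one code point.
    // (Hypothetical helper; it only wraps java.lang.Character calls.)
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int cp = combine('\uD83D', '\uDE04');
        System.out.printf("U+%X%n", cp); // prints U+1F604
    }
}
```

This is why printing one char at a time shows \ud83d: charAt returns a single UTF-16 code unit, not the whole code point.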
This Gist (mranney/emoji_sad.txt) might be a starting point for figuring out how to parse your files.
You could possibly port some of this JavaScript to Java.
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class Main {
    public static void main(String[] args) {
        long codepoint = 0x1f600;
        int[] pair = findSurrogatePair(codepoint);

        System.out.printf("%s -> %s%n", toHex(codepoint),
                IntStream.of(pair).mapToObj(v -> toHex(v))
                        .collect(Collectors.joining(" + ")));
    }

    /**
     * Assumes point > 0xFFFF
     *
     * @param point Unicode codepoint to convert to surrogate pairs.
     * @return Returns the surrogate pairing for the input code-point.
     */
    public static int[] findSurrogatePair(final long point) {
        long offset = point - 0x10000;
        int lead = (int) (0xD800 + (offset >> 10));
        int trail = (int) (0xDC00 + (offset & 0x3FF));
        return new int[] { lead, trail };
    }

    public static String toHex(Number value) {
        return String.format("\\u%X", value);
    }
}
\u1F600 -> \uD83D + \uDE00
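Going the other way, which is what the question needs, you can walk a line by code point instead of by char. The following is only a sketch: it assumes the file was read with the correct charset (for example UTF-8), so that each emoji arrives in the String as a proper surrogate pair. The names EmojiScan and describeSupplementary are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class EmojiScan {
    // Returns "U+XXXX" for every code point above U+FFFF in the line.
    // Emoji inside the BMP (e.g. U+2764) are not reported by this sketch.
    static List<String> describeSupplementary(String line) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);          // reads a full surrogate pair if present
            if (Character.isSupplementaryCodePoint(cp)) {
                found.add(String.format("U+%X", cp));
            }
            i += Character.charCount(cp);          // advance by 1 or 2 chars
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "haha \uD83D\uDE02 ok";
        System.out.println(describeSupplementary(line)); // prints [U+1F602]
    }
}
```

The resulting U+ values can then be looked up in an emoji list such as the one linked in the question. Character.getName(cp) may also return a readable name, depending on the Unicode version your JDK supports.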
Upvotes: 1