Justin A. Ferrell
Justin A. Ferrell

Reputation: 1

How to detect which Apple Emoji is used in text?

I have a txt file of a WhatsApp chat and I want to parse it using Java.

But all emojis used is displayed as "‬😂" in the txt file. I wanted to try and findout how to learn which emoji is actually used and tried this:

‬System.out.print( "\\u" + Integer.toHexString(line.charAt(i) | 0x10000).substring(1) );

But it displays a wrong unicode such as \ud83d etc.

I also got this list but I don't know exactly how to use it: http://grumdrig.com/emoji-list/#

Upvotes: 0

Views: 364

Answers (1)

Mr. Polywhirl
Mr. Polywhirl

Reputation: 48693

The \uD83D is part of a surrogate paring with \uDE04 which is actually encoded together to produce\u0001F604.

U+1F604 (U+D83D U+DE04) produces the SMILING FACE WITH OPEN MOUTH AND SMILING EYES emoji -> 😄

This Gist (mranney/emoji_sad.txt) might be a starting point for figuring out how to parse your files.

Example

You could possibly port some of this JavaScript to Java.

import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class Main {
    public static void main(String[] args) {
        long codepoint = 0x1f600;
        int[] pair = findSurrogatePair(codepoint);

        System.out.printf("%s -> %s%n", toHex(codepoint),
            IntStream.of(pair).mapToObj(v -> toHex(v))
                .collect(Collectors.joining(" + ")));
    }

    /**
     * Assumes point > 0xFFFF
     * <p>
     * 
     * @param point Unicode codepoint to convert to surrogate pairs.
     * @return Returns the surrogate pairing for the input code-point.
     */
    public static int[] findSurrogatePair(final long point) {
        long offset = point - 0x10000;

        int lead = (int) (0xD800 + (offset >> 10));
        int trail = (int) (0xDC00 + (offset & 0x3FF));

        return new int[] { lead, trail };
    }

    public static String toHex(Number value) {
        return String.format("\\u%X", value);
    }
}

Output

\u1F600 -> \uD83D + \uDE00

Upvotes: 1

Related Questions