bdv

Reputation: 1204

Check if letter is emoji

I want to check whether a letter is an emoji. I found some similar questions on SO, including this regex:

private final String emo_regex = "([\\u20a0-\\u32ff\\ud83c\\udc00-\\ud83d\\udeff\\udbb9\\udce5-\\udbb9\\udcee])";

However, when I loop over the letters of a sentence like this:

for (int k=0; k<letters.length;k++) {    
    if (letters[k].matches(emo_regex)) {
        emoticon.add(letters[k]);
    }
}

It never adds any emoji. I've also tried a Pattern with a Matcher, but that didn't work either. Is there something wrong with the regex, or am I missing something obvious in my code?

This is how I get the letter:

String sentence = "Jij staat op 10 😂";
String[] letters = sentence.split("");

The trailing 😂 should be recognized and added to emoticon.

Upvotes: 13

Views: 31805

Answers (11)

Terran

Reputation: 1153

Java 21 added Character::isEmoji (JavaDoc). For example:

String sentence = "This string contains 1 emoji 😂!";
sentence.codePoints()
    .filter(Character::isEmoji)
    .filter(emoji -> !Character.isEmojiComponent(emoji))
    .mapToObj(Character::toString)
    .forEach(System.out::println);

These new methods can also be accessed in regex through property constructs:

Pattern.compile("\\p{IsEmoji}")

Upvotes: 1

J. Hill

Reputation: 515

Here's some Java logic, built on the java.lang.Character API, that I have found tells emoji apart from mere 'special characters' and non-Latin alphabets fairly reliably. Give it a try.

import static java.lang.Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS;
import static java.lang.Character.UnicodeBlock.MISCELLANEOUS_TECHNICAL;
import static java.lang.Character.UnicodeBlock.VARIATION_SELECTORS;
import static java.lang.Character.codePointAt;
import static java.lang.Character.codePointBefore;
import static java.lang.Character.isSupplementaryCodePoint;
import static java.lang.Character.isValidCodePoint;

public boolean checkStringEmoji(String someString) {
    if(!someString.isEmpty() && someString.length() < 5) {
        int firstCodePoint = codePointAt(someString, 0);
        int lastCodePoint = codePointBefore(someString, someString.length());
        if (isValidCodePoint(firstCodePoint) && isValidCodePoint(lastCodePoint)) {
            if (isSupplementaryCodePoint(firstCodePoint) ||
                isSupplementaryCodePoint(lastCodePoint) ||
                Character.UnicodeBlock.of(firstCodePoint) == MISCELLANEOUS_SYMBOLS ||
                Character.UnicodeBlock.of(firstCodePoint) == MISCELLANEOUS_TECHNICAL ||
                Character.UnicodeBlock.of(lastCodePoint) == VARIATION_SELECTORS
            ) {
                // string is emoji
                return true;
            }
        }
    }
    return false;
}

Upvotes: 1

VGR

Reputation: 44404

Unicode has an entire document on this. Emojis and emoji sequences are a lot more complicated than just a few character ranges. There are emoji modifiers (for example, skin tones), regional indicator pairs (country flags), and some special sequences like the pirate flag.
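
To make the point about sequences concrete, here is a small sketch that builds three of the emoji kinds mentioned above from their code points and shows how many code points and Java chars each one takes. The class and method names are invented for illustration; they are not part of any library.

```java
final class EmojiSequenceDemo {
    /** Builds a String from the given Unicode code points. */
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    /** Counts code points rather than UTF-16 chars. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Thumbs up followed by the medium skin tone modifier
        String thumbsUp = fromCodePoints(0x1F44D, 0x1F3FD);
        // Regional indicators N and L (flag of the Netherlands)
        String flag = fromCodePoints(0x1F1F3, 0x1F1F1);
        // Black flag, zero width joiner, skull and crossbones, variation selector-16
        String pirateFlag = fromCodePoints(0x1F3F4, 0x200D, 0x2620, 0xFE0F);

        // Each line prints "<code points> / <chars>"
        System.out.println(codePointCount(thumbsUp) + " / " + thumbsUp.length());     // prints 2 / 4
        System.out.println(codePointCount(flag) + " / " + flag.length());             // prints 2 / 4
        System.out.println(codePointCount(pirateFlag) + " / " + pirateFlag.length()); // prints 4 / 5
    }
}
```

Each of these renders as a single emoji, yet none of them is a single code point, which is why a check on one character or one code point at a time cannot recognize them.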

You can use Unicode's emoji data files to reliably find emoji characters and emoji sequences. This will work even as new complex emojis are added:

import java.net.URL;

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.io.IOException;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import java.util.Collection;
import java.util.ArrayList;
import java.util.Scanner;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class EmojiCollector {
    private static String emojiSequencesBaseURI;

    private final Pattern emojiPattern;

    public EmojiCollector()
    throws IOException {
        StringBuilder sequences = new StringBuilder();

        appendSequencesFrom(
            uriOfEmojiSequencesFile("emoji-sequences.txt"),
            sequences);
        appendSequencesFrom(
            uriOfEmojiSequencesFile("emoji-zwj-sequences.txt"),
            sequences);

        emojiPattern = Pattern.compile(sequences.toString());
    }

    private void appendSequencesFrom(String sequencesFileURI,
                                     StringBuilder sequences)
    throws IOException {
        Path sequencesFile = download(sequencesFileURI);

        Pattern range =
            Pattern.compile("^(\\p{XDigit}{4,6})\\.\\.(\\p{XDigit}{4,6})");
        Matcher rangeMatcher = range.matcher("");

        try (BufferedReader sequencesReader =
            Files.newBufferedReader(sequencesFile)) {

            String line;
            while ((line = sequencesReader.readLine()) != null) {
                if (line.trim().isEmpty() || line.startsWith("#")) {
                    continue;
                }

                int semicolon = line.indexOf(';');
                if (semicolon < 0) {
                    continue;
                }

                String codepoints = line.substring(0, semicolon);

                if (sequences.length() > 0) {
                    sequences.append("|");
                }

                if (rangeMatcher.reset(codepoints).find()) {
                    String start = rangeMatcher.group(1);
                    String end = rangeMatcher.group(2);

                    sequences.append("[\\x{").append(start).append("}");
                    sequences.append("-\\x{").append(end).append("}]");
                } else {
                    Scanner scanner = new Scanner(codepoints);
                    while (scanner.hasNext()) {
                        String codepoint = scanner.next();
                        sequences.append("\\x{").append(codepoint).append("}");
                    }
                }
            }
        }
    }

    private static String uriOfEmojiSequencesFile(String baseName)
    throws IOException {
        if (emojiSequencesBaseURI == null) {
            URL readme = new URL(
                "https://www.unicode.org/Public/UCD/latest/ReadMe.txt");
            try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(readme.openStream(), "UTF-8"))) {

                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.startsWith("Public/emoji/")) {
                        emojiSequencesBaseURI =
                            "https://www.unicode.org/" + line.trim();
                        if (!emojiSequencesBaseURI.endsWith("/")) {
                            emojiSequencesBaseURI += "/";
                        }
                        break;
                    }
                }
            }

            if (emojiSequencesBaseURI == null) {
                // Where else can we get this reliably?
                String version = "15.0";

                emojiSequencesBaseURI =
                    "https://www.unicode.org/Public/emoji/" + version + "/";
            }
        }

        return emojiSequencesBaseURI + baseName;
    }

    private static Path download(String uri)
    throws IOException {

        Path cacheDir;
        String os = System.getProperty("os.name");
        String home = System.getProperty("user.home");
        if (os.contains("Windows")) {
            Path appDataDir;
            String appData = System.getenv("APPDATA");
            if (appData != null) {
                appDataDir = Paths.get(appData);
            } else {
                appDataDir = Paths.get(home, "AppData");
            }

            cacheDir = appDataDir.resolve("Local");
        } else if (os.contains("Mac")) {
            cacheDir = Paths.get(home, "Library", "Application Support");
        } else {
            cacheDir = Paths.get(home, ".cache");
            String cacheHome = System.getenv("XDG_CACHE_HOME");
            if (cacheHome != null) {
                Path dir = Paths.get(cacheHome);
                if (dir.isAbsolute()) {
                    cacheDir = dir;
                }
            }
        }

        String baseName = uri.substring(uri.lastIndexOf('/') + 1);

        Path dataDir = cacheDir.resolve(EmojiCollector.class.getName());
        Path dataFile = dataDir.resolve(baseName);
        if (!Files.isReadable(dataFile)) {
            Files.createDirectories(dataDir);
            URL dataURL = new URL(uri);
            try (InputStream data = dataURL.openStream()) {
                Files.copy(data, dataFile);
            }
        }

        return dataFile;
    }

    public Collection<String> getEmojisIn(String letters) {
        Collection<String> emoticons = new ArrayList<>();

        Matcher emojiMatcher = emojiPattern.matcher(letters);
        while (emojiMatcher.find()) {
            emoticons.add(emojiMatcher.group());
        }

        return emoticons;
    }

    public static void main(String[] args)
    throws IOException {
        EmojiCollector collector = new EmojiCollector();

        for (String arg : args) {
            Collection<String> emojis = collector.getEmojisIn(arg);
            System.out.println(arg + " => " + String.join("", emojis));
        }
    }
}

Upvotes: 1

mathematics-and-caffeine

Reputation: 2207

This is how Telegram does it:

private static boolean isEmoji(String message){
    return message.matches("(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|" +
        "[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|" +
        "[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|" +
        "[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|" +
        "[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|" +
        "[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|" +
        "[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|" +
        "[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|" +
        "[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|" +
        "[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|" +
        "[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)+");
}

This is line 21,026 of ChatActivity.

Upvotes: 2

Aarush Kumar

Reputation: 167

Here you go -

for (String word : sentence.split("")) {
    if (word.matches(emo_regex)) {
        System.out.println(word);
    }
}

Upvotes: 0

coder4

Reputation: 319

Try the simple-emoji-4j project.

Compatible with Emoji 12.0 (2018.10.15)

Usage is simple:

EmojiUtils.containsEmoji(str)

Upvotes: 4

Noamaw

Reputation: 123

I wrote this function to check whether a given String consists only of emojis. In other words, if the String contains any character not covered by the regex, it returns false.

private static boolean isEmoji(String message){
    return message.matches("(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|" +
            "[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|" +
            "[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|" +
            "[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|" +
            "[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|" +
            "[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|" +
            "[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|" +
            "[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|" +
            "[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|" +
            "[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|" +
            "[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)+");
}

Example of implementation:

public static int detectEmojis(String message){
    int len = message.length(), numEmoji = 0;
    // Only count when the given String consists solely of emojis.
    if(isEmoji(message)){
        for (int i = 0; i < len; i++) {
            // If charAt(i) is an emoji by itself, count it.
            if (isEmoji(message.charAt(i)+"")) {
                numEmoji++;
            } else {
                // Otherwise the emoji may span two chars, so check for a surrogate pair.
                if (i < (len - 1)) { // some emojis are two chars long in Java, e.g. the rocket emoji is "\uD83D\uDE80"
                    if (Character.isSurrogatePair(message.charAt(i), message.charAt(i + 1))) {
                        i += 1; // also skip the second char of the emoji
                        numEmoji++;
                    }
                }
            }
        }
        return numEmoji;
    }
    return 0;
}

The function above runs on a String (of only emojis) and returns the number of emojis in it. (Written with the help of other answers I found here on Stack Overflow.)

Upvotes: 6

slim

Reputation: 41263

It's worth bearing in mind that Java code can be written in Unicode. So you can just do:

@Test
public void containsEmoji_detects_smileys() {
    assertTrue(containsEmoji("This 😂 is a smiley "));
    assertTrue(containsEmoji("This 😄 is a different smiley"));
    assertFalse(containsEmoji("No smiley here"));
}

private boolean containsEmoji(String s) {
    String pattern = ".*[😂😄].*";
    return s.matches(pattern);
}

Although see "Should source code be saved in UTF-8 format" for discussion on whether that's a good idea.


You can split a String into Unicode codepoints in Java 8 using String.codePoints(), which returns an IntStream. That means you can do something like:

Set<Integer> emojis = new HashSet<>();
emojis.add("😂".codePointAt(0));
emojis.add("😄".codePointAt(0));
String s = "1😂34😄5";
s.codePoints().forEach( codepoint -> {
    System.out.println(
        new String(Character.toChars(codepoint)) 
        + " " 
        + emojis.contains(codepoint));
});

... prints ...

1 false
😂 true
3 false
4 false
😄 true
5 false

Of course, if you prefer not to have literal Unicode characters in your code, you can just put numbers in your set:

emojis.add(0x1F601);

Upvotes: 1

user2474486

Reputation: 211

You can use the Character class to determine whether a letter is part of a surrogate pair. It has some helpful methods for dealing with emoji encoded as surrogate pairs, for example:

String text = "💩";
if (text.length() > 1 && Character.isSurrogatePair(text.charAt(0), text.charAt(1))) {
    int codePoint = Character.toCodePoint(text.charAt(0), text.charAt(1));
    char[] c = Character.toChars(codePoint);
}
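
Building on those methods, a whole string can be walked one code point at a time instead of one char at a time. This is a minimal sketch with made-up class and method names. Note that it only reports code points outside the Basic Multilingual Plane: that is not the same as "is an emoji", since some emoji (such as U+263A) fit in a single char and many supplementary characters are not emoji at all.

```java
import java.util.ArrayList;
import java.util.List;

final class SurrogateWalk {
    /** Returns every supplementary code point in text as its own String. */
    static List<String> supplementaryCodePoints(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(codePoint)) {
                result.add(new String(Character.toChars(codePoint)));
            }
            // Advance by 2 chars for a surrogate pair, otherwise by 1.
            i += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        // U+1F4A9 written as a surrogate pair, so the file encoding does not matter.
        String text = "a\uD83D\uDCA9b";
        System.out.println(text.length());                          // prints 4
        System.out.println(supplementaryCodePoints(text).size());   // prints 1
    }
}
```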

Upvotes: 4

Chaitanya

Reputation: 2444

You could use the emoji4j library. The following should solve the issue.

String htmlifiedText = EmojiUtils.htmlify(text);
// regex to identify HTML entities in htmlified text
Matcher matcher = htmlEntityPattern.matcher(htmlifiedText);

while (matcher.find()) {
    String emojiCode = matcher.group();
    if (isEmoji(emojiCode)) {

        emojis.add(EmojiUtils.getEmoji(emojiCode).getEmoji());
    }
}

Upvotes: 7

tobias_k

Reputation: 82929

It seems like those emojis are two characters long, but with split("") you are splitting between each single character, thus none of those letters can be the emoji you are looking for.

Instead, you could try splitting between words:

for (String word : sentence.split(" ")) {
    if (word.matches(emo_regex)) {
        System.out.println(word);
    }
}

But of course this will miss emojis that are joined to a word, or punctuation.

Alternatively, you could just use a Matcher to find any group in the sentence that matches the regex.

Matcher matcher = Pattern.compile(emo_regex).matcher(sentence);
while (matcher.find()) {
    System.out.println(matcher.group());
}
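
The two-character point above can be checked directly. This is a minimal sketch, assuming a simplified stand-in pattern that covers only the Emoticons block (U+1F600 to U+1F64F) rather than the question's emo_regex; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SplitDemo {
    // Assumption: a stand-in pattern for the Emoticons block only,
    // not the question's emo_regex.
    private static final Pattern EMOTICON =
        Pattern.compile("[\\x{1F600}-\\x{1F64F}]");

    /** Splits a sentence into one String per code point, keeping surrogate pairs whole. */
    static List<String> splitByCodePoints(String sentence) {
        return sentence.codePoints()
            .mapToObj(cp -> new String(Character.toChars(cp)))
            .collect(Collectors.toList());
    }

    /** Collects every match of the stand-in pattern, wherever it sits in the sentence. */
    static List<String> findEmoticons(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMOTICON.matcher(sentence);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // U+1F602 written as a surrogate pair, so the file encoding does not matter.
        String emoji = "\uD83D\uDE02";
        String sentence = "Jij staat op 10 " + emoji;

        System.out.println(emoji.length());                           // prints 2
        System.out.println(emoji.codePointCount(0, emoji.length()));  // prints 1
        System.out.println(splitByCodePoints(sentence).size());       // prints 17
        System.out.println(findEmoticons(sentence).size());           // prints 1
    }
}
```

Splitting by code point keeps each emoji in one piece, so the original per-letter loop could work on the result of splitByCodePoints, while the Matcher variant avoids splitting altogether.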

Upvotes: 4
