windweller

Reputation: 2385

Convert Java regex OR helper to Scala regex

I'm working on a Twitter project, and one step is to strip all emoticons from a tweet so they don't trip up the parser. I took a look at Carnegie Mellon's Ark Tweet NLP, which is pretty amazing, and it has a really nice Java regex pattern for detecting emoticons!

However, I'm not very familiar with Java's regex syntax (I only know the basics).

https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java

The code I need to convert to Scala looks like this:

public static String emoticon = OR(
        // Standard version  :) :( :] :D :P
        "(?:>|&gt;)?" + OR(normalEyes, wink) + OR(noseArea,"[Oo]") + 
            OR(tongue+"(?=\\W|$|RT|rt|Rt)", otherMouths+"(?=\\W|$|RT|rt|Rt)", sadMouths, happyMouths),


        // reversed version (: D:  use positive lookbehind to remove "(word):"
        // because eyes on the right side is more ambiguous with the standard usage of : ;
        "(?<=(?: |^))" + OR(sadMouths,happyMouths,otherMouths) + noseArea + OR(normalEyes, wink) + "(?:<|&lt;)?",


        //inspired by http://en.wikipedia.org/wiki/User:Scapler/emoticons#East_Asian_style
        eastEmote.replaceFirst("2", "1"), basicface
        // iOS 'emoji' characters (some smileys, some symbols) [\ue001-\uebbb]  
        // TODO should try a big precompiled lexicon from Wikipedia, Dan Ramage told me (BTO) he does this
);

The OR helper is a bit confusing.

Can anyone tell me how to do the conversion? Also, after the conversion, is it enough to quickly split each tweet into words and check word.contains(emoticon)? Thank you!


It seems the question above is rather idiotic. However, there is one last part of the task I'm unsure about:

I want to take those emoticons out of my sentences. Will it work if I just split each sentence on spaces into words and do for (word <- words if !word.contains(regexpattern))?

Upvotes: 1

Views: 435

Answers (1)

Will Fitzgerald

Reputation: 1382

You can use this function:

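// The outer non-capturing group keeps the alternation intact when the result is concatenated with other fragments.
def OR(patterns: String*): String = patterns.map(p => s"(?:$p)").mkString("(?:", "|", ")")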
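As a rough sketch of how such a helper might be used to filter emoticons out of a tweet: the fragment values below (normalEyes, wink, noseArea, happyMouths, sadMouths) are simplified stand-ins rather than the real Twokenize definitions, and hasEmoticon and stripEmoticons are hypothetical names made up for illustration. The helper here puts the whole alternation inside one non-capturing group so that concatenating two OR results, as the Java snippet does, keeps the intended grouping. It also uses findFirstIn instead of String.contains, because contains does a literal substring check, not a regex match.

```scala
import scala.util.matching.Regex

object EmoticonSketch {
  // Joins the alternatives with "|" inside a single non-capturing group,
  // so the result can be concatenated with other pattern fragments.
  def OR(patterns: String*): String = patterns.mkString("(?:", "|", ")")

  // Simplified, hypothetical stand-ins for the Twokenize fragments.
  val normalEyes  = "[:=]"
  val wink        = "[;]"
  val noseArea    = "-?"
  val happyMouths = "[D\\)\\]]+"
  val sadMouths   = "[\\(\\[]+"

  // Eyes, an optional nose, then a sad or happy mouth.
  val emoticon: Regex =
    (OR(normalEyes, wink) + noseArea + OR(sadMouths, happyMouths)).r

  // True if the regex matches anywhere inside the word.
  def hasEmoticon(word: String): Boolean =
    emoticon.findFirstIn(word).isDefined

  // Splits on whitespace and drops every word that contains an emoticon.
  def stripEmoticons(tweet: String): String =
    tweet.split("\\s+").filterNot(hasEmoticon).mkString(" ")
}
```

With these stand-in fragments, EmoticonSketch.stripEmoticons("great day :) see you ;-D") gives "great day see you". If an emoticon can be glued to a word and only the emoticon should go, emoticon.replaceAllIn(tweet, "") may be a better fit than dropping whole words.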

Upvotes: 2
