Prateek Narendra

Reputation: 1937

Match all symbols except emoticons

I use Ruby 1.8.7. I have a regex, valid in Ruby 1.8.7, that matches all emoticons:

/\|?>?[:*;Xx8=<(%)D]-?'?,?o?\_^?[-DOo0S*Ppb3c:;\/\\|)(}{\]><]\)?|\(/

However, I want to match every symbol except the ones this regex matches. For example, given the following string:

hi =as.) friend:) haha yay! ;) =) (test test) R&R I.O.U. :> :} :{ :< :<  =) :S ;o) >:) :-| :| :o :*) %-( )-: ): )o: 8-0 8/ 8\ 8c :'( :'-( :( :*( :,( :-( :-/ :-S :-\ :-| :/ :O :S :\ :| =( >:( D: (o; 8-) ;) ;o) %-) (-: (: (o: 8) :) :-D :-P :D :P :P :] :o) :p <3 =) =] >:) >:D >=D 

I need it to match

= .) () & . . .

Refer to - http://rubular.com/r/QpteIutq3B

How can I achieve this?

Upvotes: 1

Views: 509

Answers (1)

Aran-Fey

Reputation: 43316

I think this is a very difficult task to do with regex.

My first idea was to use a negative lookahead assertion (that matches emoticons) before matching a symbol, like

(?!\|?>?[:*;Xx8=<(%)D]-?'?,?o?\_?[-DOo0S*Ppb3c:;\/\\|)(}{\]><]\)?|\()[:;._()]
# works like "if no emoticon at this position, then match a symbol"

Unfortunately, that doesn't work. (See demo.) This is partly because your pattern produces many false positives (it matches things that aren't emoticons), but there is also a fundamental problem: the lookahead only stops a match from starting on the first character of an emoticon, so the remaining characters of the emoticon still get matched. Maybe a more experienced regex user can make this work with fancy regex magic, though.
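To see that fundamental problem in isolation, here is a small Ruby sketch. It uses a deliberately simplified stand-in emoticon pattern (SIMPLE_EMOTICON is my own assumption for illustration, not the pattern from the question) so the behaviour is easy to follow:

```ruby
# Hypothetical, simplified emoticon pattern (illustration only):
# eyes, an optional "-" nose, then a mouth.
SIMPLE_EMOTICON = /[:;=]-?[()DP]/

# "If no emoticon starts at this position, then match a symbol."
NAIVE_SYMBOL = /(?!#{SIMPLE_EMOTICON.source})[:;=().-]/

# The ":" of ":-)" is skipped, but "-" and ")" are not the start of an
# emoticon, so the lookahead lets them through.
p "hi :-) ok.".scan(NAIVE_SYMBOL)
# => ["-", ")", "."]
```

Only the final "." is a symbol we actually wanted; the "-" and ")" are the tail of the smiley.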


All that being said, there's only one way I can think of: for each character you want to match, use lookbehind and lookahead assertions to make sure it's not part of any emoticon. This is a lot of work. For example, to match the characters =:; I came up with the following pattern:

(?<![(){}\[\]<>|D])(?<![(){}\[\]<>|][o-])[=:;](?!'?[o*,-]?[(){}\[\]<>|PpD\\\/OSso0])

The basic idea is this: the characters =:; are usually used as an emoticon's eyes. Therefore, we have to assert that there's no (optional) nose o*,- and no mouth (){}[]<>|PpD\/OSso0 either to the left or to the right. To make things even worse, lookbehind assertions must have a fixed width in many regex engines and so can't contain quantifiers, hence the two separate assertions (?<![(){}\[\]<>|D]) and (?<![(){}\[\]<>|][o-]) (the first matches a mouth, the second a mouth followed by a nose).
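A quick way to try the eyes-only pattern is the sketch below. It assumes Ruby 1.9 or later, because as far as I know the regex engine in Ruby 1.8 has no lookbehind support; the constant name and the sample string are mine.

```ruby
# The eyes-only pattern from above, copied unchanged into a Ruby literal.
# Assumes Ruby 1.9+ (lookbehind is needed).
LONE_EYES = /(?<![(){}\[\]<>|D])(?<![(){}\[\]<>|][o-])[=:;](?!'?[o*,-]?[(){}\[\]<>|PpD\\\/OSso0])/

# ":)" is rejected by the lookahead and "(:" by the first lookbehind;
# the free-standing "=" and ";" are kept.
p "a = b :) (: x; y".scan(LONE_EYES)
# => ["=", ";"]
```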

Constructing the full pattern to match all special characters would require a lot of effort, and it'd probably be horribly long and confusing.

If you aren't forced to do this with pure regex, I'd recommend using regex to remove all emoticons from the string, and then find all remaining symbols.
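A minimal sketch of that two-step approach might look like this. STAND_IN_EMOTICON is a hypothetical, much-simplified pattern used only to keep the example short (substitute your real emoticon regex), and symbols_outside_emoticons is a helper name I made up:

```ruby
# Hypothetical stand-in emoticon pattern; substitute your full pattern here.
STAND_IN_EMOTICON = /[:;=]-?[()DPp]/

# Step 1: replace every emoticon with a space (a space rather than "" so the
#         characters on either side cannot join up into a new emoticon).
# Step 2: collect every remaining character that is neither a word
#         character nor whitespace.
def symbols_outside_emoticons(text)
  text.gsub(STAND_IN_EMOTICON, " ").scan(/[^\w\s]/)
end

p symbols_outside_emoticons("hi =as.) friend:) R&R I.O.U. ;)")
# => ["=", ".", ")", "&", ".", ".", "."]
```

This uses only gsub, scan and plain character classes, so it should not depend on lookbehind support. How good the result is depends entirely on how accurate the emoticon pattern is.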


P.S. I made this pattern to match emoticons. It works reasonably well with rotated smiley faces like :x, >:|, (:, and the like, and it should also produce fewer false positives than your pattern.

UPDATE #2: Pattern no longer matches numbers. Added support for eastern smiley faces. Misc small improvements. Matches a decent amount of Wikipedia's list of emoticons now. (See demo)

(?!\d\d)(?![a-zA-Z]{2})(?:(?:>?[:;=%8BXx]['‘’]?[-o*,^っ]?(?:(?P<mouth>[()|Il])(?P=mouth)*|[\/0\]o\\D\[PpSs<>{}CcOXx*3@ÞþbL&?$#]))|(?:[()\\{}\/<\[>\]DOo0|SsXxlI*@q][-o*,]?['‘’]?[:=8;%Xx]<?))|(?P<head>\()?(?:(?P<eye>[<>v*.^~=ಠ-])?[_.-](?P=eye)|[o0O][_.-][o0O]|>[_.-]?<)['‘’]?(?(head)\))|xD|XD|XP|xP|DX|<3|\^\^|\\o\/|o\/|\\o

Upvotes: 1
