dscer

Reputation: 228

Emoticon Matching - PHP

I need to extract different types of terms from a string. I am successfully extracting alphanumeric terms, currency amounts, and other numeric formats with this regex:

$numalpha = '(\d+[a-zA-Z]+)';
$digitsPattern = '(\$|€|£)?\d+(\.\d+)?';
$wordsPattern = '[\p{L}]+';
preg_match_all('/('.$numalpha. '|' .$digitsPattern.'|'.$wordsPattern.')/ui', $str, $matches);
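
As a quick illustration, here is a minimal sketch of that pattern run against a made-up sample string. The input sentence is an assumption chosen for the example; the three sub-patterns are copied unchanged.

```php
<?php
// Same three sub-patterns as above.
$numalpha      = '(\d+[a-zA-Z]+)';
$digitsPattern = '(\$|€|£)?\d+(\.\d+)?';
$wordsPattern  = '[\p{L}]+';

// Hypothetical sample input, chosen only for illustration.
$str = 'I paid $12.50 for 3kg';

preg_match_all('/(' . $numalpha . '|' . $digitsPattern . '|' . $wordsPattern . ')/ui', $str, $matches);

// $matches[0] holds the full text of every match, in order of appearance.
print_r($matches[0]); // I, paid, $12.50, for, 3kg
```

The file needs to be saved as UTF-8 for the `u` modifier to accept the currency symbols.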

I also need to match emoticons. I compiled the following regex:

#(^|\W)(\>\:\]|\:-\)|\:\)|\:o\)|\:\]|\:3|\:c\)|\:\>|\=\]|8\)|\=\)|\:\}|\:\^\)|\>\:D|\:-D|\:D|8-D|x-D|X-D|\=-D|\=D|\=-3|8-\)|\>\:\[|\:-\(|\:\(|\:-c|\:c|\:-\<|\:-\[|\:\[|\:\{|\>\.\>|\<\.\<|\>\.\<|\>;\]|;-\)|;\)|\*-\)|\*\)|;-\]|;\]|;D|;\^\)|\>\:P|\:-P|\:P|X-P|x-p|\:-p|\:p|\=p|\:-Þ|\:Þ|\:-b|\:b|\=p|\=P|\>\:o|\>\:O|\:-O|\:O|°o°|°O°|\:O|o_O|o\.O|8-0|\>\:\\|\>\:/|\:-/|\:-\.|\:\\|\=/|\=\\|\:S|\:'\(|;'\()($|\W)#

which works only up to a point.

It does not seem to match emoticons at the end of the string, even though I specified

($|\W)

inside the regex.

------------------EDIT-----------------

I removed the ($|\W) as Tiddo suggested, and it now matches emoticons at the end of the string. The remaining problem is that the regex, which still contains (^|\W), also matches the character preceding the emoticon.

For a test string:

$str = ":) Testing ,,:) ::) emotic:-)ons ,:( :D :O hsdhfkd :(";

The matches are as follows:

(
[0] => :)
[1] => ,:)
[2] => ::)
[3] => ,:(
[4] =>  :D
[5] =>  :O
[6] =>  :(
)

(The ',', ' ' and ':' characters preceding the ':)' and ':(' emoticons are included in the matches.)

How can this be fixed?

Upvotes: 1

Views: 973

Answers (1)

anubhava

Reputation: 784898

If you change the $full assignment to this regex, which is based on positive lookaheads:

$full = "#(?=^|\W|\w)(" . $regex .")(?=\w|\W|$)#";

or simply to this one, without any boundary check:

$full = "#(" . $regex .")#";

either one will match the emoticons as you expect. See the working code here: http://ideone.com/EcCrD

Explanation: In your original code you had:

$full = "#(^|\W)(" . $regex . ")(\W|$)#";

This pattern also matches and consumes the non-word characters on either side of each emoticon. Consider two emoticons separated by a single non-word character, such as a space. The regex matches the first emoticon and consumes the space after it as part of that match. When the search resumes at the second emoticon, there is no preceding \W left to match, so (^|\W) fails and that emoticon is skipped.
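
A minimal sketch of that behaviour, assuming a shortened two-emoticon alternation in place of the full list ($regex and $str here are stand-ins chosen for the example):

```php
<?php
// Shortened stand-in for the full emoticon alternation (assumption).
$regex = ':\)|:\(';
$str   = ':) :(';

// Consuming version: the trailing (\W|$) swallows the space after ':)'.
preg_match_all('#(^|\W)(' . $regex . ')(\W|$)#', $str, $consuming);

// Full matches: only one entry, ':) ' with the trailing space included.
print_r($consuming[0]);

// Capture group 2: the emoticon text alone, still only ':)'.
print_r($consuming[2]);
```

The second emoticon is never reported because the space it needed in front of it was already used up by the first match.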

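In my answer the surrounding characters are checked with positive lookaheads, which do not consume anything, so every emoticon is matched. Note that a lookahead accepting both \w and \W succeeds at every position, so the first pattern behaves the same as the second, and neither prevents an emoticon inside a word (such as emotic:-)ons) from matching.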
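
If emoticons embedded in words should still be skipped, one possible sketch is to keep the boundary test but make it non-consuming. This swaps the lookaheads above for a negative lookbehind and negative lookahead pair, which is a different technique from the one in this answer. The $emoticons array is a hypothetical subset of the question's list, used only to keep the example short.

```php
<?php
// Hypothetical subset of the emoticon list, for illustration only.
$emoticons = [':-)', ':)', ':(', ':D', ':O'];

// Try longer emoticons first so that an emoticon is never cut short
// by a shorter one that happens to be its prefix.
usort($emoticons, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// Escape each emoticon for use inside a '#'-delimited pattern.
$alternation = implode('|', array_map(function ($e) {
    return preg_quote($e, '#');
}, $emoticons));

// (?<!\w) and (?!\w) inspect the neighbouring characters without consuming them.
$full = '#(?<!\w)(?:' . $alternation . ')(?!\w)#';

$str = ':) Testing ,,:) ::) emotic:-)ons ,:( :D :O hsdhfkd :(';
preg_match_all($full, $str, $matches);

print_r($matches[0]); // :) :) :) :( :D :O :(
```

With the question's test string this reports seven emoticons without any leading characters, and the ':-)' inside "emotic:-)ons" is left out because it is preceded by a word character.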

Upvotes: 1
