Reputation: 41
I need to remove all non-alphanumeric characters except spaces and allowed emoticons.
Allowed emoticons are :), :(, :P, etc. (the most popular).
I have a string:
$string = 'Hi! Glad # to _ see : you :)';
so I need to process this string and get the following:
$string = 'Hi Glad to see you :)';
Also, please note that emoticons can contain spaces,
e.g. : ) instead of :) or : P instead of :P
Does anyone have a function to do this?
If someone helped me it would be so great :)
UPDATE
Thank you very much for your help.
buckley offered a ready-made solution,
but if the string contains an emoticon with a space,
e.g. Hi! Glad # to _ see : you : )
the result is Hi Glad to see you
As you can see, the emoticon : ) was cut off.
Upvotes: 4
Views: 1015
Reputation: 23892
I'd use this regex,
(?i)(:\s*[)p(])(*SKIP)(*FAIL)|[^a-z0-9 ]
Demo: https://regex101.com/r/nW6iL3/2
PHP Usage:
$string = ': ) instead of :)
or
: P instead of :P
Hi! Glad # to _ see : you :)';
echo preg_replace('~(?i)(:\s*[)p(])(*SKIP)(*FAIL)|[^a-z0-9 ]~', '', $string);
Output:
: ) instead of :)or: P instead of :PHi Glad to see you :)
Demo: https://eval.in/416394
If the closing part of the emoticon changes or you have others, you can add them inside the character class [)p(].
You could also change the eyes by replacing the : with a character class. For example, to also allow winking faces (I think the semicolon is the wink), you could use
(?i)([:;]\s*[)p(])(*SKIP)(*FAIL)|[^a-z0-9 ]
Update
Bit by bit explanation...
(?i) = make the regex case insensitive
: = match the eyes (a colon)
\s* = match zero or more whitespace characters (the * means zero or more of the preceding element). \h might be better here, because \s also matches new lines, while \h matches only horizontal whitespace such as spaces and tabs.
[)p(] = a character class allowing any one of the characters inside it, so ), p, or ( is allowed here.
(*SKIP)(*FAIL) = if the preceding part of the pattern matched, discard that match and continue searching after it; see www.rexegg.com/regex-best-trick.html.
| = or
[^a-z0-9 ] = a negated character class, matching any character that is not in the list.
The regex101 demo also has documentation on the regex.
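As a minimal sketch of the suggestions above (using \h instead of \s, allowing ; as eyes, and collapsing leftover spaces in a second pass), something like the following could work. The function name stripExceptEmoticons is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function stripExceptEmoticons(string $text): string
{
    // Emoticons (eyes, optional horizontal whitespace, mouth) are skipped;
    // any other character that is not a letter, digit, or space is removed.
    $pattern = '~(?i)[:;]\h*[)p(](*SKIP)(*FAIL)|[^a-z0-9 ]~';
    $cleaned = preg_replace($pattern, '', $text);

    // Second pass: collapse the runs of spaces left behind by removed characters.
    return trim(preg_replace('~ {2,}~', ' ', $cleaned));
}

echo stripExceptEmoticons('Hi! Glad # to _ see : you : )'), PHP_EOL;
// prints "Hi Glad to see you : )"
```

One caveat: because the pattern is case insensitive and allows whitespace after the eyes, a colon followed by a word starting with p (as in "note: please") is also treated as an emoticon and kept.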
Upvotes: 1
Reputation: 8332
I don't "speak" php ;) but this does it in JS. Maybe you can convert it.
var sIn = 'Hi! Glad # to _ see : you :)',
sOut;
// [a-zA-Z0-9\s] rather than [\w\s], because \w also matches the underscore
sOut = sIn.match(/([a-zA-Z0-9\s]|: ?\)|: ?\(|: ?P)*/g).join('');
It works the other way around from your attempt: it finds all "legal" characters/combinations and joins them together.
Regards
Edit: Updated regex to handle optional spaces in emoticons (as commented earlier).
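For anyone who wants the same "collect what is allowed" idea in PHP, here is a rough conversion using preg_match_all. It is only a sketch: the function name keepAllowed is made up for this example, and the character class is spelled out as [a-zA-Z0-9\s] so that underscores are not kept.

```php
<?php
// Hypothetical function name, used only for this illustration.
function keepAllowed(string $text): string
{
    // Match runs of allowed tokens: a letter, digit, or whitespace character,
    // or one of the emoticons :) :( :P with an optional space after the colon.
    preg_match_all('~(?:[a-zA-Z0-9\s]|: ?[)(P])+~', $text, $matches);

    // $matches[0] holds the full matches in order; glue them back together.
    return implode('', $matches[0]);
}

echo keepAllowed('Hi! Glad # to _ see : you : )'), PHP_EOL;
```

The spaces on both sides of a removed character are kept, so the result still contains double spaces where a character was dropped.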
Upvotes: 3
Reputation: 14069
Here is an updated answer that meets the new requirement that an emoticon can contain a space
Replace
((:\))|(:\()|(:P)|(: \))|: P)|[^0-9a-zA-Z\r\n ]
With
$1
Formatted in free spacing mode this becomes
(?x)
(
(?::\))|
(?::\()|
(?::P)|
(?::\ \))|
:\ P
)|
[^0-9a-zA-Z\r\n ]
In PHP
$result = preg_replace('/((:\))|(:\()|(:P)|(: \))|: P)|[^0-9a-zA-Z\r\n ]/', '$1', $subject);
The idea is to start the regex with the emoticons, which consist of multiple characters that would individually be illegal.
This group is captured and then used as the replacement, $1.
Then, after the alternation, we use a negated whitelist of characters, so every illegal character is matched but, since it is not part of group 1, does not appear in the replacement.
Everything that is not matched (our whitelist) is copied to the result as is, as usual.
One thing to note is that there is a lot of capturing when listing the emoticons, which can hinder performance. To prevent this, we can use non-capturing groups, which makes the regex a bit more verbose:
((?::\))|(?::\()|(?::P)|(?:: \))|: P)|[^0-9a-zA-Z\r\n ]
The multiple consecutive spaces remain and can't be solved in 1 sweep AFAIK.
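If a second sweep is acceptable, the leftover consecutive spaces can be collapsed afterwards. A minimal sketch, assuming two preg_replace calls are fine; the function name cleanTwoPass is hypothetical:

```php
<?php
// Hypothetical function name; the first pattern is the one from this answer.
function cleanTwoPass(string $subject): string
{
    // Pass 1: emoticons are captured in group 1 and put back via $1;
    // any other disallowed character is replaced by an empty $1.
    $pattern = '/((?::\))|(?::\()|(?::P)|(?:: \))|: P)|[^0-9a-zA-Z\r\n ]/';
    $result = preg_replace($pattern, '$1', $subject);

    // Pass 2: collapse runs of two or more spaces into a single space.
    return preg_replace('/ {2,}/', ' ', $result);
}

echo cleanTwoPass('Hi! Glad # to _ see : you : )'), PHP_EOL;
// prints "Hi Glad to see you : )"
```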
Upvotes: 2
Reputation: 14069
Ha! This one was interesting
Replace
(?!(:\)|:\(|:P))[^a-zA-Z0-9 ](?<!(:\)|:\(|:P))
With nothing
The idea is that you sandwich the illegal character between the same pattern, once as a negative lookahead and once as a negative lookbehind.
The result will have consecutive spaces in it. This is something that a regex cannot fix in 1 sweep AFAIK, because it can't look at multiple matches at once.
To eliminate the consecutive spaces, you can replace \s+
with a single space.
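Put together in PHP, the two steps might look like the sketch below. The function and variable names are hypothetical, and, like the regex above, it only knows the emoticons :) :( :P without spaces.

```php
<?php
// Hypothetical function name, for illustration only.
function stripWithLookarounds(string $text): string
{
    // The emoticon alternatives, reused in both lookarounds.
    $emoticons = ':\)|:\(|:P';

    // A disallowed character is removed only if no emoticon starts at it
    // (negative lookahead) and no emoticon ends at it (negative lookbehind).
    $pattern = '~(?!' . $emoticons . ')[^a-zA-Z0-9 ](?<!' . $emoticons . ')~';
    $stripped = preg_replace($pattern, '', $text);

    // Second sweep: replace each run of whitespace with a single space.
    return trim(preg_replace('~\s+~', ' ', $stripped));
}

echo stripWithLookarounds('Hi! Glad # to _ see : you :)'), PHP_EOL;
// prints "Hi Glad to see you :)"
```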
Upvotes: 2
Reputation: 361
Here is a string formatter that could do the job making the assumption that emoticons are 2 characters long in general:
<?php
class StringFormatter
{
    private $blacklist;
    private $whitelist;

    public function __construct(array $blacklist, array $whitelist)
    {
        $this->blacklist = $blacklist;
        $this->whitelist = $whitelist;
    }

    public function format($str)
    {
        $strLen = strlen($str);
        $result = '';
        $counter = 0;
        while ($counter < $strLen) {
            // get a character from the string
            $char = substr($str, $counter, 1);
            // if not blacklisted, allow it in the result
            if (!in_array($char, $this->blacklist)) {
                $result .= $char;
                $counter++;
                continue;
            }
            // if we reached the last character, break out of the loop
            if ($counter >= $strLen - 1) {
                break;
            }
            // we assume all whitelisted entries have the same length (e.g. 2
            // for emoticons)
            if (in_array(substr($str, $counter, 2), $this->whitelist)) {
                $result .= substr($str, $counter, 2);
                $counter += 2;
            } else {
                $counter++;
            }
        }
        return $result;
    }
}

// example usage
// $whitelist is not the entire whitelist; it is the list of exceptions
// to the blacklist, i.e. more complex strings that include blacklisted characters but should be allowed
$formatter = new StringFormatter(['#', '_', ':', '!'], [':)', ':(']);
echo $formatter->format('Hi! Glad # to _ see : you :)');
The code above can be further refactored to be cleaner, but you get the picture.
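One possible direction for such a refactoring, sketched here under the assumption that the exceptions are non-empty strings of any length (so that ': )' can be allowed as well), is to check the exceptions before the blacklist. The function name formatWithExceptions is hypothetical and not part of the class above.

```php
<?php
// Hypothetical standalone function; assumes every exception is a non-empty string.
function formatWithExceptions(string $str, array $blacklist, array $exceptions): string
{
    // Try longer exceptions first so that a longer emoticon wins over a shorter one.
    usort($exceptions, fn($a, $b) => strlen($b) <=> strlen($a));

    $result = '';
    $i = 0;
    $len = strlen($str);
    while ($i < $len) {
        $matched = false;
        foreach ($exceptions as $exception) {
            // Does the string contain this exception at the current position?
            if (substr($str, $i, strlen($exception)) === $exception) {
                $result .= $exception;
                $i += strlen($exception);
                $matched = true;
                break;
            }
        }
        if ($matched) {
            continue;
        }
        // No exception starts here: keep the character unless it is blacklisted.
        if (!in_array($str[$i], $blacklist, true)) {
            $result .= $str[$i];
        }
        $i++;
    }
    return $result;
}

echo formatWithExceptions('Hi! Glad # to _ see : you : )', ['#', '_', ':', '!'], [':)', ':(', ': )']), PHP_EOL;
```

As with the class above, the spaces around removed characters are left in place, so the output still contains double spaces.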
Upvotes: 1