Andrew Rayner

Reputation: 1064

Remove special characters that mess with formatting

I am currently building a chat and can't find a way to stop users from posting special characters that break the chat's formatting and lag other users out of it.

Ultimately I want to remove these characters entirely. I know that the code I have now would, even if it worked, only replace them, but I was just trying to get something working first.

Here is the code I am using to censor and clean up the message. I thought htmlentities() would handle it, but it does not seem to work properly.

        $message = $censor->censorString(
            $this->parseUrls(
                htmlentities(
                    strip_tags($message)
                )
            )
        ); // Strip $message of profanity, HTML tags, and special characters

Here is a screenshot of my problem: (image not included)

Upvotes: 1

Views: 1072

Answers (2)

nwellnhof

Reputation: 33638

Contrary to many answers you'll find on Stack Overflow, it is trivial to sanitize "Zalgo" text with a regex engine that supports matching on Unicode categories. PHP's preg_* functions use the PCRE library. If PCRE is compiled with --enable-unicode-properties, you can strip all Unicode combining marks (category M, matched by \pM) using:

$sanitized = preg_replace('/\pM/u', '', $zalgo);

Or allow a certain maximum of consecutive combining marks, say one:

$sanitized = preg_replace('/(\pM)\pM+/u', '\1', $zalgo);

Or two:

$sanitized = preg_replace('/(\pM{2})\pM+/u', '\1', $zalgo);

This will turn Zalgo text like

T̫̺̳o̬̜ ì̬͎̲̟nv̖̗̻̣̹̕o͖̗̠̜̤k͍͚̹͖̼e̦̗̪͍̪͍ ̬ͅt̕h̠͙̮͕͓e̱̜̗͙̭ ̥͔̫͙̪͍̣͝ḥi̼̦͈̼v҉̩̟͚̞͎e͈̟̻͙̦̤-m̷̘̝̱í͚̞̦̳n̝̲̯̙̮͞d̴̺̦͕̫ ̗̭̘͎͖r̞͎̜̜͖͎̫͢ep͇r̝̯̝͖͉͎̺e̴s̥e̵̖̳͉͍̩̗n̢͓̪͕̜̰̠̦t̺̞̰i͟n҉̮̦̖̟g̮͍̱̻͍̜̳ ̳c̖̮̙̣̰̠̩h̷̗͍̖͙̭͇͈a̧͎̯̹̲̺̫ó̭̞̜̣̯͕s̶̤̮̩̘.̨̻̪̖͔ ̳̭̦̭̭̦̞́I̠͍̮n͇̹̪̬v̴͖̭̗̖o̸k҉̬̤͓͚̠͍i͜n̛̩̹͉̘̹g͙ ̠̥ͅt̰͖͞h̫̼̪e̟̩̝ ̭̠̲̫͔fe̤͇̝̱e͖̮̠̹̭͖͕l͖̲̘͖̠̪i̢̖͎̮̗̯͓̩n̸̰g̙̱̘̗͚̬ͅ ͍o͍͍̩̮͢f̖͓̦̥ ̘͘c̵̫̱̗͚͓̦h͝a̝͍͍̳̣͖͉o͙̟s̤̞.̙̝̭̣̳̼͟ ̢̻͖͓̬̞̰̦W̮̲̝̼̩̝͖i͖͖͡ͅt̘̯͘h̷̬̖̞̙̰̭̳ ̭̪̕o̥̤̺̝̼̰̯͟ṳ̞̭̤t̨͚̥̗ ̟̺̫̩̤̳̩o̟̰̩̖ͅr̞̘̫̩̼d̡͍̬͎̪̺͚͔e͓͖̝̙r̰͖̲̲̻̠.̺̝̺̟͈ ̣̭T̪̩̼h̥̫̪͔̀e̫̯͜ ̨N̟e҉͔̤zp̮̭͈̟é͉͈ṛ̹̜̺̭͕d̺̪̜͇͓i̞á͕̹̣̻n͉͘ ̗͔̭͡h̲͖̣̺̺i͔̣̖̤͎̯v̠̯̘͖̭̱̯e̡̥͕-m͖̭̣̬̦͈i͖n̞̩͕̟̼̺͜d̘͉ ̯o̷͇̹͕̦f̰̱ ̝͓͉̱̪̪c͈̲̜̺h̘͚a̞͔̭̰̯̗̝o̙͍s͍͇̱͓.̵͕̰͙͈ͅ ̯̞͈̞̱̖Z̯̮̺̤̥̪̕a͏̺̗̼̬̗ḻg͢o̥̱̼.̺̜͇͡ͅ ̴͓͖̭̩͎̗ ̧̪͈̱̹̳͖͙H̵̰̤̰͕̖e̛ ͚͉̗̼̞w̶̩̥͉̮h̩̺̪̩͘ͅọ͎͉̟ ̜̩͔̦̘ͅW̪̫̩̣̲͔̳a͏͔̳͖i͖͜t͓̤̠͓͙s̘̰̩̥̙̝ͅ ̲̠̬̥Be̡̙̫̦h̰̩i̛̫͙͔̭̤̗̲n̳͞d̸ ͎̻͘T̛͇̝̲̹̠̗ͅh̫̦̝ͅe̩̫͟ ͓͖̼W͕̳͎͚̙̥ą̙l̘͚̺͔͞ͅl̳͍̙̤̤̮̳.̢ ̟̺̜̙͉Z̤̲̙̙͎̥̝A͎̣͔̙͘L̥̻̗̳̻̳̳͢G͉̖̯͓̞̩̦O̹̹̺!̙͈͎̞̬ *

into something like

T̫o̬ ì̬nv̖o͖k͍e̦ ̬t̕h̠e̱ ̥ḥi̼v҉e͈-m̷í͚n̝d̴ ̗r̞ep͇r̝e̴s̥e̵n̢t̺i͟n҉g̮ ̳c̖h̷a̧ó̭s̶.̨ ̳I̠n͇v̴o̸k҉i͜n̛g͙ ̠t̰h̫e̟ ̭fe̤e͖l͖i̢n̸g̙ ͍o͍f̖ ̘c̵h͝a̝o͙s̤.̙ ̢W̮i͖t̘h̷ ̭o̥ṳ̞t̨ ̟o̟r̞d̡e͓r̰.̺ ̣T̪h̥e̫ ̨N̟e҉zp̮é͉ṛ̹d̺i̞á͕n͉ ̗h̲i͔v̠e̡-m͖i͖n̞d̘ ̯o̷f̰ ̝c͈h̘a̞o̙s͍.̵ ̯Z̯a͏ḻg͢o̥.̺ ̴ ̧H̵e̛ ͚w̶h̩ọ͎ ̜W̪a͏i͖t͓s̘ ̲Be̡h̰i̛n̳d̸ ͎T̛h̫e̩ ͓W͕ą̙l̘l̳.̢ ̟Z̤A͎L̥G͉O̹!̙ *
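The three patterns above differ only in how many marks they keep, so they can be folded into one helper. The following is a minimal sketch, assuming PHP 7 or later and a PCRE build with Unicode property support. The function name limitCombiningMarks and its $max parameter are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (name and signature are illustrative only).
// Keeps at most $max consecutive combining marks and drops the rest of each run.
function limitCombiningMarks(string $text, int $max = 1): string
{
    if ($max < 0) {
        throw new InvalidArgumentException('$max must be zero or greater');
    }

    if ($max === 0) {
        // Strip every combining mark.
        $pattern = '/\pM+/u';
        $replacement = '';
    } else {
        // Capture the first $max marks of a longer run and discard the remainder.
        $pattern = '/(\pM{' . $max . '})\pM+/u';
        $replacement = '$1';
    }

    $result = preg_replace($pattern, $replacement, $text);

    // preg_replace() returns null on failure, for example when $text is not valid UTF-8.
    if ($result === null) {
        throw new RuntimeException('preg_replace() failed; is the input valid UTF-8?');
    }

    return $result;
}

// Usage: "Z" carries three marks and "a" carries two before sanitizing.
$zalgo = "Z\u{0351}\u{036B}\u{0343}a\u{0321}\u{034A}lgo";
echo limitCombiningMarks($zalgo, 1), PHP_EOL; // one mark left on "Z" and on "a"
echo limitCombiningMarks($zalgo, 0), PHP_EOL; // plain "Zalgo"
```

Checking for null matters in a chat, because user input that is not valid UTF-8 makes a pattern with the u modifier fail instead of returning the original string.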

Upvotes: 4

Polynomial

Reputation: 28316

If you're looking for a quick fix, I would use a regex like this:

$cleanMessage = preg_replace("/[^\x20-\xAD\x7F]/", "", $input_lines);

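Or, with preg_filter(), which takes the same arguments:

$cleanMessage = preg_filter("/[^\x20-\xAD\x7F]/", "", $input_lines);

The character class must be negated in both calls so that only the disallowed characters are removed. The two functions differ in one respect: preg_filter() returns null for a string in which nothing matched, so an already clean message comes back as null, whereas preg_replace() returns it unchanged. For this reason preg_replace() is usually the simpler choice.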

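These remove every character outside the range 0x20 to 0xAD, which covers printable ASCII and the lower part of the extended 8-bit range. This means that "normal" text and the most commonly accented Roman characters will still work, but "zalgo" style text will not. Unfortunately, the side effect is that Arabic, Japanese, Chinese, Cyrillic, etc. will also be stripped as "bad". Note that without the u modifier the patterns match byte by byte, so this assumes a single-byte encoding; on UTF-8 input they can leave partial multibyte sequences behind.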
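If the chat stores messages as UTF-8, the same whitelist idea can be written against code points instead of bytes by adding the u modifier. This is a minimal sketch, assuming PHP 7 or later. The function name keepLatin1 and the chosen ranges (printable ASCII plus the Latin-1 Supplement block) are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (name and ranges are illustrative only).
// Keeps printable ASCII (U+0020..U+007E) and Latin-1 Supplement (U+00A0..U+00FF),
// and removes every other code point, including combining marks.
function keepLatin1(string $text): string
{
    $result = preg_replace('/[^\x{20}-\x{7E}\x{A0}-\x{FF}]/u', '', $text);

    // preg_replace() returns null on failure, for example on invalid UTF-8 input.
    if ($result === null) {
        throw new RuntimeException('preg_replace() failed; is the input valid UTF-8?');
    }

    return $result;
}

// The precomposed "é" (U+00E9) survives, the combining marks do not.
echo keepLatin1("Caf\u{00E9} Z\u{0351}\u{036B}algo"), PHP_EOL;

// The side effect described above: the Cyrillic letters are removed as well.
echo keepLatin1("Hi \u{041F}\u{0440}\u{0438}\u{0432}\u{0435}\u{0442}"), PHP_EOL;
```

A letter typed as a base character plus a combining accent loses the accent under this filter, because only precomposed characters fall inside the allowed ranges.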

There's no trivial way to just prevent the kind of abuse you're seeing, because there are so many Unicode tricks you can use to apply diacritic marks to letters. It'd be a full-time job to attempt to filter them out in a way that didn't affect some language somewhere.

My non-technical advice would be to allow users to report people who post these kinds of messages, so that they can be banned by an administrator.

Upvotes: 3
