Reputation: 75
I have an application where users can write comments, and I want to filter out insults that are disguised with special characters.
string comment = "Ðick"; // With special "Ð".
comment = Regex.Replace(comment, @"[^a-z0-9 ]", "[a-z]"); // Replace special char by "[a-z]"
Regex regex = new Regex(comment); // @"[a-z]ick"
return regex.IsMatch("dick");
When the comment is only "Ðick" the function returns true, but if the comment is "Ðick with another word" the function returns false. Why?
Upvotes: 0
Views: 427
Reputation: 111870
What you are trying to do is usually done with a canonical decomposition followed by stripping the "Combining Diacritical Marks". You can't do it with pure regex, and even with a little C# you have to map some characters by hand (for example Ð to D, or Ø to O). Other characters can be handled in a more "automated" way (for example è to e) using string.Normalize, like:
using System.Globalization; // UnicodeCategory
using System.Linq;          // Where, ToArray
using System.Text;          // NormalizationForm

string comment = "Ðè";
// Here we split (è) into U+0065 (e) U+0300 (̀)
string commentNormalized = comment.Normalize(NormalizationForm.FormD);
// Here we remove every UnicodeCategory.NonSpacingMark character,
// i.e. the diacritics such as U+0300 (̀),
// and rebuild the string. This line could be sped up a little, but
// it would be longer to write :-)
string comment2 = new string(commentNormalized.Where(x => char.GetUnicodeCategory(x) != UnicodeCategory.NonSpacingMark).ToArray());
Now comment2 is "Ðe".
This is because è has the canonical decomposition U+0065 (e) U+0300 (̀), so you can tell that è is "similar" to e, while the canonical decomposition of Ð is still U+00D0 (Ð), i.e. the same character.
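Putting the two steps together, here is a minimal sketch that first decomposes and strips the combining marks, then falls back to a hand-written table for letters that have no decomposition. The class name TextFolding, the method Fold, and the contents of ManualMap are assumptions made up for illustration; the table only covers the two letters mentioned above and would need extending for real use.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextFolding
{
    // Hypothetical hand-written table for letters that have no canonical
    // decomposition. Extend it as new cases turn up.
    private static readonly Dictionary<char, char> ManualMap = new Dictionary<char, char>
    {
        { '\u00D0', 'D' }, // Ð
        { '\u00F0', 'd' }, // ð
        { '\u00D8', 'O' }, // Ø
        { '\u00F8', 'o' }, // ø
    };

    public static string Fold(string input)
    {
        // Canonical decomposition: è becomes U+0065 (e) followed by U+0300.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining diacritical marks.
            if (char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Use the manual table for characters such as Ð, otherwise keep the char.
            result.Append(ManualMap.TryGetValue(c, out char mapped) ? mapped : c);
        }

        return result.ToString();
    }
}
```

With this sketch, TextFolding.Fold("Ðè") gives "De": the è is handled by the normalization step and the Ð by the manual table.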
What you are trying to do is futile: when you ban a character, users will find another "similar" character... have you ever heard of Leet? Is D1ck (1 instead of i) better than your word? :-)
It is normally better to keep a dictionary of "banned words" that contains both Dork and Ðork, and when you find a new variant of an "offensive" word, you simply add it. Human imagination is infinite... so must be your dictionary :-) but one word at a time.
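A minimal sketch of such a dictionary check might look like the following. The names CommentFilter, BannedWords and ContainsBannedWord are hypothetical, and the entries are placeholders. Note that it tests each word of the comment against the list instead of building a pattern out of the comment, so a banned word is still found when it is followed by other words.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentFilter
{
    // Placeholder entries: add each new variant here as you come across it.
    private static readonly HashSet<string> BannedWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "Dork",
            "\u00D0ork", // Ðork
            "D0rk",      // a Leet-style variant, added by hand like any other
        };

    public static bool ContainsBannedWord(string comment)
    {
        // Split on anything that is not a letter or a digit,
        // so a Leet spelling such as "D0rk" stays a single token.
        foreach (string word in Regex.Split(comment, @"[^\p{L}\p{N}]+"))
        {
            if (word.Length > 0 && BannedWords.Contains(word))
                return true;
        }

        return false;
    }
}
```

For example, CommentFilter.ContainsBannedWord("Ðork with another word") returns true, while a comment with no listed word returns false.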
Upvotes: 2