Reputation: 923
How can I use a Java regex to match a banned word? For example, if I wanted to ban the word stackoverflow, it should match stackoverflow, s t a c k o v e r f l o w and s-t-a-c-k-o-v-e-r-f-l-o-w.
The purpose of this is to stop people from saying banned words in chat. The regex must also work when there is anything on either side of the word. For example, "Go to stackoverflow, it's a good website" should detect stackoverflow.
Upvotes: 1
Views: 304
Reputation: 2364
Though you asked specifically for a regex-based answer, regex doesn't always scale to what is needed, especially when handling erratic human input.
There are a few string-similarity algorithms which, when combined with a simple preliminary phase like Fairmutex's answer, can provide a much more comprehensive ban filter.
One popular algorithm is Levenshtein distance. While it is fast, it depends heavily on the order of words, so searching for "Stack Overflow" in an input of "Overflow Stack" will give you a negative.
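To make the word-order point concrete, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic-programming table. The class and method names are hypothetical, chosen only for illustration.

```java
final class Levenshtein {
    /** Returns the number of single-character insertions, deletions and substitutions needed to turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap the rows so prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Calling `Levenshtein.distance("stack overflow", "overflow stack")` returns a large distance even though both inputs contain the same two words, which is the weakness described above.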
For previous projects of mine I've used this clever algorithm (Strike-a-Match), which takes the word-order predicament into account. While it is a bit heavier, it does the job better than regex and Levenshtein distance.
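As a rough sketch of how such an order-tolerant comparison can work, the code below assumes the algorithm referred to is the adjacent-letter-pair comparison known as Strike-a-Match: each word is broken into two-character pairs, and the score is twice the number of shared pairs divided by the total number of pairs. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PairSimilarity {
    /** Collects adjacent letter pairs from each whitespace-separated word, e.g. "stack" gives st, ta, ac, ck. */
    static List<String> letterPairs(String text) {
        List<String> pairs = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    /** Returns a score from 0.0 (no shared pairs) to 1.0 (same multiset of pairs). */
    static double similarity(String a, String b) {
        List<String> pairsA = letterPairs(a);
        List<String> pairsB = letterPairs(b);
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            return 0.0; // neither input has a word of two or more characters
        }
        int shared = 0;
        for (String pair : pairsA) {
            // Remove each matched pair so a repeated pair is only counted as often as it occurs in both.
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }
}
```

Because pairs are collected per word, `PairSimilarity.similarity("Stack Overflow", "Overflow Stack")` scores 1.0: the two inputs produce exactly the same pairs regardless of word order.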
Another idea would be to run the input through the Strike-a-Match algorithm I linked earlier and, if the input falls above a specific similarity threshold (say, a 50%+ match), run it through a general leet filter. This would work by replacing commonly used leet speak. For example, "|\|" would get replaced with "n", regardless of spacing.
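A leet filter of this kind could be sketched as an ordered substitution table applied after whitespace is removed. The table below is a hypothetical, deliberately tiny sample and not a complete mapping; longer sequences are listed first so that "|\|" is handled before the single "|".

```java
import java.util.Locale;

final class LeetNormalizer {
    // Hypothetical sample table; a real filter would need many more entries.
    // Order matters: multi-character sequences must come before their own fragments.
    private static final String[][] SUBSTITUTIONS = {
        {"|\\|", "n"},
        {"/\\", "a"},
        {"><", "x"},
        {"c|", "d"},
        {"|", "l"},
        {"3", "e"},
        {"0", "o"},
        {"4", "a"},
        {"$", "s"},
    };

    /** Lowercases the input, drops all whitespace, then applies each substitution literally and in order. */
    static String normalize(String input) {
        String result = input.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
        for (String[] rule : SUBSTITUTIONS) {
            // String.replace treats both arguments as literal text, so no regex escaping is needed.
            result = result.replace(rule[0], rule[1]);
        }
        return result;
    }
}
```

Removing whitespace is what makes "| \ |" and "|\|" equivalent, but it also joins neighbouring words, so this is probably best applied to a candidate fragment that already passed the similarity threshold and not to a whole chat message.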
Upvotes: 0
Reputation: 722
What you can do is strip all non-alphanumeric characters and then match against your banned words. This will not completely eliminate the chance of foul words reaching your audience, though. For example, people can use leet, which the human cognitive system can understand anyway: the word "Long" can be written as "|0ng". I will not use real examples, so as to keep it clean; for example, "Alexander" can be "/\ | 3 >< /\ |\| c| 3 r".
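A minimal sketch of the strip-then-match idea, plus a regex variant that tolerates separators between letters, might look like this. The class name, the method names and the one-word banned list are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class BannedWordFilter {
    // Hypothetical banned list, using the word from the question.
    private static final String[] BANNED = {"stackoverflow"};

    /** Lowercases the message and drops every character that is not an ASCII letter or digit. */
    static String normalize(String message) {
        return message.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Returns true if any banned word appears in the normalized message. */
    static boolean containsBanned(String message) {
        String normalized = normalize(message);
        for (String word : BANNED) {
            if (normalized.contains(word)) {
                return true;
            }
        }
        return false;
    }

    /** Builds a case-insensitive pattern that allows any run of non-alphanumerics between letters. */
    static Pattern separatorTolerant(String word) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            if (i > 0) {
                regex.append("[^a-zA-Z0-9]*");
            }
            // Quote each character so a banned word containing regex metacharacters stays literal.
            regex.append(Pattern.quote(String.valueOf(word.charAt(i))));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }
}
```

Both `BannedWordFilter.containsBanned("s-t-a-c-k-o-v-e-r-f-l-o-w")` and `BannedWordFilter.separatorTolerant("stackoverflow").matcher(text).find()` catch the spaced and hyphenated forms from the question. Neither handles leet substitutions, and because spaces are ignored, a banned word can also be "found" across the boundary of two innocent words.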
Upvotes: 1