Reputation: 63
I'm trying to build a bad-word filter in Python (non-code answers are fine, I mainly need to know which algorithm would work), and I need to know how to check whether a string contains a specific word in any of its variations.
For example, let's say my bad word array is:
['others','hello','banana']
And the string I need to check is:
Thinking alike or understanding something in a similar way with others.
For now, I'm looping over the array and checking each time whether any element exists in the phrase. But what if I want to check variations of the array elements, like 0th3rs or Oth3r5 for the first element? For now, I'm checking manually with multiple if statements, replacing a with @ and so on. That would not be good for production code, since I cannot anticipate every scenario of character replacement. So I thought of something like a mapping keyed by letter, where a key like A holds an array of its variations, and checking it dynamically against the string. But would this take too much time, since it needs to check every variation of every word? Or is this achievable and usable in a real scenario?
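As a rough illustration of the letter-to-variations idea above, one possible sketch compiles the map into a single regular expression, so the text is scanned once instead of once per variation. The VARIATIONS table and the function names are made up for this example, and the table is far from complete.

```python
import re

# Hypothetical map: each letter to the characters that may stand in for it.
VARIATIONS = {
    "a": "a@4",
    "e": "e3",
    "o": "o0",
    "s": "s5",
}

def build_pattern(bad_words):
    """Compile one regex that matches any bad word in any mapped spelling."""
    alternatives = []
    for word in bad_words:
        # Each letter becomes a character class, e.g. "o" -> "[o0]".
        classes = ["[" + re.escape(VARIATIONS.get(ch, ch)) + "]" for ch in word.lower()]
        alternatives.append("".join(classes))
    return re.compile("|".join(alternatives), re.IGNORECASE)

PATTERN = build_pattern(["others", "hello", "banana"])

def contains_bad_word(text):
    """Return True if any bad word (or mapped variation) occurs in text."""
    return PATTERN.search(text) is not None

print(contains_bad_word("Thinking alike with 0th3rs"))
```

Note that this matches substrings, so a harmless word such as "mothers" would also be flagged. Whether that is acceptable depends on the filter's requirements.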
Upvotes: 0
Views: 1002
Reputation: 3420
I cannot prevent every scenario of character replacing
That's true. However, you can handle the majority of scenarios.
I would consider declaring a mapping of replacements and their meaning:
REPLACEMENTS_DICT = {
    "@": "a",
    "4": "a",
    "3": "e",
    "0": "o",
    # ...
}
Then, before checking if a particular string is inside the bad_word_array, one should translate the string with regard to the replacement dict and then make a case-insensitive comparison:
def translate(word: str) -> str:
    # Map each look-alike character back to its letter, then lowercase.
    return "".join(REPLACEMENTS_DICT.get(c, c) for c in word).lower()

def is_bad_word(word: str) -> bool:
    return translate(word) in BAD_WORDS
Example
BAD_WORDS = ["others", "hello", "banana"]
print(is_bad_word("0th3rs")) # True
print(is_bad_word("Oth3rs")) # True
For tokenizing the text into words you can use nltk.
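import nltk

# word_tokenize needs NLTK's tokenizer data, downloaded once via nltk.download
sentence = "Thinking alike or understanding something in a similar way with others."
words = nltk.word_tokenize(sentence)
for word in words:
    # Report only the offending tokens; most words in the sentence are fine.
    if is_bad_word(word):
        print(word)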
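If pulling in nltk is not an option, a rough standard-library sketch of the same tokenize-then-translate flow could look like the following. The replacement table, the EDGE_PUNCTUATION string, and the function names are assumptions for illustration, and whitespace splitting is much cruder than a real tokenizer.

```python
# Assumed, incomplete table of look-alike characters.
REPLACEMENTS = {"@": "a", "4": "a", "3": "e", "0": "o", "5": "s"}
BAD_WORDS = {"others", "hello", "banana"}

# Punctuation stripped from the ends of a token. "@" is deliberately
# left out because it can stand in for a letter.
EDGE_PUNCTUATION = ".,!?;:\"'()"

def normalize(token):
    """Undo the mapped substitutions and lowercase the result."""
    return "".join(REPLACEMENTS.get(c, c) for c in token).lower()

def find_bad_words(text):
    """Return the tokens of text whose normalized form is a bad word."""
    found = []
    for token in text.split():
        core = token.strip(EDGE_PUNCTUATION)
        if normalize(core) in BAD_WORDS:
            found.append(core)
    return found

print(find_bad_words("Thinking alike with 0th3rs, or b@nan4!"))
```

Using a set for BAD_WORDS keeps each lookup constant-time, so the cost grows with the length of the text rather than with the number of variations.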
Upvotes: 0
Reputation: 264
Have you tried using replace()?
For example:
text = "0th3rs"  # avoid naming this "input", which would shadow the built-in
replace_pair = {'0': 'o', '3': 'e'}
for old, new in replace_pair.items():
    text = text.replace(old, new)
print(text)
This prints "others".
You still have to provide the replacement pairs, but that is better than a chain of "if" statements.
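A possible variation on the same idea, sketched here with an assumed and incomplete table, is to use the built-in str.maketrans and str.translate. They apply all single-character replacements in one pass over the string, so one replacement cannot accidentally feed into another the way chained replace() calls can.

```python
# Assumed, incomplete table of look-alike characters.
LEET_TABLE = str.maketrans({"0": "o", "3": "e", "@": "a", "4": "a", "5": "s"})

def normalize(text):
    """Replace mapped characters in a single pass, then lowercase."""
    return text.translate(LEET_TABLE).lower()

print(normalize("0th3rs"))
print(normalize("B@n4na"))
```

This only covers one-character substitutions. Multi-character tricks (for example "ph" for "f") would still need replace() or a regular expression.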
Upvotes: 1
Reputation: 1
Can't you just extend your list of bad words to include the different variations?
bad_words = ["others", "0th3rs", "banana"]
text = "this is the text about bananas and 0th3rs"
for word in bad_words:
    if word in text:
        text = text.replace(word, "*flowers*")
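If the extended list is generated rather than typed by hand, a sketch like the one below shows how quickly it grows. The VARIANTS table and the spellings helper are hypothetical names for this example, and the table is an assumption.

```python
from itertools import product

# Hypothetical map: each letter to the characters that may stand in for it.
VARIANTS = {"a": "a@4", "e": "e3", "o": "o0", "s": "s5"}

def spellings(word):
    """Return every spelling of word obtainable from the VARIANTS map."""
    pools = [VARIANTS.get(ch, ch) for ch in word]
    return {"".join(combo) for combo in product(*pools)}

print(len(spellings("others")))
print(len(spellings("banana")))
```

The number of spellings is the product of the pool sizes, so it multiplies with every substitutable letter. With this table, "others" has 2 * 2 * 2 = 8 spellings and "banana" has 3 * 3 * 3 = 27, which is why normalizing the input is usually cheaper than enumerating the list.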
Upvotes: 0