Reputation: 13
I have two text files. One contains sentences and the other contains a list of bad words. I want to find every sentence that contains a word from the bad-word list and remove that whole line, but only when the bad word stands alone, not when it is part of another word. For example, I want to remove lines containing "on" but not lines containing "onsite". Any advice?
#bad_words = ["on", "off"]
#sentences = ["Learning Python is an ongoing task", "I practice on and off", "I do it offline", "On weekdays i practice the most", "In weekends I am off"]
def clean_sentences(sentences, bad_words, outfile, badfile):
    bad_words_list = []
    with open(bad_words) as wo:
        bad_words_list = wo.readlines()
    b_lists = list(map(str.strip, bad_words_list))
    for line in b_lists:
        line = line.strip('\n')
        line = line.lower()
        bad_words_list.insert(len(bad_words_list), line)
    with open(sentences) as oldfile, open(outfile, 'w') as newfile, open(badfile, 'w') as badwords:
        for line in oldfile:
            if not any(bad_word in line for bad_word in bad_words):
                newfile.write(line)
            else:
                badwords.write(line)
clean_sentences('sentences.txt', 'bad_words.txt', 'outfile.txt', 'badfile.txt')
Upvotes: 0
Views: 733
Reputation: 380
Instead of checking whether any of the bad words is a substring of the sentence, check whether any of the bad words is in the list you get by splitting the sentence on whitespace. That way a bad word only matches when it is a separate word in the sentence, not when it is an arbitrary substring of another word.
Here is a simplified version of your code (without the file handling):
bad_words = ["on", "off"]
sentences = ["Learning Python is an ongoing task", "I practice on and off", "I do it offline", "On weekdays i practice the most", "In weekends I am off"]
def clean_sentences(sentences, bad_words):
    for sentence in sentences:
        if any(word in map(lambda s: s.lower(), sentence.split()) for word in bad_words):
            print(f'Found bad word in {sentence}')
clean_sentences(sentences, bad_words)
# output
Found bad word in I practice on and off
Found bad word in On weekdays i practice the most
Found bad word in In weekends I am off
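As for your own code, just change
    if not any(bad_word in line for bad_word in bad_words):
        newfile.write(line)
to
    if not any(bad_word in map(lambda s: s.lower(), line.split()) for bad_word in bad_words_list):
        newfile.write(line)
Note that inside your function bad_words is the file name, so the check has to iterate over bad_words_list (the words read from that file) rather than over bad_words.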
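Putting the whole-word check into the full file-based function could look roughly like the sketch below. It keeps the function name and the four file arguments from the question; reading the bad words into a set and the names bad_set, words and badlines are my own choices, not something from the original code.

```python
def clean_sentences(sentences, bad_words, outfile, badfile):
    # Read the bad words once, stripped and lowercased, skipping blank lines.
    with open(bad_words) as wo:
        bad_set = {line.strip().lower() for line in wo if line.strip()}

    with open(sentences) as oldfile, \
         open(outfile, 'w') as newfile, \
         open(badfile, 'w') as badlines:
        for line in oldfile:
            # Whole-word check: compare the lowercased, whitespace-split words
            # of the line against the set of bad words.
            words = {word.lower() for word in line.split()}
            if words & bad_set:
                badlines.write(line)
            else:
                newfile.write(line)
```

It would be called the same way as before, for example clean_sentences('sentences.txt', 'bad_words.txt', 'outfile.txt', 'badfile.txt').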
EDIT: To make the search case-insensitive, compare against the lowercased words of the sentence (assuming the bad words themselves are lowercase). I've updated the code with a map and a simple lambda function.
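One limitation worth knowing: split() only breaks on whitespace, so a bad word directly followed by punctuation ("off." or "on,") will not match. A possible alternative, which is a different technique from the split approach above, is a regular expression with word boundaries and re.IGNORECASE. This is only a sketch; the helper name has_bad_word is made up for the example.

```python
import re

bad_words = ["on", "off"]

# \b marks a word boundary, so "on" matches as a whole word but not inside
# "ongoing". re.escape guards against bad words containing regex metacharacters.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, bad_words)) + r")\b",
    re.IGNORECASE,
)

def has_bad_word(sentence):
    # Hypothetical helper: True if any bad word occurs as a standalone word.
    return pattern.search(sentence) is not None

sentences = [
    "Learning Python is an ongoing task",
    "I practice on and off.",
    "I do it offline",
    "On weekdays i practice the most",
]
clean = [s for s in sentences if not has_bad_word(s)]
print(clean)
```

Keep in mind that \b also treats a hyphen as a boundary, so "on-site" would count as containing "on". Whether that is what you want depends on your data.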
Upvotes: 1