Reputation: 137

How do I replace words (in a txt file) that match my list of strings?

I want to understand how to replace certain words in my .txt file.
- The words to replace are the strings in a list of censored words.

I was successful with a single-word replacement:

# Read the whole file; the with block closes it afterwards
with open('email.txt', 'r') as f:
    email = f.read()

def single_string_replace(email):
    return email.replace('word1', 'REDACTED')

But I could not get a list of words to work "flawlessly". This is my attempt:

with open('email.txt', 'r') as f:
    email = f.read()
banned_words = ['word1', 'phrase one']

def list_replace(email):
    # Replace each banned word in turn; the order of the list matters
    for word in banned_words:
        email = email.replace(word, 'REDACTED')
    return email

Ideally, I want to leave the .txt file unchanged and only see the changes through a print() call such as

print(list_replace(email))

The issue I am having is:

Say I ban the word dog and also ban the word hotdog. If 'dog' comes first in the list, it is replaced first, so when 'hotdog' is searched for afterwards there is nothing left to match.
This produces 'hotREDACTED' instead of 'REDACTED'.
The reverse case matters too: if I want dog banned but hotdog allowed, how can I make sure both of these cases work without kinks?
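
The behaviour described above can be reproduced in a few lines. This is only a sketch: naive_replace is a hypothetical helper name, and the sample sentence is made up for illustration.

```python
def naive_replace(text, banned_words):
    # Plain substring replacement, applied in list order
    for word in banned_words:
        text = text.replace(word, 'REDACTED')
    return text

# 'dog' first: the 'dog' inside 'hotdog' is replaced before 'hotdog' is tried
print(naive_replace('a hotdog and a dog', ['dog', 'hotdog']))
# a hotREDACTED and a REDACTED

# 'hotdog' first: both words are fully redacted
print(naive_replace('a hotdog and a dog', ['hotdog', 'dog']))
# a REDACTED and a REDACTED
```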

As always has been, is, and shall be: all suggestions are welcome!

Thank you

Upvotes: 3

Answers (3)

kederrac

Reputation: 17322

You could use re.sub with word boundaries (\b), so that only whole words are matched:

import re


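with open('email.txt', 'r') as f:
    email = f.read()
banned_words = ['word1', 'phrase one']
# \b matches only at word boundaries; re.escape guards against regex metacharacters
pattern = '|'.join(rf'\b{re.escape(w)}\b' for w in banned_words)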

def list_replace(email):
    return re.sub(pattern, 'REDACTED', email)

print(list_replace(email))
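
To see how word boundaries handle the dog/hotdog case from the question, here is a minimal sketch of the same idea. The helper names build_pattern and redact are hypothetical, and two details are assumptions rather than part of the answer above: entries are passed through re.escape, and the list is sorted longest first so that a phrase such as 'phrase one' is tried before a shorter entry such as 'phrase'.

```python
import re

def build_pattern(banned_words):
    # Longest entries first, so longer phrases win over their shorter prefixes
    ordered = sorted(banned_words, key=len, reverse=True)
    # \b only matches between a word character and a non-word character
    return re.compile('|'.join(rf'\b{re.escape(w)}\b' for w in ordered))

def redact(text, banned_words):
    if not banned_words:
        return text  # an empty pattern would match at every position
    return build_pattern(banned_words).sub('REDACTED', text)

# Only 'dog' is banned: 'hotdog' is left alone
print(redact('a hotdog and a dog', ['dog']))
# a hotdog and a REDACTED

# Both are banned: list order no longer matters
print(redact('a hotdog and a dog', ['dog', 'hotdog']))
# a REDACTED and a REDACTED
```

Note that \b relies on word characters, so an entry that starts or ends with punctuation (for example 'c++') would probably need a different boundary check.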

Upvotes: 1

SidharthMacherla

Reputation: 400

Here is a function that replaces whole words only. You can edit swlist inside the function to add or remove stop words.

Function to replace text

from nltk import word_tokenize

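def mask_word(with_sw):
    swlist = ['dog', 'cat']  # words to mask
    # word_tokenize needs the NLTK tokenizer data to be downloaded first
    tokens = word_tokenize(with_sw)
    # Swap tokens found in swlist for REDACTED; keep every other token as is
    masked = ['REDACTED' if token in swlist else token for token in tokens]
    # Joining with single spaces avoids a stray leading space in the result
    return ' '.join(masked)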

An example usage is below

text = "this is a dog and hotdog test"

print(mask_word(text))

Output looks like this:

this is a REDACTED and hotdog test
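
If installing NLTK is not an option, the same token-by-token idea can be sketched with the standard library alone. This is an illustration under assumptions, not the method from the answer above: mask_words is a hypothetical name, a "word" is taken to be any run of \w characters, and matching is made case-insensitive. Unlike joining tokens with spaces, it keeps the original spacing and punctuation.

```python
import re

def mask_words(text, banned=frozenset({'dog', 'cat'})):
    def swap(match):
        word = match.group(0)
        # Compare whole tokens only, ignoring case
        return 'REDACTED' if word.lower() in banned else word
    # re.sub calls swap once for each run of word characters
    return re.sub(r'\w+', swap, text)

print(mask_words('This is a Dog, and a hotdog test.'))
# This is a REDACTED, and a hotdog test.
```

Because each token is looked up on its own, multi-word entries such as 'phrase one' are not covered by this sketch.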

Upvotes: 1

Sohail Saha

Reputation: 573

Try it this way:

with open('email.txt') as f:
    words = f.read().split()  # list of words; split() also discards newlines
censored_words = ['ADD', 'YOUR', 'WORDS', 'HERE']

for word in words:
    if word in censored_words:
        print(word)  # print every occurrence of a censored word
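
The loop above only reports the matches. A minimal sketch of how the same split-and-compare approach could return a redacted copy instead, leaving the file itself untouched, might look like this. The function name censor_file is hypothetical, and UTF-8 encoding is an assumption.

```python
def censor_file(path, censored_words):
    # Read mode only: the file on disk is never written to
    with open(path, encoding='utf-8') as f:
        lines = f.read().splitlines()
    censored = set(censored_words)  # set lookup is faster than a list
    redacted_lines = []
    for line in lines:
        # Whole-word comparison, so 'hotdog' does not match 'dog'
        words = ['REDACTED' if w in censored else w for w in line.split()]
        redacted_lines.append(' '.join(words))
    return '\n'.join(redacted_lines)

# Example call (assumes email.txt exists in the working directory):
# print(censor_file('email.txt', ['dog']))
```

Two limitations of this sketch: runs of whitespace within a line collapse to single spaces, and a word with punctuation attached (such as 'dog,') will not match.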

Upvotes: 0
