Reputation: 43

Filter foreign stopwords in text file

I have a list of movie names in English and several foreign languages compiled in a text file, with each name on its own line:

Kein Pardon
Kein Platz für Gerold
Kein Sex ist auch keine Lösung
Keine Angst Liebling, ich pass schon auf
Keiner hat das Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
La Prima Donna
La Primeriza
La Prison De Saint-Clothaire
La Puppe
La Pájara
La Pérgola de las Flores

I have compiled a short list of common non-English stopwords that I would like to filter out of the text file, e.g. La, de, las, das. How can I read the text, remove those words, and then write the filtered list to a new text file in the original format? The desired output should look roughly like this:

Kein Pardon
Kein Platz für Gerold
Kein Sex keine Lösung
Keine Angst Liebling, pass schon
Keiner hat Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
Prima Donna
Primeriza
Prison Saint-Clothaire
Puppe
Pájara
Pérgola Flores

To clarify, I know I could use the NLTK library, which has a more comprehensive list of stopwords, but I'm looking for an alternative that targets only a few selected words from my own list.

Upvotes: 0

Answers (3)

user3885927

Reputation: 3503

You can use the re module (https://docs.python.org/2/library/re.html#re.sub ) to replace the unwanted words with empty strings. Something like this should work:

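    import re

    # Put the unwanted words here, separated by '|'. For a long list, keep the
    # words in another data structure and build this match string from it.
    # \b restricts each match to whole words, so 'vas' does not match inside 'canvas'.
    unDesiredText = r'\b(?:abc|bcd|vas)\b'

    # Set inputFile and outputFile to your own paths (placeholder names here).
    inputFile = 'in.txt'
    outputFile = 'out.txt'

    fhIn = open(inputFile, 'r')
    fhOut = open(outputFile, 'w')

    for line in fhIn:
        line = re.sub(unDesiredText, '', line)
        fhOut.write(line)

    fhIn.close()
    fhOut.close()  # the parentheses are needed to actually close the file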
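
For a longer word list, the match string mentioned in the comments above could be assembled from a Python list. This is only a sketch: `build_stopword_pattern` is a hypothetical helper name, the sample words are taken from the question, and the `\b` word boundaries are an added assumption so that `la` is not cut out of the middle of a word such as `Platz`.

```python
import re

def build_stopword_pattern(words):
    """Compile one regex that matches any of `words` as a whole word."""
    # re.escape guards against words that contain regex metacharacters.
    alternatives = '|'.join(re.escape(w) for w in words)
    # \b anchors each match at word boundaries; IGNORECASE matches 'La' and 'la'.
    return re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)

pattern = build_stopword_pattern(['la', 'de', 'las', 'das'])
print(pattern.sub('', 'La Prison De Saint-Clothaire'))
```

The removed words leave their surrounding spaces behind, so the result still needs its whitespace tidied before it is written out.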

Upvotes: 1

Raiyan

Reputation: 1697

Another approach, which uses with blocks so that both files are closed even if an error occurs, ignores case, and tidies up the leftover whitespace:

import re

stop_words = ['de', 'la', 'el']
# \b limits matches to whole words, so 'la' is not removed from inside 'Platz'
pattern = r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b'
prog = re.compile(pattern, re.IGNORECASE)  # re.IGNORECASE to catch both 'La' and 'la'

input_file_location = 'in.txt'
output_file_location = 'out.txt'

with open(input_file_location, 'r') as fin:
    with open(output_file_location, 'w') as fout:
        for line in fin:
            m = prog.sub('', line.strip())  # strip() removes leading/trailing whitespace
            m = re.sub(' +', ' ', m)  # collapse runs of spaces left by removed words
            fout.write('%s\n' % m.strip())
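
Because the titles contain accented characters, it may help to state the file encoding explicitly when opening the files. The sketch below assumes Python 3 and UTF-8 files; `filter_file`, its parameters, and the encoding are assumptions rather than anything stated in the question, so set `encoding` to whatever the file actually uses.

```python
import re

def filter_file(src, dst, stop_words, encoding='utf-8'):
    """Copy src to dst line by line, dropping whole-word stopwords."""
    pattern = re.compile(
        r'\b(?:' + '|'.join(map(re.escape, stop_words)) + r')\b',
        re.IGNORECASE,
    )
    with open(src, 'r', encoding=encoding) as fin, \
         open(dst, 'w', encoding=encoding) as fout:
        for line in fin:
            cleaned = pattern.sub('', line)
            # Collapse the gaps left behind and trim both ends of the line.
            cleaned = ' '.join(cleaned.split())
            fout.write(cleaned + '\n')
```

A call such as `filter_file('in.txt', 'out.txt', ['la', 'de', 'las', 'das'])` would then write one filtered title per line.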

Upvotes: 1

airpierre

Reputation: 155

Read in the file:

with open('file', 'r') as f:
    inText = f.read()

Next, write something that takes a string you don't want and removes it from the text. You can do this on the whole text at once rather than line by line. Since the same text is modified repeatedly, I'd wrap it in a class:

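class changeText(object):
    def __init__(self, text):
        self.text = text

    def erase(self, badText):
        # str.replace returns a new string, so the result must be stored back.
        # Note: this also removes badText where it occurs inside longer words.
        self.text = self.text.replace(badText, '')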

However, when you replace a word with nothing, double spaces appear, as well as \n followed by a space, so add a method to clean up the resulting text:

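    def cleanup(self):
        # Store each result back; repeat until no double spaces remain
        while '  ' in self.text:
            self.text = self.text.replace('  ', ' ')
        self.text = self.text.replace('\n ', '\n')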

Initialize object:

textObj = changeText( inText )

Then iterate through list of bad words and clean up:

badWords = ['La', 'las', 'das']  # example words; use your own list
for bw in badWords:
    textObj.erase(bw)
textObj.cleanup()

Lastly, write it:

with open('newfile', 'w') as f:  # 'w', since the file is opened for writing
    f.write(textObj.text)
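
As a regex-free variation on the same whole-text idea, each line can be split into words and compared against a set, which avoids removing letters from inside longer words. This is a sketch only: `remove_stopwords` and its punctuation handling are assumptions, and the sample titles come from the question.

```python
import string

def remove_stopwords(text, bad_words):
    """Drop whole words found in bad_words, keeping one title per line."""
    bad = {w.lower() for w in bad_words}
    kept_lines = []
    for line in text.splitlines():
        # Compare each token without surrounding punctuation, ignoring case.
        kept = [tok for tok in line.split()
                if tok.strip(string.punctuation).lower() not in bad]
        kept_lines.append(' '.join(kept))
    return '\n'.join(kept_lines)

sample = 'La Prima Donna\nKeine Angst Liebling, ich pass schon auf'
print(remove_stopwords(sample, ['la', 'ich', 'auf']))
```

Joining the kept tokens with single spaces means no separate cleanup pass is needed, though a stopword written with attached punctuation (for example `auf,`) is dropped together with that punctuation.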

Upvotes: 0
