Reputation: 87
I'm trying to write code that will process text and eventually index all of it. First I need to remove non-alphabetic characters and punctuation and convert capital letters to lowercase, then remove the stopwords.
Here is what I did so far:
from stopwords import *

def removeStopwords(wordlist, flag):
    return [w for w in wordlist if w not in flag]

def preprocessing():
    import re
    with open('44.txt', 'r', encoding='utf8') as data:
        for line in data:
            a = line.rstrip().lower()
            result = re.sub('[^a-zA-Z]', ' ', a)
            b = removeStopwords(result, stopwords)
            print(b)

if __name__ == '__main__':
    preprocessing()
Then I get all the letters broken into separate parts, like ['a'], ['w'], ['o'], ['l'], ['f'].
stopwords.py is just a list of words, like:
stopwords = ['a', 'are', 'aren t', ....]
Can somebody tell me what is going on?
Thanks for your time!
Upvotes: 0
Views: 2066
Reputation: 365935
Your first problem, as jedward's answer explains, is that, despite the misleading name wordlist, what you're passing to removeStopwords is not a list of words; it's a string, a sequence of individual characters.
If your stoplist were actually made up entirely of single words, the solution would be simple: split the string into words, then remove the ones that match the stoplist.
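That simple case can be sketched as follows (the stoplist here is just illustrative, not your real one):

```python
def remove_single_stopwords(line, stopwords):
    # Split the line into words first, then drop any word on the stoplist.
    # This only works when every stoplist entry is a single word.
    stopset = set(stopwords)
    return [w for w in line.split() if w not in stopset]

print(remove_single_stopwords('a wolf is at the door', ['a', 'are', 'the', 'is']))
# → ['wolf', 'at', 'door']
```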
Unfortunately, if you have entries like aren t in the stoplist, that isn't going to work: "These examples aren't good" will get preprocessed into "these examples aren t good", which splits into ["these", "examples", "aren", "t", "good"], and obviously none of those words matches "aren t".
The ideal solution would be to remove intra-word punctuation instead of converting it to spaces. Something like this:
result = re.sub('[^a-zA-Z]', ' ', re.sub("['_]", '', a))
Then you end up with "these examples arent good", and (assuming you write the stopword as "arent" instead of "aren t") the simple solution still works. However, this may not be appropriate for your requirements, because it's changing the rules.
So, let's say we can't do that. Then, if you want to keep things simple, you need to actually filter out subsequences, not just individual words.
So, something like this:
def removeStopwords(line, stopwords):
    result = []
    wordlist = line.split()
    i = 0
    while i < len(wordlist):
        # Try to match any (possibly multi-word) stopword starting at i.
        for stopword in stopwords:
            stopwordlist = stopword.split()
            if wordlist[i:i+len(stopwordlist)] == stopwordlist:
                i += len(stopwordlist)
                break
        else:
            # No stopword matched here; keep the word.
            result.append(wordlist[i])
            i += 1
    return ' '.join(result)
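For example, a multi-word entry like "aren t" is now skipped as a unit (the function is repeated here so the snippet runs on its own):

```python
def removeStopwords(line, stopwords):
    # Scan word by word; at each position, try to match any (possibly
    # multi-word) stopword as a subsequence and skip the whole thing.
    result = []
    wordlist = line.split()
    i = 0
    while i < len(wordlist):
        for stopword in stopwords:
            stopwordlist = stopword.split()
            if wordlist[i:i+len(stopwordlist)] == stopwordlist:
                i += len(stopwordlist)
                break
        else:
            result.append(wordlist[i])
            i += 1
    return ' '.join(result)

print(removeStopwords('these examples aren t good', ['a', 'aren t']))
# → these examples good
```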
If you need it to be faster, you need to preprocess stopwords into a better data structure, like a trie, that can be quickly scanned for matching prefixes.
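One way to sketch that (a hypothetical nested-dict trie keyed by words, with None marking the end of a stopword; names here are my own, not from the question):

```python
def build_trie(stopwords):
    # Build a nested-dict trie; each level is keyed by one word of a
    # stopword phrase, and None marks a complete stopword.
    trie = {}
    for stopword in stopwords:
        node = trie
        for word in stopword.split():
            node = node.setdefault(word, {})
        node[None] = True
    return trie

def remove_stopwords_trie(line, trie):
    words = line.split()
    result = []
    i = 0
    while i < len(words):
        # Walk the trie as far as it keeps matching, remembering where
        # the longest complete stopword ends.
        node, j, matched = trie, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if None in node:
                matched = j
        if matched is not None:
            i = matched  # skip the matched stopword phrase
        else:
            result.append(words[i])
            i += 1
    return ' '.join(result)

trie = build_trie(['a', 'aren t'])
print(remove_stopwords_trie('these examples aren t good', trie))
# → these examples good
```

This checks each stoplist phrase in a single pass over its words instead of looping over every stopword at every position.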
Upvotes: 2
Reputation: 1381
wordlist is just a string. When you do
w for w in wordlist if w not in flag
it iterates over each character of the string, which is why you are getting individual letters. Convert wordlist into a list of words before passing it to removeStopwords.
def preprocessing():
    import re
    with open('44.txt', 'r', encoding='utf8') as data:
        for line in data:
            a = line.rstrip().lower()
            result = re.sub('[^a-zA-Z]', ' ', a)
            result = result.split()  # creates a list of words
            b = removeStopwords(result, stopwords)
            print(b)
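To see the difference directly (a minimal illustration):

```python
line = 'a wolf'
print([w for w in line])          # iterating a string yields characters
# → ['a', ' ', 'w', 'o', 'l', 'f']
print([w for w in line.split()])  # iterating a list yields words
# → ['a', 'wolf']
```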
Upvotes: 2