Reputation: 87
I'm trying to write code that will process text and eventually index all of it. First I need to remove non-alphabetic characters and punctuation and convert capital letters to lowercase, then remove the stopwords.
Here is what I did so far:
from stopwords import *

def removeStopwords(wordlist, flag):
    return [w for w in wordlist if w not in flag]

def preprocessing():
    import re
    with open('44.txt', 'r', encoding='utf8') as data:
        for line in data:
            a = line.rstrip().lower()
            result = re.sub('[^a-zA-Z]', ' ', a)
            b = removeStopwords(result, stopwords)
            print(b)

if __name__ == '__main__':
    preprocessing()
Then I get all the letters broken into separate parts, like ['a'], ['w'], ['o'], ['l'], ['f'].
stopwords.py is just a list of words, like:
stopwords = ['a', 'are', 'aren t', ....]
Can somebody tell me what is going on?
Thanks for your time!
Upvotes: 0
Views: 2066
Reputation: 365935
Your first problem, as jedward's answer explains, is that, despite the misleading name wordlist, what you're passing to removeStopwords is not a list of words; it's a string, a sequence of individual characters.
If your stoplist were actually made up entirely of single words, the solution would be simple: split the string into words, then remove the ones that match the stoplist.
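That simple case can be sketched as follows (the stoplist here is just illustrative, not your real one):

```python
def remove_single_stopwords(line, stopwords):
    # Split the line into words first, then drop any word on the stoplist.
    # This only works when every stoplist entry is a single word.
    stopset = set(stopwords)
    return [w for w in line.split() if w not in stopset]

print(remove_single_stopwords('a wolf is at the door', ['a', 'are', 'the', 'is']))
# → ['wolf', 'at', 'door']
```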
Unfortunately, if you have entries like aren t in the stoplist, that isn't going to work: "These examples aren't good" will get preprocessed into "these examples aren t good", which splits into ["these", "examples", "aren", "t", "good"], and obviously none of those words matches "aren t".
The ideal solution would be to remove intra-word punctuation instead of converting it to spaces. Something like this:
result = re.sub('[^a-zA-Z]', ' ', re.sub("['_]", '', a))
Then you end up with "these examples arent good", and (assuming you write the stopword as "arent" instead of "aren t") the simple solution still works. However, this may not be appropriate for your requirements, because it's changing the rules.
So, let's say we can't do that. Then, if you want to keep things simple, you need to actually filter out subsequences, not just individual words.
So, something like this:
def removeStopwords(line, stopwords):
    result = []
    wordlist = line.split()
    i = 0
    while i < len(wordlist):
        # Try to match any (possibly multi-word) stopword starting at i.
        for stopword in stopwords:
            stopwordlist = stopword.split()
            if wordlist[i:i+len(stopwordlist)] == stopwordlist:
                i += len(stopwordlist)
                break
        else:
            # No stopword matched here; keep the word.
            result.append(wordlist[i])
            i += 1
    return ' '.join(result)
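For example, a multi-word entry like "aren t" is now skipped as a unit (the function is repeated here so the snippet runs on its own):

```python
def removeStopwords(line, stopwords):
    # Scan word by word; at each position, try to match any (possibly
    # multi-word) stopword as a subsequence and skip the whole thing.
    result = []
    wordlist = line.split()
    i = 0
    while i < len(wordlist):
        for stopword in stopwords:
            stopwordlist = stopword.split()
            if wordlist[i:i+len(stopwordlist)] == stopwordlist:
                i += len(stopwordlist)
                break
        else:
            result.append(wordlist[i])
            i += 1
    return ' '.join(result)

print(removeStopwords('these examples aren t good', ['a', 'aren t']))
# → these examples good
```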
If you need it to be faster, you need to preprocess stopwords into a better data structure, like a trie, that can be quickly scanned for matching prefixes.
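One way to sketch that (a hypothetical nested-dict trie keyed by words, with None marking the end of a stopword; names here are my own, not from the question):

```python
def build_trie(stopwords):
    # Build a nested-dict trie; each level is keyed by one word of a
    # stopword phrase, and None marks a complete stopword.
    trie = {}
    for stopword in stopwords:
        node = trie
        for word in stopword.split():
            node = node.setdefault(word, {})
        node[None] = True
    return trie

def remove_stopwords_trie(line, trie):
    words = line.split()
    result = []
    i = 0
    while i < len(words):
        # Walk the trie as far as it keeps matching, remembering where
        # the longest complete stopword ends.
        node, j, matched = trie, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if None in node:
                matched = j
        if matched is not None:
            i = matched  # skip the matched stopword phrase
        else:
            result.append(words[i])
            i += 1
    return ' '.join(result)

trie = build_trie(['a', 'aren t'])
print(remove_stopwords_trie('these examples aren t good', trie))
# → these examples good
```

This checks each stoplist phrase in a single pass over its words instead of looping over every stopword at every position.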
Upvotes: 2
Reputation: 1381
wordlist is just a string. When you do
w for w in wordlist if w not in flag
it iterates over each character of the string, which is why you are getting individual letters. Convert wordlist into a list of words before passing it to removeStopwords.
def preprocessing():
    import re
    with open('44.txt', 'r', encoding='utf8') as data:
        for line in data:
            a = line.rstrip().lower()
            result = re.sub('[^a-zA-Z]', ' ', a)
            result = result.split()  # creates a list of words
            b = removeStopwords(result, stopwords)
            print(b)
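To see the difference directly (a minimal illustration):

```python
line = 'a wolf'
print([w for w in line])          # iterating a string yields characters
# → ['a', ' ', 'w', 'o', 'l', 'f']
print([w for w in line.split()])  # iterating a list yields words
# → ['a', 'wolf']
```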
Upvotes: 2