Reputation: 550
I am using NLTK to remove stopwords from a list element. Here is my code snippet:
dict1 = {}
for ctr, row in enumerate(cur.fetchall()):
    list1 = [row[0], row[1], row[2], row[3], row[4]]
    dict1[row[0]] = list1
    print ctr+1, "\n", dict1[row[0]][2]
    list2 = [w for w in dict1[row[0]][3] if not w in stopwords.words('english')]
    print list2
The problem is that this not only removes the stopwords but also removes characters from other words, e.g. from the word 'orientation' the 'i' and other stopword characters are stripped out. Moreover, it stores characters instead of words in list2, i.e. ['O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', '3', ' ', 'r', 'e', 'r', 'e', ' ', 'p', 'n', '\n', '\n', '\n', 'O', 'r', 'e', 'n', 'n', ' ', 'f', ' ', 'n', ' ', 'r', 'e', 'r', 'e', ' ', 'r', 'p', 'l', ...] while I want it to store ['Orientation', ...].
Upvotes: 3
Views: 1579
Reputation: 812
First, your construction of list1 is a little peculiar to me. I think that there's a more pythonic solution:
list1 = list(row[:5])
Then, is there a reason you're accessing row[3] with dict1[row[0]][3], rather than row[3] directly?
Finally, assuming that row is a sequence of strings, building list2 from row[3] iterates over every character of that string, rather than every word. That is why single characters such as 'i' and 'a' (and a few others) are being stripped out.
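For example, with a made-up string standing in for row[3], you can see the difference:

text = 'Orientation of the course'   # stand-in for row[3]
print([w for w in text])             # iterates characters: ['O', 'r', 'i', 'e', ...]
print([w for w in text.split(' ')])  # iterates words: ['Orientation', 'of', 'the', 'course']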
The correct comprehension would be:
list2 = [w for w in row[3].split(' ') if w not in stopwords.words('english')]
You have to split your strings apart somehow, probably around spaces. That takes something from:
'Hello, this is row3'
to:
['Hello,', 'this', 'is', 'row3']
Iterating over that gives you full words, rather than individual characters.
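Putting it together with the loop from your question, a sketch could look like this (it assumes your existing cursor cur and column layout; the set() and the .lower() call are just optional additions for faster lookups and case-insensitive matching):

from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))  # build the list once, as a set for fast lookups
dict1 = {}
for row in cur.fetchall():
    dict1[row[0]] = list(row[:5])
    # split the text column into words before filtering out stopwords
    list2 = [w for w in row[3].split(' ') if w.lower() not in english_stopwords]
    print(list2)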
Upvotes: 0
Reputation: 2146
First, make sure that list1 is a list of words, not a list of characters. Here is a code snippet that you may be able to leverage:
from nltk import word_tokenize
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english') # get english stop words
# test document
document = '''A moody child and wildly wise
Pursued the game with joyful eyes
'''
# first tokenize your document to a list of words
words = word_tokenize(document)
print(words)
# then remove all stop words
content = [w for w in words if w.lower() not in english_stopwords]
print(content)
The output will be:
['A', 'moody', 'child', 'and', 'wildly', 'wise', 'Pursued', 'the', 'game', 'with', 'joyful', 'eyes']
['moody', 'child', 'wildly', 'wise', 'Pursued', 'game', 'joyful', 'eyes']
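Applied to your loop, a rough sketch (assuming your cursor cur and that row[3] is the text column you want filtered) would be:

from nltk import word_tokenize
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
for row in cur.fetchall():
    words = word_tokenize(row[3])  # tokenize the text column into a list of words
    content = [w for w in words if w.lower() not in english_stopwords]
    print(content)

Unlike splitting on spaces, word_tokenize also separates punctuation into its own tokens, so 'Hello,' becomes 'Hello' and ','.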
Upvotes: 3