user213717

Reputation: 11

Removing stop words from tokenized text using NLTK: TypeError

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import PunktSentenceTokenizer
from nltk.stem import WordNetLemmatizer
import re
import time

txt = input()

snt_tkn = sent_tokenize(txt)

wrd_tkn = [word_tokenize(s) for s in snt_tkn]

stp_wrd = set(stopwords.words("english"))

flt_snt = [w for w in wrd_tkn if not w in stp_wrd]

print(flt_snt)

returns the following:

Traceback (most recent call last):
  File "compiler.py", line 19, in 
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
  File "compiler.py", line 19, in 
    flt_snt = [w for w in wrd_tkn if not w in stp_wrd]
TypeError: unhashable type: 'list'

I'd like to know, if possible, how to return the tokenized text with stop words removed without editing wrd_tkn.

Upvotes: 1

Views: 375

Answers (2)

The error says that a list is unhashable. Lists are not hashable because they are mutable, so they cannot be used in a set membership test. Convert the list to a tuple, which is immutable and therefore hashable, using the tuple() constructor:

immutable_list = tuple(some_list)
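
As a quick illustration (the sample tokens below are made up, not taken from the question): checking a list against a set raises the same TypeError, while the tuple conversion does not.

stp_wrd = {"the", "is"}
tokens = ["the", "cat", "is", "here"]

# tokens in stp_wrd             # would raise TypeError: unhashable type: 'list'
print(tuple(tokens) in stp_wrd)  # prints False - the tuple is hashable, so the test runs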

Upvotes: 2

user213717

Reputation: 11

For future reference, the resolution is the following:

change

flt_snt = [w for w in wrd_tkn if not w in stp_wrd]

to

flt_snt = [[w for w in s if w not in stp_wrd] for s in wrd_tkn]
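
Put together, a minimal runnable sketch of the fixed pipeline (the sample sentence is made up for illustration; the original code reads the text from input()):

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

txt = "This is a sample sentence. It shows the stop words being removed."

snt_tkn = sent_tokenize(txt)
wrd_tkn = [word_tokenize(s) for s in snt_tkn]
stp_wrd = set(stopwords.words("english"))

# filter each sentence's token list separately; wrd_tkn itself is left untouched
flt_snt = [[w for w in s if w not in stp_wrd] for s in wrd_tkn]

print(flt_snt)

Note that NLTK's English stop word list is lowercase, so capitalised tokens such as "This" will pass through unless the text is lowercased first.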

Upvotes: 0
