Reputation: 172
I am trying to grab a large sample of data from a corpus and establish what proportion of the tokens are stop-words.
from sussex_nltk.corpus_readers import MedlineCorpusReader
from nltk.corpus import stopwords
from nltk import word_tokenize
mcr = MedlineCorpusReader()
sample_size = 10000
stopwords = stopwords.words('english')
raw_sentences = mcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
filter_tok=[[sentence.isalpha() for sentence in sentence and sentence not in stopwords] for sentence in tokenised_sentences]
raw_vocab_size = vocabulary_size(tokenised_sentences)
filter_vocab_size = vocabulary_size(filter_tok)
print("Stopwords produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
100*(raw_vocab_size - filter_vocab_size)/raw_vocab_size,raw_vocab_size,filter_vocab_size))
Even after I tokenise my list, I still can't seem to traverse it. I believe the problem is rooted in the filter_tok line, though I am unsure how to filter on two different conditions at once, both .isalpha() and the stopwords check.
Upvotes: 0
Views: 27
Reputation: 1196
I know very little about the libraries you are using, but I know something about list comprehensions. The correct syntax is
[element for element in iterable if condition]
But you used
[element for element in iterable and condition]
So Python interpreted iterable and condition (or in your example, sentence and sentence not in stopwords) as one expression. The result is a boolean, not an iterable, so it throws a TypeError.
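Here is a minimal way to reproduce it, independent of the corpus libraries (the variable names are just for illustration):
sentence = ["the", "cat", "sat"]
stopwords = ["the", "a", "an"]
# sentence and sentence not in stopwords is evaluated as one expression:
# a non-empty list and-ed with a boolean gives the boolean True,
# so the comprehension tries to iterate over True and raises
# TypeError: 'bool' object is not iterable
broken = [w.isalpha() for w in sentence and sentence not in stopwords]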
Just replace and with if and it will probably work. The nested list comprehensions are otherwise correct. I just wouldn't recommend having the same name for the element and the iterable (sentence), because that can lead to confusion.
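For example, the filtering line could look something like this (I'm renaming the inner variable to token, and I'm assuming you actually want to keep the surviving tokens rather than booleans, so the .isalpha() call moves into the condition):
filter_tok = [[token for token in sentence if token.isalpha() and token not in stopwords]
              for sentence in tokenised_sentences]
That keeps each sentence as a list of alphabetic, non-stopword tokens, which is presumably what vocabulary_size expects.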
Upvotes: 2