bemzoo

Reputation: 172

Iterate 2 objects with Nested Loop List Comprehension with Tokenisers

I am trying to grab a large sample of data from a corpus and establish what proportion of the tokens are stop-words.

from sussex_nltk.corpus_readers import MedlineCorpusReader
from nltk.corpus import stopwords

mcr = MedlineCorpusReader()
sample_size = 10000
stopwords = stopwords.words('english')

raw_sentences = mcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

filter_tok=[[sentence.isalpha() for sentence in sentence and sentence not in stopwords] for sentence in tokenised_sentences]

raw_vocab_size = vocabulary_size(tokenised_sentences)
filter_vocab_size = vocabulary_size(filter_tok)
print("Stopwords produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - filter_vocab_size)/raw_vocab_size,raw_vocab_size,filter_vocab_size))  

Even after tokenising my list, I still can't seem to traverse it. I believe the problem is on line 11 of the code (the filter_tok line), but I am unsure how to filter on two things at once: .isalpha() and membership in stopwords.

Upvotes: 0

Views: 27

Answers (1)

BurningKarl

Reputation: 1196

I know very little about the libraries you are using, but I know something about list comprehensions. The correct syntax is

[element for element in iterable if condition]

But you used

[element for element in iterable and condition]

So Python interpreted "iterable and condition" (in your example, "sentence and sentence not in stopwords") as one expression. For a non-empty sentence that expression evaluates to a boolean, which is not iterable, so Python raises a TypeError.
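To make the failure concrete, here is a minimal sketch using toy data. The values of stopwords and sentence below are made up for illustration and only stand in for the variables in the question.

```python
# Toy data standing in for the question's variables (hypothetical values).
stopwords = ["the", "of"]
sentence = ["the", "cell", "count"]

# With "and", everything after "in" is evaluated as a single expression.
# A non-empty list is truthy, so the result is the right-hand operand:
# the boolean from "sentence not in stopwords".
iterable = sentence and sentence not in stopwords
print(type(iterable).__name__, iterable)

# Trying to loop over that boolean is what raises the TypeError.
try:
    result = [token for token in iterable]
except TypeError as error:
    result = None
    message = str(error)
    print("TypeError:", message)
```

An empty sentence would behave differently: "and" would return the empty list itself, and the comprehension would simply produce an empty list rather than fail.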

Replace "and" with "if" and it will probably work. The nested list comprehensions are otherwise syntactically valid. I just wouldn't recommend using the same name (sentence) for both the element and the iterable, because that can lead to confusion.
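A possible shape for the corrected comprehension is sketched below. It assumes the goal is to keep the tokens themselves (rather than the True/False values that .isalpha() returns) when they are alphabetic and not stop-words. The names token and stop_words and the sample sentences are illustrative, not taken from the question's corpus.

```python
# Toy stand-ins for the question's data (hypothetical values, not Medline output).
stop_words = {"the", "of", "is"}
tokenised_sentences = [
    ["the", "size", "of", "the", "tumour", "is", "5", "mm", "."],
    ["treatment", "is", "effective"],
]

# The outer loop walks the sentences; the inner loop walks the tokens of one
# sentence. Both tests go after "if" and are joined with "and".
filter_tok = [
    [token for token in sentence if token.isalpha() and token not in stop_words]
    for sentence in tokenised_sentences
]
print(filter_tok)  # [['size', 'tumour', 'mm'], ['treatment', 'effective']]
```

Using a separate name such as stop_words also avoids rebinding the imported stopwords module, which the question's code does on the line that calls stopwords.words('english').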

Upvotes: 2
