Will

Reputation: 381

Iterating over tokens within lists within lists using for-loops in python (SpaCy)

I'm relatively new to this, so I may be making a basic mistake, but as I understand it, you iterate over the tokens in a list of lists in Python like this:

for each_list in full_list:
  for each_token in each_list:
    pass  # do whatever you want to do with each_token

However, when using SpaCy, it seems like the first for-loop is iterating over the tokens rather than the lists.

So the code:

for eachlist in alice:
  if len(eachlist) > 5:
    print(eachlist)

(where alice is a list of lists and each list is a sentence containing tokenized words)

actually prints each word that is longer than 5 letters rather than each sentence that is longer than 5 words (which is what it should do if it were really the "first level" for-loop).

And the code:

newalice = []
for eachlist in alice:
  for eachword in eachlist:
    # make a new list of lists where each list contains only words that are
    # classified as nouns, adjectives, or verbs (with a few more specific stipulations)
    if eachword.pos_ in ('NOUN', 'VERB', 'ADJ') and eachword.dep_ not in ('aux', 'conj'):
        newalice.append([eachword])

raises the error "TypeError: 'spacy.tokens.token.Token' object is not iterable".

The reason I want to do this with nested for-loops is that I want newalice to be a list of lists: I still want to be able to iterate over the sentences, I just want to get rid of the words I don't care about.

I don't know if I'm making some really basic error in my code, or if SpaCy is doing something weird, but either way I'd really appreciate any help on how to iterate over items in a list-in-a-list in SpaCy while keeping the integrity of the original lists.

Upvotes: 2

Views: 3235

Answers (1)

gdaras

Reputation: 10119

Below is the code for iterating over elements of nested lists:

list_inst = [ ["this", " ", "is", " ", "a", " ", "sentence"], ["another", " ", "one"]]
for sentence in list_inst:
    for token in sentence:
        print(token, end="")
    print("")

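I think your misunderstanding comes from the fact that spaCy does not store each sentence in a list but in a Doc object. A Doc is iterable and yields its tokens, but it also carries some extra information. If alice is itself a single Doc rather than a list of them, the outer loop already yields tokens, which would explain both the filtering by word length and the TypeError.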
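To make that concrete, here is a minimal sketch of what looping over a Doc gives you and how to get back to sentences. It assumes spaCy v3 and uses a blank English pipeline with the rule-based sentencizer, so no trained model is needed; the sample text is my own and not from your data.

```python
import spacy

# Assumption: spaCy v3. A blank English pipeline needs no model download.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter

doc = nlp("Alice was beginning to get very tired. She peeped into the book.")

# Looping over (or indexing) a Doc yields Token objects, not sentences.
first = doc[0]
print(type(first).__name__, len(first))  # len() of a Token counts characters

# doc.sents yields one Span per sentence; list(span) gives that sentence's tokens.
sentences = [list(sent) for sent in doc.sents]
for sentence in sentences:
    if len(sentence) > 5:  # here len() counts the tokens in the sentence
        print(sentence)
```

With a list of lists built this way, the outer loop is over sentences and the inner loop is over tokens, which is the structure the question expects.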

Example code:

# iterate over sentences after spaCy preprocessing
import spacy
nlp = spacy.load('en_core_web_sm')
doc1 = nlp("this is a sentence")
doc2 = nlp("another one")
list_inst = [doc1, doc2]
for doc in list_inst:
    for token in doc:
        print(token, end=" ")
    print("")

The outputs are identical, apart from a trailing space at the end of each line in the second one.
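For the filtering part of the question, a possible sketch is below. Note that appending [eachword] inside the inner loop would give one single-word list per kept word, so the sentence grouping would be lost; building one inner list per sentence keeps it. To stay runnable without a trained model, the example uses a hypothetical FakeToken stand-in with hand-assigned tags. The names FakeToken, keep, and filter_sentences are mine, and the same function should work on real spaCy tokens, since it only reads pos_ and dep_.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the attributes used below.
FakeToken = namedtuple("FakeToken", ["text", "pos_", "dep_"])

KEEP_POS = {"NOUN", "VERB", "ADJ"}
DROP_DEP = {"aux", "conj"}


def keep(token):
    """Same condition as in the question: wanted POS, unwanted dependencies removed."""
    return token.pos_ in KEEP_POS and token.dep_ not in DROP_DEP


def filter_sentences(sentences):
    """Return a new list of lists, one inner list per original sentence."""
    return [[token for token in sentence if keep(token)] for sentence in sentences]


# Hand-assigned tags for illustration only, not real spaCy output.
alice = [
    [FakeToken("Alice", "PROPN", "nsubj"),
     FakeToken("was", "AUX", "aux"),
     FakeToken("tired", "ADJ", "acomp")],
    [FakeToken("She", "PRON", "nsubj"),
     FakeToken("saw", "VERB", "ROOT"),
     FakeToken("rabbit", "NOUN", "dobj")],
]

newalice = filter_sentences(alice)
for sentence in newalice:
    print([token.text for token in sentence])
```

A sentence with no matching words ends up as an empty list, so newalice keeps the same length and order as alice.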

Hope it helps!

Upvotes: 3
