Devin Lee

Reputation: 573

Nested List Iteration

I was preprocessing a nested list before training a small word2vec model and encountered the following issue:

corpus = ['he is a brave king', 'she is a kind queen', 'he is a young boy', 'she is a gentle girl']

corpus = [_.split(' ') for _ in corpus]

[['he', 'is', 'a', 'brave', 'king'], ['she', 'is', 'a', 'kind', 'queen'], ['he', 'is', 'a', 'young', 'boy'], ['she', 'is', 'a', 'gentle', 'girl']]

The output above is a nested list, and I intended to remove the stopwords, e.g. 'is' and 'a'.

for _ in range(0, len(corpus)):
     for x in corpus[_]:
         if x == 'is' or x == 'a':
             corpus[_].remove(x)

[['he', 'a', 'brave', 'king'], ['she', 'a', 'kind', 'queen'], ['he', 'a', 'young', 'boy'], ['she', 'a', 'gentle', 'girl']]

The output seems to indicate that the loop skipped to the next sub-list after removing 'is' from each sub-list, instead of iterating over the whole sub-list.

What is the reason for this? Is it index-related? If so, how can I resolve it, assuming I'd like to retain the nested structure?

Upvotes: 1

Views: 130

Answers (3)

iGian

Reputation: 11183

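Maybe you can define a custom generator function that rejects elements matching a certain condition, in the style of the itertools functions. The closest standard-library equivalent is itertools.filterfalse; note that itertools.dropwhile only drops leading elements, so it would not fit here.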

def reject_if(predicate, iterable):
  for element in iterable:
    if not predicate(element):
      yield element

Once it is in place, you can use it like this:

stopwords = ['is', 'and', 'a']
[ list(reject_if(lambda x: x in stopwords, ary)) for ary in corpus ]
#=> [['he', 'brave', 'king'], ['she', 'kind', 'queen'], ['he', 'young', 'boy'], ['she', 'gentle', 'girl']]
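
For comparison, here is a minimal sketch of the same filtering done with itertools.filterfalse from the standard library, which behaves like reject_if above. The names filtered and sentence are illustrative, and using a set for stopwords is my own choice (faster membership tests), not something the question requires.

```python
from itertools import filterfalse

corpus = [['he', 'is', 'a', 'brave', 'king'],
          ['she', 'is', 'a', 'kind', 'queen']]
stopwords = {'is', 'and', 'a'}  # assumption: a set instead of the list used above

# filterfalse keeps the elements for which the predicate returns False,
# so every word that is NOT a stopword survives.
filtered = [list(filterfalse(lambda w: w in stopwords, sentence))
            for sentence in corpus]

print(filtered)
```

This builds new sub-lists and leaves corpus itself unchanged, so the nested structure is kept without mutating anything during iteration.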

Upvotes: 1

Mohit Singh

Reputation: 1

nested = [input()]

nested = [i.split() for i in nested]

Upvotes: 0

Sheldore

Reputation: 39042

Your code is correct except for one minor change: iterate over a copy of each sub-list, made with [:], rather than over the list you are modifying. Specifically, lst_copy = lst[:] creates a shallow copy of lst. This is one of several ways to copy a list (see here for a comprehensive overview). When you iterate over the original list and remove items from it at the same time, the remaining items shift left while the loop's internal index keeps advancing, so the item right after each removed one is skipped. That is the problem you observe: 'a' moves into the slot that 'is' occupied and is never visited.

for _ in range(0, len(corpus)):
     for x in corpus[_][:]: # <--- create a copy of the list using [:]
         if x == 'is' or x == 'a':
             corpus[_].remove(x)

OUTPUT

[['he', 'brave', 'king'],
 ['she', 'kind', 'queen'],
 ['he', 'young', 'boy'],
 ['she', 'gentle', 'girl']]
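
To make the skipping visible, here is a small sketch that records which items the loop actually visits on a single sub-list, followed by an alternative fix that rebuilds each sub-list in place with slice assignment instead of copying and removing. The names words, visited, sentence and stopwords are illustrative and not from the question.

```python
# Part 1: reproduce the skip on one sub-list.
words = ['he', 'is', 'a', 'brave', 'king']
visited = []  # (index, word) pairs the loop actually sees
for index, word in enumerate(words):
    visited.append((index, word))
    if word in ('is', 'a'):
        words.remove(word)  # every later item shifts left by one

print(visited)  # 'a' is never visited: it slid into index 1 after the loop moved on
print(words)    # so 'a' is still in the list

# Part 2: rebuild each sub-list in place, with no removal during iteration.
corpus = [['he', 'is', 'a', 'brave', 'king'],
          ['she', 'is', 'a', 'kind', 'queen']]
stopwords = {'is', 'a'}
for sentence in corpus:
    # Slice assignment replaces the contents of the same list object,
    # so the nested structure and any references to the sub-lists are kept.
    sentence[:] = [word for word in sentence if word not in stopwords]

print(corpus)
```

The first part visits indices 0, 1, 2 and 3 only, and the word at index 2 is already 'brave' by the time the loop reaches it. The second part is one option among several; the [:] copy shown above works equally well for short lists.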

Upvotes: 2
