Reputation: 1722
I have a list of lists, in which each inner-list is a tokenized text, so its length is the number of words in the text.
corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]
Now, I want to create a set that contains all unique tokens from the corpus. For the above example, the desired output would be:
{'this', 'is', 'text', 'one', 'two'}
Currently, I have:
from itertools import chain

all_texts_list = list(chain(*corpus))
vocabulary = set(all_texts_list)
But this seems memory-inefficient, since it builds the full flattened list before constructing the set.
Is there a more efficient way to obtain this set?
I found this link. However, there they want to find the set of unique lists and not the set of unique elements from the list.
Upvotes: 1
Views: 111
Reputation: 6234
You can use a simple for loop with a set update operation.
vocabulary = set()
for tokens in corpus:
    vocabulary.update(tokens)
Output:
{'this', 'one', 'text', 'two', 'is'}
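If you prefer to keep the itertools approach from the question, a one-pass alternative (a sketch, not benchmarked here) is to pass a lazy `chain.from_iterable` iterator straight to `set()`, which avoids materializing the intermediate flattened list:

```python
from itertools import chain

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

# chain.from_iterable yields tokens one at a time, so set() consumes
# the tokens lazily instead of building a full flattened list first.
vocabulary = set(chain.from_iterable(corpus))
print(vocabulary == {'this', 'is', 'text', 'one', 'two'})  # True
```

A set comprehension, `{token for tokens in corpus for token in tokens}`, gives the same result with no import.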
Upvotes: 1