Emil

Reputation: 1722

Memory efficient way to create a set from a list of lists in Python

I have a list of lists in which each inner list is a tokenized text, so its length is the number of words in that text.

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

Now, I want to create a set that contains all unique tokens from the corpus. For the above example, the desired output would be:

{'this', 'is', 'text', 'one', 'two'}

Currently, I have:

from itertools import chain
all_texts_list = list(chain(*corpus))
vocabulary = set(all_texts_list)

But this seems memory-inefficient, because it materializes every token in an intermediate list before building the set.

Is there a more efficient way to obtain this set?


I found this link. However, that question asks for the set of unique lists, not the set of unique elements across the lists.

Upvotes: 1

Views: 111

Answers (1)

Vishal Singh

Reputation: 6234

You can use a simple for loop with the set's update method, which adds each inner list's tokens to the set without building an intermediate list.

vocabulary = set()

for tokens in corpus:
    vocabulary.update(tokens)

Output (sets are unordered, so the element order may differ):

{'this', 'one', 'text', 'two', 'is'}
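
As an alternative to the explicit loop, here is a minimal sketch of two one-line variants that should also avoid the intermediate list. It assumes the same `corpus` shape as in the question; the name `vocabulary_alt` is introduced here only for illustration.

```python
from itertools import chain

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

# chain.from_iterable yields tokens one at a time, so set() consumes them
# without a full flattened list ever being stored.
vocabulary = set(chain.from_iterable(corpus))

# A nested set comprehension does the same thing without itertools.
vocabulary_alt = {token for tokens in corpus for token in tokens}

# Both variants produce the same set of unique tokens.
print(vocabulary == vocabulary_alt)
```

In all of these, peak memory is dominated by the resulting set itself, so which one to use is mostly a matter of readability.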

Upvotes: 1
