Reputation: 1722
I have a list of lists, in which each inner-list is a tokenized text, so its length is the number of words in the text.
corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]
Now, I want to create a set that contains all unique tokens from the corpus. For the above example, the desired output would be:
{'this', 'is', 'text', 'one', 'two'}
Currently, I have:
from itertools import chain

all_texts_list = list(chain(*corpus))
vocabulary = set(all_texts_list)
But this seems memory-inefficient, since it builds the full flattened list before constructing the set.
Is there a more efficient way to obtain this set?
I found this link. However, there they want to find the set of unique lists and not the set of unique elements from the list.
Upvotes: 1
Views: 111
Reputation: 6234
You can use a simple for loop with a set update operation.
vocabulary = set()
for tokens in corpus:
    vocabulary.update(tokens)
Output:
{'this', 'one', 'text', 'two', 'is'}
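If you prefer to keep the itertools approach from the question, a one-pass alternative (a sketch, not benchmarked here) is to pass a lazy `chain.from_iterable` iterator straight to `set()`, which avoids materializing the intermediate flattened list:

```python
from itertools import chain

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

# chain.from_iterable yields tokens one at a time, so set() consumes
# the tokens lazily instead of building a full flattened list first.
vocabulary = set(chain.from_iterable(corpus))
print(vocabulary == {'this', 'is', 'text', 'one', 'two'})  # True
```

A set comprehension, `{token for tokens in corpus for token in tokens}`, gives the same result with no import.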
Upvotes: 1