Reputation: 159
Suppose I have a vocabulary: ['hello','how','are','you']. I have a corpus of many texts, for example: ['hello','how','how']. Is there an efficient way to encode each text as a list of integers? For example, if I assign 'hello' = 1, 'how' = 2, 'are' = 3, 'you' = 4, then the text above would be encoded as [1,2,2].
My context: I have to encode a corpus of about 150,000 texts. The vocabulary contains about 200,000 words, and each text generally contains fewer than 200 words.
I tried the following code, but it does not seem efficient. It takes about 2 seconds/text, so it would take me 8-9 hours to finish.
tokens_to_index = [[vocabulary.index(word)+1 for word in text] for text in corpus]
Upvotes: 0
Views: 525
Reputation: 66
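Try using a dictionary instead. vocabulary.index(word) scans the list from the start on every call, so each lookup costs time proportional to the vocabulary size, whereas a dictionary lookup takes roughly constant time on average.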
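# Map each word to its 1-based position in the vocabulary list
vocabulary = dict(zip(vocabulary, range(1, len(vocabulary) + 1)))
def tokens_to_index(corpus):
    # One dictionary lookup per word instead of a linear list search
    return [[vocabulary[word] for word in text] for text in corpus]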
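If some texts may contain words that are not in the vocabulary, a plain vocabulary[word] lookup raises KeyError. Here is a minimal sketch of one way to handle that, using the example data from the question. The names word_to_index, encode_corpus, and UNKNOWN are made up for illustration, and reserving index 0 for unknown words is an assumption, not something the question asks for.

```python
vocabulary = ['hello', 'how', 'are', 'you']
corpus = [['hello', 'how', 'how'], ['are', 'you', 'there']]

# Build the word -> index mapping once; indices start at 1 as in the question.
word_to_index = {word: i for i, word in enumerate(vocabulary, start=1)}

# Hypothetical placeholder index for out-of-vocabulary words (an assumption).
UNKNOWN = 0

def encode_corpus(corpus, word_to_index, unknown=UNKNOWN):
    # dict.get returns `unknown` instead of raising KeyError for missing words.
    return [[word_to_index.get(word, unknown) for word in text] for text in corpus]

encoded = encode_corpus(corpus, word_to_index)
print(encoded)  # [[1, 2, 2], [3, 4, 0]]
```

Keeping the mapping under a separate name also leaves the original vocabulary list intact, which helps if you later need to turn indices back into words.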
Upvotes: 2