How to efficiently encode a sequence of words as a sequence of integers

Question

Suppose I have a vocabulary: ['hello','how','are','you']. I have a corpus of many texts, for example: ['hello','how','how']. Is there an efficient way to encode such a text as a list of integers? For example, if I assign 'hello' = 1, 'how' = 2, 'are' = 3, 'you' = 4, then the text above would be encoded as [1,2,2].

My context: I have to encode a corpus of about 150,000 texts. The vocabulary contains about 200,000 words. In general, each text contains fewer than 200 words.

I tried the following code, but it is not efficient. It takes about 2 seconds/text, so it would take me 8-9 hours to finish.

tokens_to_index = [[vocabulary.index(word)+1 for word in text] for text in corpus]

Fadlullah Olawumi · Accepted Answer

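Try using a dictionary instead. vocabulary.index(word) scans the list from the start on every call, whereas a dictionary lookup takes constant time on average.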

vocabulary = dict(zip(vocabulary, range(1, len(vocabulary) + 1)))

def tokens_to_index(corpus):
    return [[vocabulary[word] for word in text] for text in corpus]
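For reference, here is a minimal self-contained sketch of the dictionary approach on the sample data from the question. The names word_to_index, encode_corpus and UNKNOWN_INDEX are made up for illustration, and mapping out-of-vocabulary words to 0 is an assumption (the question does not say how such words should be handled):

```python
# Sample data based on the question; the second text is a made-up example.
vocabulary = ['hello', 'how', 'are', 'you']
corpus = [['hello', 'how', 'how'], ['are', 'you', 'there']]

# Build the word -> index mapping once; indices start at 1 as in the question.
word_to_index = {word: i for i, word in enumerate(vocabulary, start=1)}

# Assumption: 0 is reserved for words that are not in the vocabulary.
UNKNOWN_INDEX = 0

def encode_corpus(corpus, word_to_index, unknown_index=UNKNOWN_INDEX):
    """Encode each text (a list of words) as a list of integer indices."""
    return [[word_to_index.get(word, unknown_index) for word in text]
            for text in corpus]

encoded = encode_corpus(corpus, word_to_index)
print(encoded)  # [[1, 2, 2], [3, 4, 0]]
```

Using word_to_index.get(word, unknown_index) avoids a KeyError when a text contains a word missing from the vocabulary. If every word is guaranteed to be in the vocabulary, plain word_to_index[word] works as well.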
