SiXUlm
SiXUlm

Reputation: 159

how to efficiently encode sequence of word to sequence of integers

Suppose I have a vocabulary: ['hello','how','are','you']. I have a corpus of many texts, for example: ['hello','how','how']. Is there any efficient way to encode this text into a list of integer, for example if I assign 'hello' = 1, 'how' = 2, 'are' = 3, 'you' = 4, then my text above will be encoded as [1,2,2].

My context: I have to encode a corpus of about 150,000 texts. The size of vocabulary is about 200,000. In general, each text contains about <200 words.

I tried the following code but it seems not efficient. It takes about 2 seconds/text, so it would take me 8-9 hours to finish.

tokens_to_index = [[vocabulary.index(word)+1 for word in text] for text in corpus]

Upvotes: 0

Views: 525

Answers (2)

Fadlullah Olawumi
Fadlullah Olawumi

Reputation: 66

try using a dictionary instead

vocabulary = dict(zip(vocabulary, range(1, len(vocabulary)+1) )) def tokens_to_index(corpus): return [[vocabulary[word] for word in text] for text in corpus]

Upvotes: 2

hemanth rs
hemanth rs

Reputation: 74

i'm not sure but Try dictionary u can use key: value pairs

Upvotes: 1

Related Questions