Python sentence tokenizing according to the word index of a dictionary

Question

I have a vocabulary of the form dic = {'a': 30, 'the': 29, ...}, where each key is a word and each value is that word's count.

I have some sentences, like:

"this is a test"

"an apple"

....

To tokenize the sentences, I want to encode each sentence as a list of word indices from the dictionary. If a word in the sentence exists in the dictionary, use that word's index; otherwise use 0.

For example, if I set the sentence dimension to 6 and a sentence has fewer than 6 words, it should be padded with 0s to length 6.

"this is the test" ----> [2, 0, 2, 4, 0, 0]

"an apple" ----> [5, 0, 0, 0, 0, 0]

Here is my sample code:

words = ['the', 'a', 'an']  # use this list as my dictionary
X = []

with open('test.csv', 'r') as infile:
    for line in infile:
        for word in line:
            if word in words:
                X.append(words.index(word))
            else:
                X.append(0)

My code does not produce the correct output. In addition, I do not know how to set the sentence dimension or how to pad the sentences.
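Two things in the sample code likely explain the wrong output: iterating over `line` yields single characters rather than words, and `words.index('the')` returns 0, which collides with the 0 used for unknown words. One possible approach is sketched below. It rests on assumptions that are not stated in the question: a word's index is taken to be its rank by descending count, starting at 1 so that 0 stays free; words are split on whitespace and lowercased; and sentences longer than the dimension are truncated. The helper names `build_word_index` and `encode`, and the counts for 'an' and 'test', are made up for illustration.

```python
def build_word_index(counts):
    """Map each word to a 1-based index, most frequent word first.

    Index 0 is left free so it can mean "unknown word" or "padding".
    Ties are broken alphabetically to keep the result deterministic.
    """
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def encode(sentence, word_index, max_len=6):
    """Encode a sentence as a fixed-length list of word indices."""
    tokens = sentence.lower().split()            # split into words, not characters
    ids = [word_index.get(tok, 0) for tok in tokens]  # unknown words become 0
    ids = ids[:max_len]                          # truncate long sentences (assumption)
    return ids + [0] * (max_len - len(ids))      # pad short sentences with 0


# Hypothetical counts; only 'a' and 'the' come from the question.
counts = {'a': 30, 'the': 29, 'an': 12, 'test': 5}
word_index = build_word_index(counts)            # {'a': 1, 'the': 2, 'an': 3, 'test': 4}

print(encode("this is a test", word_index))      # [0, 0, 1, 4, 0, 0]
print(encode("an apple", word_index))            # [3, 0, 0, 0, 0, 0]
```

To process a file, the same function can be applied per line, for example `X = [encode(line, word_index) for line in infile]`, which gives one fixed-length list per sentence instead of a single flat list. Note that with this scheme 0 stands for both unknown words and padding; if the two need to be distinguished, a separate index would have to be reserved for one of them.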
