Kun

Reputation: 591

Python: tokenizing sentences according to the word index of a dictionary

I have a vocabulary of the form dic = {'a':30, 'the':29,....}, where each key is a word and each value is that word's count.

I have some sentences, like:

"this is a test"

"an apple"

....

To tokenize the sentences, I want to encode each sentence as a list of dictionary word indices. If a word in the sentence also exists in the dictionary, use that word's index; otherwise use 0.

For example, I set the sentence dimension to 6; if a sentence has fewer than 6 words, it is padded with 0s to length 6:

"this is the test" ----> [2, 0, 2, 4, 0, 0]
"an apple" ----> [5, 0, 0, 0, 0, 0]

Here is my sample code:

words=['the','a','an'] #use this list as my dictionary
X=[]

with open('test.csv','r') as infile:
    for line in infile:
        for word in line:
            if word in words:
                X.append(words.index(word))
            else: X.append(0)

My code does not produce the correct output. In addition, I don't know how to set the sentence dimension or how to do the padding.

Upvotes: 0

Views: 2594

Answers (1)

ajoseps

Reputation: 2121

There are a couple of issues with your code:

  1. You're tokenizing on characters, not words: looping over a line (a string) yields one character at a time. You need to split each line into words

  2. You're appending into one large list, instead of a list of lists representing each sentence/line

  3. Like you said, you don't limit the size of the list

  4. I also don't understand why you're using a list as a dictionary
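To illustrate the first point, here is a minimal sketch (the sample string is just an illustration, not your actual data) showing the difference between iterating over a line and iterating over its words:

```python
line = "this is a test\n"

# Iterating over a string yields single characters, not words
chars = [ch for ch in line]

# split() with no argument splits on any whitespace,
# which also discards the trailing newline
tokens = line.split()

print(chars[:4])
print(tokens)
```

This is why `word in words` almost never matched in the original loop: each `word` was actually a single character.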

I edited your code below, and I think it aligns better with your specifications:

# Maps each word to its index; words not in the mapping are encoded as 0
words = {'the': 2, 'a': 1, 'an': 3}
MAX_LEN = 6  # fixed sentence dimension
X = []

with open('test.csv', 'r') as infile:
    for line in infile:
        # Start each sentence as [0, 0, 0, 0, 0, 0], so padding is built in
        sub_X = [0] * MAX_LEN

        # split() splits on whitespace when no argument is given;
        # slicing to MAX_LEN drops any words beyond the sentence dimension
        for idx, word in enumerate(line.split()[:MAX_LEN]):
            if word in words:
                sub_X[idx] = words[word]

        # X is the overall list; sub_X is one encoded sentence
        X.append(sub_X)
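If you want to reuse this outside the file loop, the same idea could be wrapped in a small function. This is only a sketch: `encode_sentence` and its `max_len` parameter are names I made up for illustration, and the sentences are taken from your question rather than read from test.csv.

```python
def encode_sentence(sentence, word_index, max_len=6):
    """Map each word to its index (0 if unknown), then truncate or pad to max_len."""
    # dict.get returns 0 for words that are not in the mapping;
    # the slice drops any words beyond max_len
    encoded = [word_index.get(word, 0) for word in sentence.split()[:max_len]]
    # Pad with zeros on the right until the list has max_len entries
    return encoded + [0] * (max_len - len(encoded))


word_index = {'the': 2, 'a': 1, 'an': 3}  # sample mapping, as in the answer
sentences = ["this is a test", "an apple"]

X = [encode_sentence(s, word_index) for s in sentences]
print(X)  # [[0, 0, 1, 0, 0, 0], [3, 0, 0, 0, 0, 0]]
```

Using `dict.get` with a default of 0 handles unknown words without a separate `if`, and building the list first and padding afterwards keeps truncation and padding in one place.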

Upvotes: 1
