Reputation: 591
I have a vocabulary of the form dic = {'a': 30, 'the': 29, ...}, where each key is a word and each value is that word's count.
I have some sentences, like:
"this is a test"
"an apple"
....
To tokenize the sentences, I want to encode each sentence as a list of word indices from the dictionary. If a word in the sentence exists in the dictionary, use that word's index; otherwise use 0.
For example, with the sentence dimension set to 6, a sentence shorter than 6 words is padded with 0s up to length 6.
"this is the test" ----> [2, 0, 2, 4, 0, 0]
"an apple" ----> [5, 0, 0, 0, 0, 0]
Here is my sample code:
words = ['the', 'a', 'an']  # use this list as my dictionary
X = []
with open('test.csv', 'r') as infile:
    for line in infile:
        for word in line:
            if word in words:
                X.append(words.index(word))
            else:
                X.append(0)
My code has a problem: the output is not correct. In addition, I don't know how to set the sentence dimension or how to pad the sentences.
Upvotes: 0
Views: 2594
Reputation: 2121
There are a few issues with your code:
You're iterating over characters, not words. You need to split each line into words.
You're appending everything to one large list instead of building a list of lists, with one sublist per sentence/line.
As you said, you don't limit the size of the list.
I also don't understand why you're using a list as a dictionary.
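To see the first issue in isolation, here is a minimal sketch using a sample sentence from the question (the variable names are only for illustration). Looping over a string visits one character at a time, while split() returns whole words:

```python
line = "this is a test\n"

# Iterating over a string yields single characters, not words
chars = [ch for ch in line]
print(chars[:4])

# split() with no argument splits on any whitespace and drops the trailing newline
tokens = line.split()
print(tokens)
```

This is why a check like `word in words` almost never matches in the original loop: each "word" is really a single character, so only one-letter entries such as 'a' can ever be found.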
I edited your code below, and I think it aligns better with your specifications:
words = {'the': 2, 'a': 1, 'an': 3}
X = []
with open('test.csv', 'r') as infile:
    for line in infile:
        # Inits the sublist to [0, 0, 0, 0, 0, 0]
        sub_X = [0] * 6
        # Enumerates each word in the list with an index
        # split() splits a string by whitespace if no arg is given
        for idx, word in enumerate(line.split()):
            if word in words:
                # Check if the idx is within bounds before accessing
                if idx < 6:
                    sub_X[idx] = words[word]
        # X represents the overall list and sub_X the sentence
        X.append(sub_X)
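If you want to start from the word-count dictionary in the question rather than a hand-written index, the same idea can be wrapped in two small functions. This is only a sketch under some assumptions: the helper names build_index and encode are made up for illustration, indices are assigned by descending count starting at 1 so that 0 stays free for unknown words and padding, the count for 'an' is invented, and tokens are lowercased before lookup.

```python
def build_index(counts):
    """Map each word to a 1-based index, most frequent word first.

    Index 0 is left unused so it can mean "unknown word" or "padding".
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {word: i for i, word in enumerate(ranked, start=1)}


def encode(sentence, index, max_len=6):
    """Encode a sentence as exactly max_len word indices."""
    # Truncate to max_len tokens; lowercasing is an assumption, not from the question
    tokens = sentence.lower().split()[:max_len]
    # Unknown words map to 0
    ids = [index.get(tok, 0) for tok in tokens]
    # Pad with 0s on the right up to max_len
    return ids + [0] * (max_len - len(ids))


# Hypothetical counts; 'a' and 'the' follow the question, 'an' is invented
counts = {'a': 30, 'the': 29, 'an': 5}
index = build_index(counts)

print(encode("this is a test", index))  # [0, 0, 1, 0, 0, 0]
print(encode("an apple", index))        # [3, 0, 0, 0, 0, 0]
```

Slicing the token list before the lookup handles sentences longer than max_len, so no separate bounds check is needed inside the loop.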
Upvotes: 1