Reputation: 392
The code below just produces the term-document matrix. Can it be made more efficient?
PREPROCESSED = ['He is a good boy', 'he loves studying']
DICTIONARY = ['He', 'is', 'a', 'good', 'boy', 'loves', 'studying']
MATRIX = []
for sent in PREPROCESSED:
    temp = []
    for i in DICTIONARY:
        count = 0
        for words in sent.split():
            if i == words:
                count = count + 1
        temp.append(count)
    test = 0
    for i in temp:
        if i != 0:
            test = 1
    if test == 1:
        MATRIX.append(temp)
    del temp
Upvotes: 0
Views: 24
Reputation: 357
I tried to rework the algorithm, but you can't really do better than
Here is the code with some minor changes (they help if the lists grow a lot):
PREPROCESSED = ['He is a good boy', 'he loves studying']
DICTIONARY = ['He', 'is', 'a', 'good', 'boy', 'loves', 'studying']
MATRIX = []
for sent in PREPROCESSED:
    temp = []
    tmpSent = sent.split()  # runs once instead of len(DICTIONARY) times
    for i in DICTIONARY:
        count = 0
        for word in tmpSent:
            if i == word:
                count += 1
        temp.append(count)
    for i in temp:
        if i != 0:
            # removes an extra test
            MATRIX.append(temp)
            break
    del temp
print(MATRIX)
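If you want to go one step further, here is a rough sketch of a variation, not a definitive version. It swaps the inner counting loop for `collections.Counter` from the standard library, so each sentence is scanned once and each dictionary term becomes a lookup. The helper name `build_matrix` is made up for illustration, and I am assuming you want to keep the same exact, case-sensitive matching as above (so 'he' does not count towards 'He').

```python
from collections import Counter

PREPROCESSED = ['He is a good boy', 'he loves studying']
DICTIONARY = ['He', 'is', 'a', 'good', 'boy', 'loves', 'studying']


def build_matrix(sentences, dictionary):
    """Return one row of term counts per sentence, skipping all-zero rows.

    build_matrix is a hypothetical helper name, not part of the original code.
    """
    matrix = []
    for sent in sentences:
        counts = Counter(sent.split())  # one pass over the sentence's words
        # A Counter returns 0 for keys it has not seen, so no extra check is needed.
        row = [counts[term] for term in dictionary]
        if any(row):  # keep the row only if at least one term occurred
            matrix.append(row)
    return matrix


MATRIX = build_matrix(PREPROCESSED, DICTIONARY)
print(MATRIX)
# → [[1, 1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 1, 1]]
```

For two short sentences the difference is not measurable. It should only start to matter when the sentences or the dictionary get long, because the work per sentence is roughly len(words) + len(DICTIONARY) instead of len(words) * len(DICTIONARY).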
Upvotes: 1