Reputation: 13
When I pass a list of strings into this function, I want it to return a matrix saying how many times each unique word appears in each string. Instead, I get a matrix with the values for the first string repeated 4 times. This is the code:
def tf(corp):
    words_set = set()
    for i in corp:
        a=i.split(' ')
        for j in a:
            words_set.add(j)
    words_dict = {i:0 for i in words_set}
    wcount=0
    matr=list()
    for doc in corp:
        for worduni in words_dict:
            count=0
            for words in doc.split(' '):
                if words==worduni:
                    count+=1
            words_dict[worduni]=count/len(doc.split(' '))
        print(words_dict)
        matr.append(words_dict)
    return matr
When I print the value of matr, I get:
[{'the': 0.2,
'first': 0.2,
'document': 0.2,
'third': 0.0,
'is': 0.2,
'one': 0.0,
'and': 0.0,
'this': 0.2,
'second': 0.0},
{'the': 0.2,
'first': 0.2,
'document': 0.2,
'third': 0.0,
'is': 0.2,
'one': 0.0,
'and': 0.0,
'this': 0.2,
'second': 0.0},
{'the': 0.2,
'first': 0.2,
'document': 0.2,
'third': 0.0,
'is': 0.2,
'one': 0.0,
'and': 0.0,
'this': 0.2,
'second': 0.0},
{'the': 0.2,
'first': 0.2,
'document': 0.2,
'third': 0.0,
'is': 0.2,
'one': 0.0,
'and': 0.0,
'this': 0.2,
'second': 0.0}]
Upvotes: 1
Views: 647
Reputation: 718826
What your code is doing is repeatedly appending the same object (words_dict) to matr. Since matr is a list, it can handle this, and you end up with multiple references to the same dictionary. Meanwhile, you keep updating that dictionary. So what you see when you print the list is the final state of the dictionary, N times.
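The aliasing described above can be reproduced in a few lines. This is a minimal sketch with made-up names (state, snapshots) chosen only for illustration; it is not the code from the question:

```python
# Hypothetical names, used only to illustrate the aliasing.
state = {"count": 0}
snapshots = []

for n in range(3):
    state["count"] = n       # mutate the one and only dictionary
    snapshots.append(state)  # appends a reference to it, not a copy

# Every list element is the very same object, so all of them
# show whatever the dictionary looked like after the last update.
all_same_object = all(item is state for item in snapshots)

print(snapshots)        # [{'count': 2}, {'count': 2}, {'count': 2}]
print(all_same_object)  # True
```

The list has three entries, but there is only one dictionary behind them.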
Now I suspect that you intended to save snapshots of the state of words_dict in matr. If that is what you want to do, you need to save copies of words_dict in matr, e.g.:
matr.append(words_dict.copy())
On the other hand, if you intend to generate a separate word frequency dictionary for each doc in corp, then you need to move the creation and initialization of words_dict inside the outer loop.
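A rough sketch of that second option, assuming the goal is one frequency dictionary per document over a shared vocabulary. The function name tf_per_doc, the variable names, and the sample corpus are assumptions for illustration, not taken from the question:

```python
def tf_per_doc(corp):
    # Collect the vocabulary across all documents.
    vocabulary = set()
    for doc in corp:
        vocabulary.update(doc.split(' '))

    matr = []
    for doc in corp:
        tokens = doc.split(' ')
        # Created inside the outer loop: a fresh dictionary per document,
        # so every entry in matr is a distinct object.
        words_dict = {word: 0 for word in vocabulary}
        for token in tokens:
            words_dict[token] += 1
        # Turn raw counts into frequencies relative to document length.
        for word in words_dict:
            words_dict[word] /= len(tokens)
        matr.append(words_dict)
    return matr


# Assumed sample input, only to show the shape of the result.
result = tf_per_doc(["this is the first document",
                     "and this is the third one"])
print(result[0]["first"])  # 0.2
print(result[1]["first"])  # 0.0
```

Because each document gets its own dictionary, no copy() call is needed here.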
Separately from the above, the way you are counting the words and computing the frequency seems to be wrong, assuming that computing word frequencies is what you are trying to do here.
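If a per-document word frequency is indeed the goal, one possible way to do the counting is with collections.Counter from the standard library. This is only a sketch under that assumption; the function name word_frequencies is hypothetical, and it uses split() (any whitespace) rather than the split(' ') from the question:

```python
from collections import Counter


def word_frequencies(doc):
    # Split on any run of whitespace; an empty document yields no tokens.
    tokens = doc.split()
    counts = Counter(tokens)  # word -> number of occurrences
    total = len(tokens)
    # Relative frequency of each word that occurs in this document.
    return {word: n / total for word, n in counts.items()}


print(word_frequencies("this is the first document"))
```

Note that the result only contains words that actually occur in the document; words from other documents would need to be filled in with 0 separately.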
Note: if you use more meaningful method and variable names and/or add appropriate comments to your code, it will be easier for other people to understand what your code is intended to do.
Upvotes: 1
Reputation: 7214
I modified this to give you non-duplicated data that matches what your print shows:
def tf(corp):
    # Build the set of unique words across all documents.
    words_set = set()
    for i in corp:
        a=i.split(' ')
        for j in a:
            words_set.add(j)
    words_dict = {i:0 for i in words_set}
    wcount=0
    matr=list()
    for doc in corp:
        # Recompute the frequency of every known word for this document.
        for worduni in words_dict:
            count=0
            for words in doc.split(' '):
                if words==worduni:
                    count+=1
            words_dict[worduni]=count/len(doc.split(' '))
        print(words_dict)
        # Append a copy so later updates do not change this entry.
        matr.append(words_dict.copy())
    return matr
Upvotes: 0