Vivek Sreedhar

Reputation: 13

Why is list().append() giving me duplicated values?

When I pass a list of strings into this function, I want it to return a matrix saying how many times each unique word appears in each string. Instead, I get a matrix with the values for the first string repeated 4 times. This is the code:

def tf(corp):
    words_set = set()
    for i in corp:
        a=i.split(' ')
        for j in a:
            words_set.add(j)
    words_dict = {i:0 for i in words_set}
    wcount=0
    matr=list()
    for doc in corp:
        for worduni in words_dict:
            count=0
            for words in doc.split(' '):
                if words==worduni:
                    count+=1
            words_dict[worduni]=count/len(doc.split(' '))
        print(words_dict)
        matr.append(words_dict)

    return matr

When I print the value of matr, I get:
[{'the': 0.2, 'first': 0.2, 'document': 0.2, 'third': 0.0, 'is': 0.2, 'one': 0.0, 'and': 0.0, 'this': 0.2, 'second': 0.0}, {'the': 0.2, 'first': 0.2, 'document': 0.2, 'third': 0.0, 'is': 0.2, 'one': 0.0, 'and': 0.0, 'this': 0.2, 'second': 0.0}, {'the': 0.2, 'first': 0.2, 'document': 0.2, 'third': 0.0, 'is': 0.2, 'one': 0.0, 'and': 0.0, 'this': 0.2, 'second': 0.0}, {'the': 0.2, 'first': 0.2, 'document': 0.2, 'third': 0.0, 'is': 0.2, 'one': 0.0, 'and': 0.0, 'this': 0.2, 'second': 0.0}]

Upvotes: 1

Views: 647

Answers (2)

Stephen C

Reputation: 718826

What your code is doing is repeatedly adding the same object (words_dict) to matr. Since matr is a list, it can handle this, and you end up with multiple references to the same dictionary. Meanwhile, you keep updating that dictionary. So what you see when you print the list is the final state of the dictionary, N times.
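To make the aliasing concrete, here is a minimal sketch of the same pattern in isolation. The names counts and rows are hypothetical and only stand in for words_dict and matr.

```python
# Hypothetical stand-ins for words_dict and matr from the question.
counts = {"a": 0}
rows = []

for value in (1, 2, 3):
    counts["a"] = value   # mutates the one and only dictionary
    rows.append(counts)   # appends another reference to it, not a snapshot

print(rows)  # every entry shows the final state of the shared dictionary

# Every list element is the very same object as counts.
same_object = all(row is counts for row in rows)
print(same_object)
```

Because all three entries are the same object, changing counts afterwards would change every row at once.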

Now I suspect that you intended to save snapshots of the state of words_dict in matr. If that is what you want to do, you need to save copies of words_dict in matr; e.g.

    matr.append(words_dict.copy())

On the other hand, if you intend to generate a separate word frequency dictionary for each doc in corp, then you need to move the creation and initialization of words_dict inside the outer loop.
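A rough sketch of that second option, assuming the goal is one frequency dictionary per document. The function name tf_per_doc and the tiny corpus are hypothetical, not taken from the question.

```python
def tf_per_doc(corp):
    # Build the vocabulary once, as in the question.
    vocabulary = set()
    for doc in corp:
        vocabulary.update(doc.split(' '))

    matr = []
    for doc in corp:
        tokens = doc.split(' ')
        # A fresh dictionary per document, so earlier rows are never overwritten.
        words_dict = {word: 0 for word in vocabulary}
        for token in tokens:
            words_dict[token] += 1
        # Turn raw counts into relative frequencies for this document.
        for word in words_dict:
            words_dict[word] /= len(tokens)
        matr.append(words_dict)
    return matr


# Hypothetical two-document corpus, just to exercise the function.
rows = tf_per_doc(["a b", "b b c c"])
print(rows)
```

With this layout no copy() is needed, because each iteration of the outer loop appends a different dictionary object.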


Separately from the above, the way you are counting the words and computing the frequency is more roundabout than it needs to be: each document is split and rescanned once per unique word. I am assuming that term frequency is what you are trying to compute here.
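For comparison, a single-pass count can be sketched with collections.Counter from the standard library. This assumes that relative word frequency per document is the intended result; term_frequencies, corpus, and vocabulary are hypothetical names.

```python
from collections import Counter


def term_frequencies(doc, vocabulary):
    tokens = doc.split(' ')
    counts = Counter(tokens)  # one pass over the document
    # A Counter returns 0 for words that do not occur in this document.
    return {word: counts[word] / len(tokens) for word in vocabulary}


# Hypothetical corpus; the vocabulary is sorted only for stable key order.
corpus = ["a b", "b b c c"]
vocabulary = sorted({word for doc in corpus for word in doc.split(' ')})
matr = [term_frequencies(doc, vocabulary) for doc in corpus]
print(matr)
```

Each call builds a new dictionary, so this version also avoids the shared-reference problem.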


Note: if you use more meaningful method and variable names and/or add appropriate comments to your code, it will be easier for other people to understand what your code is intended to do.

Upvotes: 1

oppressionslayer

Reputation: 7214

I modified your function so that matr holds non-duplicated data identical to what each print shows. The only change is that a copy of words_dict is appended:

def tf(corp):
    words_set = set()
    for i in corp:
        a=i.split(' ')
        for j in a:
            words_set.add(j)
    words_dict = {i:0 for i in words_set}
    wcount=0
    matr=list()
    for doc in corp:
        for worduni in words_dict:
            count=0
            for words in doc.split(' '):
                if words==worduni:
                    count+=1
            words_dict[worduni]=count/len(doc.split(' '))
        print(words_dict)
        # append a snapshot, not a reference to the shared dictionary
        matr.append(words_dict.copy())

    return matr

Upvotes: 0
