Reputation: 2398

Remove duplicate JSON objects from list in Python

I have a list of dicts in which the same dict is repeated multiple times, and I would like to remove the duplicates.

My list:

te = [
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      }
    ]

Function to remove duplicate values:

def removeduplicate(it):
    seen = set()
    for x in it:
        if x not in seen:
            yield x
            seen.add(x)

When I call this function I get a generator object:

<generator object removeduplicate at 0x0170B6E8>

When I try to iterate over the generator I get TypeError: unhashable type: 'dict'.

Is there a way to remove the duplicate values, or to iterate over the generator?

Upvotes: 11

Answers (4)

Benny

Reputation: 755

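I hash the JSON serialization of each item with MD5 and compare the digests.

import hashlib
import json

filtered_json = []
seen_md5 = set()

# json_fin is the list of dicts to deduplicate
for item in json_fin:
    # sort_keys=True so that key order does not change the digest
    serialized = json.dumps(item, sort_keys=True, separators=(',', ':'))
    md5_result = hashlib.md5(serialized.encode("utf-8")).hexdigest()
    if md5_result not in seen_md5:
        seen_md5.add(md5_result)
        filtered_json.append(item)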
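
A variation on the same idea, offered only as a sketch: the serialized JSON string can itself serve as the set key, which makes the hashing step optional. The function name dedupe_by_json and the sample data are hypothetical, and this assumes every value in the dicts is JSON-serializable.

```python
import json


def dedupe_by_json(items):
    """Yield each distinct dict once, in first-seen order."""
    seen = set()
    for item in items:
        # Canonical form: sorted keys, so key order does not matter.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            yield item


# Hypothetical sample data: the second dict equals the first,
# only with its keys in a different order.
records = [
    {"Name": "Bala", "phone": "None"},
    {"phone": "None", "Name": "Bala"},
    {"Name": "Bala", "phone": "1234567890"},
]
unique_records = list(dedupe_by_json(records))
```

Keeping the strings in a set rather than a list makes each membership test O(1) on average.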

Upvotes: 1

mhawke

Reputation: 87084

You can still use a set for duplicate detection; you just need to convert each dictionary into something hashable, such as a tuple. A dictionary d can be converted with tuple(d.items()), provided its values are themselves hashable. Applying that to your generator function:

def removeduplicate(it):
    seen = set()
    for x in it:
        t = tuple(x.items())
        if t not in seen:
            yield x
            seen.add(t)

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}

>>> te.append({'Name': 'Bala', 'phone': '1234567890'})
>>> te.append({'Name': 'Someone', 'phone': '1234567890'})

>>> for d in removeduplicate(te):
...    print(d)
{'phone': 'None', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Someone'}

This provides faster lookup (average O(1)) than a "seen" list (O(n)). Whether it is worth the extra cost of converting every dict into a tuple depends on how many dictionaries you have and how many of them are distinct. If there are many distinct dictionaries, a "seen" list grows large, and testing whether a dict has already been seen becomes an expensive operation. That can justify the tuple conversion, but you would have to test and profile it.
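
One caveat with tuple(d.items()): on Python 3.7+, where dicts preserve insertion order, two dicts with equal content but different key order produce different tuples and would not be detected as duplicates. A possible variation, assuming all values are hashable, is to use frozenset(d.items()) as the key instead. The function name remove_duplicates_unordered is hypothetical.

```python
def remove_duplicates_unordered(dicts):
    """Yield each distinct dict once, ignoring key order."""
    seen = set()
    for d in dicts:
        # A frozenset of (key, value) pairs is hashable and order-independent.
        key = frozenset(d.items())
        if key not in seen:
            seen.add(key)
            yield d


a = {"Name": "Bala", "phone": "None"}
b = {"phone": "None", "Name": "Bala"}  # same content, different key order

# The tuple form distinguishes the two; the frozenset form does not.
tuples_differ = tuple(a.items()) != tuple(b.items())
result = list(remove_duplicates_unordered([a, b]))
```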

Upvotes: 2

Learner

Reputation: 5292

You can easily remove the duplicates with a dictionary comprehension keyed on Name, since a dictionary cannot contain duplicate keys:

te = [
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
        "Name": "Bala",
        "phone": "None"
      },
      {
          "Name": "Bala1",
          "phone": "None"
      }      
    ]

unique = list({each['Name']: each for each in te}.values())

print(unique)

Output (Python 3.7+, where dicts keep insertion order):

[{'Name': 'Bala', 'phone': 'None'}, {'Name': 'Bala1', 'phone': 'None'}]
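
Note that keying on each['Name'] treats two entries with the same name but different phone numbers as duplicates and keeps the last one. If only exact duplicates should collapse, one option is to key on all of the items. The following is a sketch assuming hashable values; the names people, by_name and by_content are illustrative only.

```python
# Hypothetical sample data: the third entry shares a name with the
# first two but has a different phone number.
people = [
    {"Name": "Bala", "phone": "None"},
    {"Name": "Bala", "phone": "None"},
    {"Name": "Bala", "phone": "1234567890"},
]

# Keyed on Name only: later entries overwrite earlier ones with the same name.
by_name = list({p["Name"]: p for p in people}.values())

# Keyed on the sorted (key, value) pairs: only exact duplicates collapse.
by_content = list({tuple(sorted(p.items())): p for p in people}.values())
```

With this data, by_name ends up with a single entry, while by_content keeps both distinct records.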

Upvotes: 39

Remi Guan

Reputation: 22292

You get the error because you can't add a dict to a set. Quoting from this question:

You're trying to use a dict as a key to another dict or in a set. That does not work because the keys have to be hashable.

As a general rule, only immutable objects (strings, integers, floats, frozensets, tuples of immutables) are hashable (though exceptions are possible).

>>> foo = dict()
>>> bar = set()
>>> bar.add(foo)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>

Since you're already testing with if x not in seen, you can simply use a list instead of a set:

>>> te = [
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       },
...       {
...         "Name": "Bala",
...         "phone": "None"
...       }
...     ]

>>> def removeduplicate(it):
...     seen = []
...     for x in it:
...         if x not in seen:
...             yield x
...             seen.append(x)

>>> removeduplicate(te)
<generator object removeduplicate at 0x7f3578c71ca8>

>>> list(removeduplicate(te))
[{'phone': 'None', 'Name': 'Bala'}]
>>>
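
One property of the list-based version worth noting: it relies only on ==, so it also handles dicts whose values are themselves unhashable (lists, nested dicts), at the cost of an O(n) scan of seen for every item. A small sketch with made-up data; the tags field is hypothetical.

```python
def removeduplicate(it):
    seen = []
    for x in it:
        # List membership uses ==, so no hashing is required.
        if x not in seen:
            yield x
            seen.append(x)


# Hypothetical records whose values include lists.
nested = [
    {"Name": "Bala", "tags": ["a", "b"]},
    {"Name": "Bala", "tags": ["a", "b"]},
    {"Name": "Bala", "tags": ["a"]},
]
deduped = list(removeduplicate(nested))

# A set key built from the items would fail for this data,
# because the list value makes the tuple unhashable.
try:
    hash(tuple(nested[0].items()))
    tuple_key_hashable = True
except TypeError:
    tuple_key_hashable = False
```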

Upvotes: 7
