Reputation: 2398
I have a list of dict where a particular value is repeated multiple times, and I would like to remove the duplicate values.
My list:
te = [
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
}
]
function to remove duplicate values:
def removeduplicate(it):
seen = set()
for x in it:
if x not in seen:
yield x
seen.add(x)
When I call this function I get generator object
.
<generator object removeduplicate at 0x0170B6E8>
When I try to iterate over the generator I get TypeError: unhashable type: 'dict'
Is there a way to remove the duplicate values or to iterate over the generator
Upvotes: 11
Views: 39927
Reputation: 755
I just use md5 to compare everything.
filtered_json = []
md5_list = []
for item in json_fin:
md5_result = hashlib.md5(json.dumps(item, separators=(',', ':')).encode("utf-8")).hexdigest()
if md5_result not in md5_list:
md5_list.append(md5_result)
filtered_json.append(item)
Upvotes: 1
Reputation: 87084
You can still use a set
for duplicate detection, you just need to convert the dictionary into something hashable such as a tuple
. Your dictionaries can be converted to tuples by tuple(d.items())
where d
is a dictionary. Applying that to your generator function:
def removeduplicate(it):
seen = set()
for x in it:
t = tuple(x.items())
if t not in seen:
yield x
seen.add(t)
>>> for d in removeduplicate(te):
... print(d)
{'phone': 'None', 'Name': 'Bala'}
>>> te.append({'Name': 'Bala', 'phone': '1234567890'})
>>> te.append({'Name': 'Someone', 'phone': '1234567890'})
>>> for d in removeduplicate(te):
... print(d)
{'phone': 'None', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Someone'}
This provides faster lookup (avg. O(1)) than a "seen" list
(O(n)). Whether it is worth the extra computation of converting every dict into a tuple depends on the number of dictionaries that you have and how many duplicates there are. If there are a lot of duplicates, a "seen" list
will grow quite large, and testing whether a dict has already been seen could become an expensive operation. This might justify the tuple conversion - you would have to test/profile it.
Upvotes: 2
Reputation: 5292
You can easily remove duplicate keys by dictionary comprehension, since dictionary does not allow duplicate keys, as below-
te = [
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala1",
"phone": "None"
}
]
unique = { each['Name'] : each for each in te }.values()
print unique
Output-
[{'phone': 'None', 'Name': 'Bala1'}, {'phone': 'None', 'Name': 'Bala'}]
Upvotes: 39
Reputation: 22292
Because you can't add a dict
to set
. From this question:
You're trying to use a
dict
as a key to anotherdict
or in aset
. That does not work because the keys have to be hashable.As a general rule, only immutable objects (strings, integers, floats, frozensets, tuples of immutables) are hashable (though exceptions are possible).
>>> foo = dict()
>>> bar = set()
>>> bar.add(foo)
Traceback (most recent call last):
File "<input>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
Instead, you're already using if x not in seen
, so just use a list:
>>> te = [
... {
... "Name": "Bala",
... "phone": "None"
... },
... {
... "Name": "Bala",
... "phone": "None"
... },
... {
... "Name": "Bala",
... "phone": "None"
... },
... {
... "Name": "Bala",
... "phone": "None"
... }
... ]
>>> def removeduplicate(it):
... seen = []
... for x in it:
... if x not in seen:
... yield x
... seen.append(x)
>>> removeduplicate(te)
<generator object removeduplicate at 0x7f3578c71ca8>
>>> list(removeduplicate(te))
[{'phone': 'None', 'Name': 'Bala'}]
>>>
Upvotes: 7