Reputation: 85
How can I check and remove duplicate values from items in a dictionary? I have a large data set so I'm looking for an efficient method. The following is an example of values in a dictionary that contains a duplicate:
'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
needs to become
'word': [('769817', [6]), ('769819', [4, 10])]
Upvotes: 1
Views: 4634
Reputation: 3123
How about that?
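def remove_duplicates(d: dict):
    # set(d.values()) raises TypeError here because the values are lists,
    # so each list is deduplicated with an order-preserving membership check.
    o = {}
    for k, v in d.items():
        unique_values = []
        for item in v:
            if item not in unique_values:
                unique_values.append(item)
        o[k] = unique_values
    return o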
Upvotes: 0
Reputation: 3525
This problem essentially boils down to removing duplicates from a list of unhashable types, for which converting to a set is not possible.
One possible method is to check for membership in the current value while building up a new list value.
d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
for k, v in d.items():
    new_list = []
    for item in v:
        if item not in new_list:
            new_list.append(item)
    d[k] = new_list
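To see why a plain set() conversion is not an option for this data, here is a minimal sketch (the variable names are illustrative only). Each tuple contains a list, and since lists are unhashable, the tuple that holds one is unhashable as well:

```python
items = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]

try:
    set(items)
    set_worked = True
except TypeError as exc:
    # Hashing a tuple hashes its elements, and lists cannot be hashed.
    set_worked = False
    error_message = str(exc)

print(set_worked)     # False
print(error_message)  # unhashable type: 'list'
```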
Alternatively, use itertools.groupby() for a more concise answer, although it is potentially slower because the list must be sorted first. If the list is already sorted, it is faster than doing a membership check. Note that sorting changes the order of the items.
import itertools
d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
for k, v in d.items():
    v.sort()
    d[k] = [item for item, _ in itertools.groupby(v)]
Output -> {'word': [('769817', [6]), ('769819', [4, 10])]}
Upvotes: 1
Reputation: 43235
You can uniquify the items based on a hashable key derived from each one. The key could be anything, such as json.dumps with sort_keys=True, or pickle.dumps (cPickle.dumps on Python 2).
This one-liner can uniquify your dict as required.
>>> d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
>>> import json
>>> {k: list({json.dumps(x, sort_keys=True): x for x in v}.values()) for k, v in d.items()}
{'word': [('769817', [6]), ('769819', [4, 10])]}
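The same idea can be spelled out as a small helper, which may be easier to read on a large data set. This is only a sketch: the function name dedupe_by_json_key is made up for illustration, and it assumes every item is JSON-serializable. It keeps the first occurrence of each item and preserves the original order.

```python
import json


def dedupe_by_json_key(items):
    """Return items without duplicates, keeping first occurrences in order."""
    seen = set()
    result = []
    for item in items:
        # json.dumps yields a hashable string. Tuples and lists serialize
        # identically, so ('a', [1]) and ['a', [1]] count as duplicates.
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
deduped = {k: dedupe_by_json_key(v) for k, v in d.items()}
print(deduped)  # {'word': [('769817', [6]), ('769819', [4, 10])]}
```

Each item is serialized once and looked up in a set, so the work grows roughly linearly with the number of items rather than quadratically.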
Upvotes: 0
Reputation: 22697
your_list = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
new = []
for x in your_list:
    if x not in new:
        new.append(x)
print(new)
# [('769817', [6]), ('769819', [4, 10])]
Upvotes: 0
Reputation: 155363
Strikethrough applied to original question before edits, left for posterity:
You're not using a dict at all, just a list of two-tuples, where the second element in each tuple is itself a list. If you actually want a dict, dict([('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]) will convert it and uniquify by key (so you'd end up with {'769817': [6], '769819': [4, 10]}), though it loses order on Python versions before 3.7, where dict ordering is not guaranteed, and doesn't pay attention to whether the values (the sub-lists) are unique or not (it just keeps the last pairing for a given key).
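A quick sketch of the "keeps the last pairing" behaviour, using made-up data in which the same key appears with two different values (the [99] entry is purely illustrative and not from the question):

```python
# The key '769819' appears twice with different sub-lists.
pairs = [('769819', [4, 10]), ('769817', [6]), ('769819', [99])]

as_dict = dict(pairs)

# The later pairing silently replaces the earlier one for the same key.
print(as_dict['769819'])  # [99]
print(len(as_dict))       # 2
```

So dict() is only a safe way to deduplicate when two entries with the same key are guaranteed to carry the same value.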
If you need to uniquify adjacent duplicates (where the values are important to uniqueness) while preserving order, and don't want/need a real dict, use itertools.groupby:
import itertools
nonuniq = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
uniq = [k for k, g in itertools.groupby(nonuniq)]
# uniq is [('769817', [6]), ('769819', [4, 10])]
# but it wouldn't work if the input was
# [('769819', [4, 10]), ('769817', [6]), ('769819', [4, 10])]
# because the duplicates aren't adjacent
If you need to collapse non-adjacent duplicates, and don't need to preserve order (or sorted order is fine), you can use groupby to get an O(n log n) solution (as opposed to naive solutions that create a new list and avoid duplicates by checking for presence in the new list at O(n^2) complexity, or set-based solutions that would be O(n) but require you to convert the sub-lists in your data to tuples to make them hashable):
# Only difference is sorting nonuniq before grouping
uniq = [k for k, g in itertools.groupby(sorted(nonuniq))]
# uniq is [('769817', [6]), ('769819', [4, 10])]
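For comparison, the set-based O(n) alternative mentioned above might look like the following sketch. The function name dedupe_with_set is hypothetical, and the code assumes each sub-list contains only hashable elements (such as ints), so that tuple(values) is itself hashable. Unlike the sorted groupby version, it preserves the original order and handles non-adjacent duplicates.

```python
def dedupe_with_set(pairs):
    """Order-preserving dedupe of (key, list) pairs using a set of markers."""
    seen = set()
    result = []
    for key, values in pairs:
        # Convert the inner list to a tuple so the pair becomes hashable.
        marker = (key, tuple(values))
        if marker not in seen:
            seen.add(marker)
            result.append((key, values))
    return result


nonuniq = [('769819', [4, 10]), ('769817', [6]), ('769819', [4, 10])]
uniq = dedupe_with_set(nonuniq)
print(uniq)  # [('769819', [4, 10]), ('769817', [6])]
```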
Upvotes: 0
Reputation: 16940
How about this: I am just focusing on the list part:
>>> s = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
>>> [(x,y) for x,y in {key: value for (key, value) in s}.items()]
[('769817', [6]), ('769819', [4, 10])]
>>>
Upvotes: 0
Reputation: 60954
You have a list, not a dictionary. Python dictionaries may have only one value for each key. Try
my_dict = dict([('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])])
Result:
{'769817': [6], '769819': [4, 10]}
which is a Python dictionary. For more information, see https://docs.python.org/3/tutorial/datastructures.html#dictionaries
Upvotes: 0