Brutalized

Reputation: 85

Remove duplicate values from items in a dictionary in Python

How can I check and remove duplicate values from items in a dictionary? I have a large data set so I'm looking for an efficient method. The following is an example of values in a dictionary that contains a duplicate:

'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]

needs to become

'word': [('769817', [6]), ('769819', [4, 10])]

Upvotes: 1

Views: 4634

Answers (7)

Philippe Remy

Reputation: 3123

How about that?

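    def remove_duplicates(d: dict) -> dict:
        # The values are lists of (str, list) tuples, which are unhashable,
        # so set() raises TypeError; deduplicate each list by membership check.
        o = {}
        for k, v in d.items():
            unique_items = []
            for item in v:
                if item not in unique_items:
                    unique_items.append(item)
            o[k] = unique_items
        return o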

Upvotes: 0

ospahiu

Reputation: 3525

This problem essentially boils down to removing duplicates from a list of unhashable items, for which converting to a set is not possible.

One possible method is to check for membership in the current value while building up a new list value.

d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
for k, v in d.items():
    new_list = []
    for item in v:
        if item not in new_list:
            new_list.append(item)
    d[k] = new_list

Alternatively, use itertools.groupby() for a more concise answer. It is potentially slower because the list must be sorted first (groupby() only collapses adjacent duplicates); if the list is already sorted, it is faster than doing a membership check.

import itertools

d = {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
for k, v in d.items():
    v.sort()
    d[k] = [item for item, _ in itertools.groupby(v)]

Output -> {'word': [('769817', [6]), ('769819', [4, 10])]}
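
One practical difference between the two approaches above is ordering. The following is a minimal sketch, using a made-up `items` list rather than the question's data, that suggests the membership check keeps first-occurrence order while sorting plus groupby() returns sorted order:

```python
import itertools

# Hypothetical sample data with a non-adjacent duplicate.
items = [('b', [2]), ('a', [1]), ('b', [2])]

# Membership check: keeps the first occurrence of each item, in original order.
by_membership = []
for item in items:
    if item not in by_membership:
        by_membership.append(item)

# sorted() + groupby(): duplicates become adjacent and are collapsed,
# but the result comes back in sorted order. sorted() leaves `items` untouched,
# whereas list.sort() would reorder the original list in place.
by_groupby = [item for item, _ in itertools.groupby(sorted(items))]

print(by_membership)  # [('b', [2]), ('a', [1])]
print(by_groupby)     # [('a', [1]), ('b', [2])]
```

If the original order matters, the membership check (or another order-preserving method) is probably the safer choice.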

Upvotes: 1

DhruvPathak

Reputation: 43235

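You can uniquify the items based on a hashable key derived from each one. The key could be anything that serializes the item deterministically, such as json.dumps with sort_keys=True or pickle.dumps (cPickle.dumps in Python 2). This one-liner uniquifies your dict as required.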

>>> d =  {'word': [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]}
>>> import json
>>> {k: list({json.dumps(x, sort_keys=True): x for x in v}.values()) for k, v in d.items()}
{'word': [('769817', [6]), ('769819', [4, 10])]}
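
To see why this works, it may help to look at the key that json.dumps produces for one item. The sketch below uses one tuple from the question; the caveat it illustrates (a tuple and a list with the same contents serialize to the same string) is an observation about the json module, not something stated in the answer:

```python
import json

item = ('769819', [4, 10])

# The serialized string is hashable, so it can serve as a dict key
# even though the item itself contains an unhashable list.
key = json.dumps(item, sort_keys=True)
print(key)  # ["769819", [4, 10]]

# Caveat: JSON has no tuple type, so a tuple and a list with the same
# contents produce the same key and would be treated as duplicates.
same_key = json.dumps(('769819', [4, 10])) == json.dumps(['769819', [4, 10]])
print(same_key)  # True
```

For data shaped like the question's (always a tuple of a string and a list), that caveat should not matter.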

Upvotes: 0

levi

Reputation: 22697

your_list = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
new = []
for x in your_list:
    if x not in new:
        new.append(x)

print(new)
# [('769817', [6]), ('769819', [4, 10])]

Upvotes: 0

ShadowRanger

Reputation: 155363

Strikethrough applied to original question before edits, left for posterity: You're not using a dict at all, just a list of two-tuples, where the second element in each tuple is itself a list. If you actually want a dict,

dict([('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])])

will convert it and uniquify by key, so you'd end up with {'769817': [6], '769819': [4, 10]}. On Python versions before 3.7 this loses order, and it doesn't pay attention to whether the values (the sub-lists) are unique or not; it just keeps the last pairing for a given key.
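
As a small illustration of the "keeps the last pairing" behaviour, here is a sketch with a hypothetical third pair, ('769819', [99]), that is not in the question's data:

```python
# Hypothetical input: the key '769819' appears twice with different values.
pairs = [('769819', [4, 10]), ('769817', [6]), ('769819', [99])]

# dict() keeps one entry per key; a later pair overwrites the earlier value.
as_dict = dict(pairs)
print(as_dict)  # {'769819': [99], '769817': [6]} on Python 3.7+
```

The earlier value [4, 10] is silently discarded, so this conversion is only appropriate when the key alone defines uniqueness.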

If you need to uniquify adjacent duplicates (where the values are important to uniqueness) while preserving order, and don't want/need a real dict, use itertools.groupby:

import itertools
nonuniq = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
uniq = [k for k, g in itertools.groupby(nonuniq)]
# uniq is [('769817', [6]), ('769819', [4, 10])]
# but it wouldn't work if the input was
# [('769819', [4, 10]), ('769817', [6]), ('769819', [4, 10])]
# because the duplicates aren't adjacent

If you need to collapse non-adjacent duplicates and don't need to preserve order (or sorted order is fine), you can sort first and then use groupby to get an O(n log n) solution. By comparison, naive solutions that build a new list and check each item for presence in it are O(n^2), and set-based solutions are O(n) but require you to convert the sub-lists in your data to tuples to make them hashable:

# Only difference is sorting nonuniq before grouping
uniq = [k for k, g in itertools.groupby(sorted(nonuniq))]
# uniq is [('769817', [6]), ('769819', [4, 10])]
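
For completeness, the set-based O(n) option mentioned above could look roughly like the following. This is a sketch under stated assumptions: `freeze` and `dedupe_preserving_order` are hypothetical helper names, and the approach assumes every item is built only from lists, tuples, and hashable leaves such as strings and integers.

```python
def freeze(obj):
    """Recursively convert lists and tuples into tuples so the result is hashable.

    Note: a list and a tuple with the same contents freeze to the same key.
    """
    if isinstance(obj, (list, tuple)):
        return tuple(freeze(x) for x in obj)
    return obj


def dedupe_preserving_order(items):
    """Keep the first occurrence of each item, using a set of frozen keys."""
    seen = set()
    result = []
    for item in items:
        key = freeze(item)
        if key not in seen:
            seen.add(key)
            result.append(item)  # store the original item, not the frozen key
    return result


# Non-adjacent duplicates, as in the problem case described earlier.
nonuniq = [('769819', [4, 10]), ('769817', [6]), ('769819', [4, 10])]
uniq = dedupe_preserving_order(nonuniq)
print(uniq)  # [('769819', [4, 10]), ('769817', [6])]
```

Unlike the sorted groupby version, this keeps the original order and does not require the items to be comparable with each other, only convertible to something hashable.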

Upvotes: 0

James Sapam

Reputation: 16940

How about this? I am focusing only on the list part. Note that this deduplicates by key, keeping the last value seen for each key:

>>> s = [('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])]
>>> [(x,y) for x,y in {key: value for (key, value) in s}.items()]
[('769817', [6]), ('769819', [4, 10])]
>>>

Upvotes: 0

Patrick Haugh

Reputation: 60954

You have a list, not a dictionary. Python dictionaries may have only one value for each key. Try

my_dict = dict([('769817', [6]), ('769819', [4, 10]), ('769819', [4, 10])])

result:

{'769817': [6], '769819': [4, 10]}

which is a Python dictionary. For more information, see https://docs.python.org/3/tutorial/datastructures.html#dictionaries

Upvotes: 0
