A1010
A1010

Reputation: 360

How to obtain a set of dictionaries?

I have a list of dictionaries, but some of them are duplicates and I want to remove them (the duplicates).

The keys of the dict are a sequential number.

An example is the following:

[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
{3: {a:[1,2,3], b: 4}},
.....,
{1000: {a:[2,5,1], b: 99}},
]

Considering the previous example I would like to obtain:

[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99}},
]

In fact the dictionaries with keys 1 and 3 are identically in their values.

I tried with a set, but since dict is a not hashable type I am not able to do so.

How can i fix the problem?

EDIT

In my case the number of items inside the dict is not fix, so I can have:

[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99, c:["a","v"]}},
]

where the dict with keys 100 has three elements inside insted of two as the other shown.

Upvotes: 5

Views: 10985

Answers (5)

Jacques Gaudin
Jacques Gaudin

Reputation: 16958

You can define a custom hash of your dictionaries by subclassing dict:

class MyData(dict):

    def __hash__(self):
        return hash((k, repr(v)) for k, v in self.items())

l = [
    {1: {'a': [1, 2, 3], 'b': 4}},
    {2: {'a': [4, 5, 6], 'd': 5}},
    {3: {'b': 4, 'a': [1, 2, 3]}},
    {4: {'a': (4, 5, 6), 'd': 5}},
]

s = set([MyData(*d.values()) for d in l])

This is assuming that all the dictionaries in the list have only one key-value pair.

Upvotes: 3

Lante Dellarovere
Lante Dellarovere

Reputation: 1858

I don't know how big is your list and how many duplicates are in it, but, just in case, here is the basic solution.
It might be not efficient, but you don't have to worry about elements type:

import datetime as dt

data = [
    {1: {"b": 4, "a":[1,2,3]}},
    {2: {"a":[4,5,6], "d": 5}},
    {3: {"a":[1,2,3], "b": 4}},
    {4: {'a': dt.datetime(2019, 5, 10), 'd': set([4])}},
    {5: {'a': dt.datetime(2019, 5, 10), 'd': set([4])}},
    {6: {"a":[2,5,1], "b": 99}},
    {7: {"a":[5,2,1], "b": 99}},
    {8: {"a":(5,2,1), "b": 99}}
]



seen = []
output = []
for d in data:
    for k, v in d.items():
        if v not in seen:
            seen.append(v)
            output.append({k:v})

>>> print(output)
[{1: {'a': [1, 2, 3], 'b': 4}},
 {2: {'a': [4, 5, 6], 'd': 5}},
 {4: {'a': datetime.datetime(2019, 5, 10, 0, 0), 'd': {4}}},
 {6: {'a': [2, 5, 1], 'b': 99}},
 {7: {'a': [5, 2, 1], 'b': 99}},
 {8: {'a': (5, 2, 1), 'b': 99}}]

Upvotes: 0

William Bright
William Bright

Reputation: 535

This is the simplest solution I've been able to come up with assuming the nested dictionary like

{1: {'a': [1,2,3,5,79], 'b': 234 ...}}

as long as the only container inside the dictionary is a list like {'a': [1,2,3..]} then this will work. Or you can just add a simple check like the function below will show.


def serialize(dct):  # this is the sub {'a': [1,2,3]} dictionary
    tmp = []
    for value in dct.values():
        if type(value) == list:
            tmp.append(tuple(value))
        else:
            tmp.append(value)
    return tuple(tmp)

def clean_up(lst):
    seen = set()
    clean = []
    for dct in lst:
        # grabs the 1..1000 key inside the primary dictionary
        # assuming there is only 1 key being the "id" or the counter etc...
        key = list(dct.keys())[0] 
        serialized = serialize(dct[key])
        if serialized not in seen:
            seen.add(serialized)
            clean.append(dct)
    return clean

So the function serialize grabs the nested dictionary and creates a simple tuple from the contents. This is then checked if its in the set "seen" to verify its unique.

benchmarks

generate a data set using some random values just because

lst = []
for i in range(1,1000):
    dct = {
        i: {
            random.choice(string.ascii_letters): [n for n in range(random.randint(0,i))], 
            random.choice(string.ascii_letters): random.randint(0,i)
        }
    }
    lst.append(dct)

Running the benchmarks:


%timeit clean_up(lst)
3.25 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit jdhesa(lst)
126 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As seen the function clean_up is significantly faster but simpler (not necessarily a good thing) in its implementation checks.

Upvotes: 2

javidcf
javidcf

Reputation: 59701

You can maybe use a function like this to turn your objects into something hashable:

def make_hashable(o):
    if isinstance(o, dict):
        return frozenset((k, make_hashable(v)) for k, v in o.items())
    elif isinstance(o, list):
        return tuple(make_hashable(elem) for elem in o)
    elif isinstance(o, set):
        return frozenset(make_hashable(elem) for elem in o)
    else:
        return o

Then you keep a set of seen objects and keep only the keys of each dictionary containing objects that you did not see before:

lst = [
    {1: {'a':[1,2,3], 'b': 4}},
    {2: {'a':[4,5,6], 'd': 5}},
    {3: {'a':[1,2,3], 'b': 4}},
]

seen = set()
result_keys = []
for elem in lst:
    keep_keys = []
    for k, v in elem.items():
        v_hashable = make_hashable(v)
        if v_hashable not in seen:
            seen.add(v_hashable)
            keep_keys.append(k)
    result_keys.append(keep_keys)
result = [{k: elem[k] for k in keys} for elem, keys in zip(lst, result_keys) if keys]
print(result)
# [{1: {'a': [1, 2, 3], 'b': 4}}, {2: {'a': [4, 5, 6], 'd': 5}}]

Note that, as blhsing notes in the comments, this has some limitations, such as considering (1, 2) and [1, 2] equals, as well as {1: 2} and {(1, 2)}. Also, some types may not be convertible to an equivalent hashable type.

EDIT: As a_guest suggests, you can work around the type ambiguity by returning the type itself along with the hashable object in make_hashable:

def make_hashable(o):
    t = type(o)
    if isinstance(o, dict):
        o = frozenset((k, make_hashable(v)) for k, v in o.items())
    elif isinstance(o, list):
        o = tuple(make_hashable(elem) for elem in o)
    elif isinstance(o, set):
        o = frozenset(make_hashable(elem) for elem in o)
    return t, o

If you don't need to look into the hashable object, this will easily provide strict type comparison. Note in this case even things like {1, 2} and frozenset({1, 2}) will be different.

Upvotes: 3

blhsing
blhsing

Reputation: 106553

To get around the limitation of @jdehesa's solution, where [1, 2] would be treated as a duplicate as (1, 2), you can preserve the data types by using pprint.pformat instead to serialize the data structure. Since pprint.pformat sorts dicts by keys and sets by items, {1: 2, 3: 4} would be properly considered the same as {3: 4, 1: 2}, but [1, 2] would not be considered a duplicate to (1, 2):

from pprint import pformat
lst = [
    {1: {'a': [1, 2, 3], 'b': 4}},
    {2: {'a': [4, 5, 6], 'd': 5}},
    {3: {'b': 4, 'a': [1, 2, 3]}},
    {4: {'a': (4, 5, 6), 'd': 5}},
]
seen = set()
output = []
for d in lst:
    for k, v in d.items():
        signature = pformat(v)
        if signature not in seen:
            seen.add(signature)
            output.append({k: v})

output becomes:

[{1: {'a': [1, 2, 3], 'b': 4}},
 {2: {'a': [4, 5, 6], 'd': 5}},
 {4: {'a': (4, 5, 6), 'd': 5}}]

Upvotes: 5

Related Questions