Reputation: 360
I have a list of dictionaries, but some of them are duplicates and I want to remove them (the duplicates).
The keys of the dict are a sequential number.
An example is the following:
[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
{3: {a:[1,2,3], b: 4}},
.....,
{1000: {a:[2,5,1], b: 99}},
]
Considering the previous example I would like to obtain:
[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99}},
]
In fact the dictionaries with keys 1 and 3 are identically in their values.
I tried with a set, but since dict is a not hashable type I am not able to do so.
How can i fix the problem?
EDIT
In my case the number of items inside the dict is not fix, so I can have:
[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99, c:["a","v"]}},
]
where the dict with keys 100 has three elements inside insted of two as the other shown.
Upvotes: 5
Views: 10985
Reputation: 16958
You can define a custom hash of your dictionaries by subclassing dict
:
class MyData(dict):
def __hash__(self):
return hash((k, repr(v)) for k, v in self.items())
l = [
{1: {'a': [1, 2, 3], 'b': 4}},
{2: {'a': [4, 5, 6], 'd': 5}},
{3: {'b': 4, 'a': [1, 2, 3]}},
{4: {'a': (4, 5, 6), 'd': 5}},
]
s = set([MyData(*d.values()) for d in l])
This is assuming that all the dictionaries in the list have only one key-value pair.
Upvotes: 3
Reputation: 1858
I don't know how big is your list and how many duplicates are in it, but, just in case, here is the basic solution.
It might be not efficient, but you don't have to worry about elements type:
import datetime as dt
data = [
{1: {"b": 4, "a":[1,2,3]}},
{2: {"a":[4,5,6], "d": 5}},
{3: {"a":[1,2,3], "b": 4}},
{4: {'a': dt.datetime(2019, 5, 10), 'd': set([4])}},
{5: {'a': dt.datetime(2019, 5, 10), 'd': set([4])}},
{6: {"a":[2,5,1], "b": 99}},
{7: {"a":[5,2,1], "b": 99}},
{8: {"a":(5,2,1), "b": 99}}
]
seen = []
output = []
for d in data:
for k, v in d.items():
if v not in seen:
seen.append(v)
output.append({k:v})
>>> print(output)
[{1: {'a': [1, 2, 3], 'b': 4}},
{2: {'a': [4, 5, 6], 'd': 5}},
{4: {'a': datetime.datetime(2019, 5, 10, 0, 0), 'd': {4}}},
{6: {'a': [2, 5, 1], 'b': 99}},
{7: {'a': [5, 2, 1], 'b': 99}},
{8: {'a': (5, 2, 1), 'b': 99}}]
Upvotes: 0
Reputation: 535
This is the simplest solution I've been able to come up with assuming the nested dictionary like
{1: {'a': [1,2,3,5,79], 'b': 234 ...}}
as long as the only container inside the dictionary is a list like {'a': [1,2,3..]}
then this will work. Or you can just add a simple check like the function below will show.
def serialize(dct): # this is the sub {'a': [1,2,3]} dictionary
tmp = []
for value in dct.values():
if type(value) == list:
tmp.append(tuple(value))
else:
tmp.append(value)
return tuple(tmp)
def clean_up(lst):
seen = set()
clean = []
for dct in lst:
# grabs the 1..1000 key inside the primary dictionary
# assuming there is only 1 key being the "id" or the counter etc...
key = list(dct.keys())[0]
serialized = serialize(dct[key])
if serialized not in seen:
seen.add(serialized)
clean.append(dct)
return clean
So the function serialize
grabs the nested dictionary and creates a simple tuple from the contents. This is then checked if its in the set
"seen" to verify its unique.
generate a data set using some random values just because
lst = []
for i in range(1,1000):
dct = {
i: {
random.choice(string.ascii_letters): [n for n in range(random.randint(0,i))],
random.choice(string.ascii_letters): random.randint(0,i)
}
}
lst.append(dct)
Running the benchmarks:
%timeit clean_up(lst)
3.25 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit jdhesa(lst)
126 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As seen the function clean_up
is significantly faster but simpler (not necessarily a good thing) in its implementation checks.
Upvotes: 2
Reputation: 59701
You can maybe use a function like this to turn your objects into something hashable:
def make_hashable(o):
if isinstance(o, dict):
return frozenset((k, make_hashable(v)) for k, v in o.items())
elif isinstance(o, list):
return tuple(make_hashable(elem) for elem in o)
elif isinstance(o, set):
return frozenset(make_hashable(elem) for elem in o)
else:
return o
Then you keep a set of seen objects and keep only the keys of each dictionary containing objects that you did not see before:
lst = [
{1: {'a':[1,2,3], 'b': 4}},
{2: {'a':[4,5,6], 'd': 5}},
{3: {'a':[1,2,3], 'b': 4}},
]
seen = set()
result_keys = []
for elem in lst:
keep_keys = []
for k, v in elem.items():
v_hashable = make_hashable(v)
if v_hashable not in seen:
seen.add(v_hashable)
keep_keys.append(k)
result_keys.append(keep_keys)
result = [{k: elem[k] for k in keys} for elem, keys in zip(lst, result_keys) if keys]
print(result)
# [{1: {'a': [1, 2, 3], 'b': 4}}, {2: {'a': [4, 5, 6], 'd': 5}}]
Note that, as blhsing notes in the comments, this has some limitations, such as considering (1, 2)
and [1, 2]
equals, as well as {1: 2}
and {(1, 2)}
. Also, some types may not be convertible to an equivalent hashable type.
EDIT: As a_guest suggests, you can work around the type ambiguity by returning the type itself along with the hashable object in make_hashable
:
def make_hashable(o):
t = type(o)
if isinstance(o, dict):
o = frozenset((k, make_hashable(v)) for k, v in o.items())
elif isinstance(o, list):
o = tuple(make_hashable(elem) for elem in o)
elif isinstance(o, set):
o = frozenset(make_hashable(elem) for elem in o)
return t, o
If you don't need to look into the hashable object, this will easily provide strict type comparison. Note in this case even things like {1, 2}
and frozenset({1, 2})
will be different.
Upvotes: 3
Reputation: 106553
To get around the limitation of @jdehesa's solution, where [1, 2]
would be treated as a duplicate as (1, 2)
, you can preserve the data types by using pprint.pformat
instead to serialize the data structure. Since pprint.pformat
sorts dicts by keys and sets by items, {1: 2, 3: 4}
would be properly considered the same as {3: 4, 1: 2}
, but [1, 2]
would not be considered a duplicate to (1, 2)
:
from pprint import pformat
lst = [
{1: {'a': [1, 2, 3], 'b': 4}},
{2: {'a': [4, 5, 6], 'd': 5}},
{3: {'b': 4, 'a': [1, 2, 3]}},
{4: {'a': (4, 5, 6), 'd': 5}},
]
seen = set()
output = []
for d in lst:
for k, v in d.items():
signature = pformat(v)
if signature not in seen:
seen.add(signature)
output.append({k: v})
output
becomes:
[{1: {'a': [1, 2, 3], 'b': 4}},
{2: {'a': [4, 5, 6], 'd': 5}},
{4: {'a': (4, 5, 6), 'd': 5}}]
Upvotes: 5