Reputation: 1002
i need to create a dictionary from a large list of dict, removing all the duplicate dicts
the input list is something like:
input = [{'id': 1, 'value1': 'value1', 'value2': 'value2'},{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value3': 'value4'}]
and i want to create a dictionary like this, using "id" value as key for the new dict:
output = {
1: [{'id': 1, 'value1': 'value1', 'value2': 'value2'}]
2: [{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value3': 'value4'}]
}
my first try was:
output = {}
for el in input:
if el['id'] not in output or el not in output[el['id']]:
output.setdefault(el['id'], []).append(el)
and it actually works but it's super slow, len(input) is roughly 20k/30k items
is there any other way to do this a little bit faster?
thanks!
Upvotes: 2
Views: 617
Reputation: 1121584
Use a separate set
to track of seen dictionaries; you'll have to convert them to hashable representations first:
seen = set()
drepr = lambda d: tuple(sorted(d.items()))
output = {}
for el in input:
if drepr(el) not in seen:
output.setdefault(el['id'], []).append(el)
seen.add(drepr(el))
You can speed it up a little by using a collections.defaultdict
object as that'll materialize lists without having to look up a method and push a stack frame to call it:
from collections import defaultdict
seen = set()
drepr = lambda d: tuple(sorted(d.items()))
output = defaultdict(list)
for el in input:
if drepr(el) not in seen:
output[el['id']].append(el)
seen.add(drepr(el))
Demo:
>>> input = [{'id': 1, 'value1': 'value1', 'value2': 'value2'},{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value3': 'value4'}]
>>> seen = set()
>>> drepr = lambda d: tuple(sorted(d.items()))
>>> output = {}
>>> for el in input:
... if drepr(el) not in seen:
... output.setdefault(el['id'], []).append(el)
... seen.add(drepr(el))
...
>>> from pprint import pprint
>>> pprint(output)
{1: [{'id': 1, 'value1': 'value1', 'value2': 'value2'}],
2: [{'id': 2, 'value1': 'value1', 'value2': 'value2'},
{'id': 2, 'value1': 'value1', 'value3': 'value4'}]}
Upvotes: 2
Reputation: 2537
Create a new output list with the duplicate dicts and then make each element in the new list a unique set of dicts
input=[{'id': 1, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'value3': 'value4', 'id': 2, 'value1': 'value1'}, {'id': 2, 'value1': 'value1', 'value2': 'value2'}]
output={}
for el in input: output.setdefault(el['id'], []).append(el)
for key in output:
output[key]=map(dict, set(tuple(sorted(d.items())) for d in output[key]))
print output
[{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'value3': 'value4', 'id': 2, 'value1': 'value1'}]
Upvotes: 0