6160
6160

Reputation: 1002

Fastest way to create dict from a large list of Dict

i need to create a dictionary from a large list of dict, removing all the duplicate dicts

the input list is something like:

input = [{'id': 1, 'value1': 'value1', 'value2': 'value2'},{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value3': 'value4'}]

and i want to create a dictionary like this, using "id" value as key for the new dict:

output = {
    1: [{'id': 1, 'value1': 'value1', 'value2': 'value2'}]
    2: [{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value3': 'value4'}]
}

my first try was:

    output = {}
    for el in input:
        if el['id'] not in output or el not in output[el['id']]:
            output.setdefault(el['id'], []).append(el)

and it actually works but it's super slow, len(input) is roughly 20k/30k items

is there any other way to do this a little bit faster?

thanks!

Upvotes: 2

Views: 617

Answers (2)

Martijn Pieters
Martijn Pieters

Reputation: 1121584

Use a separate set to track of seen dictionaries; you'll have to convert them to hashable representations first:

seen = set()
drepr = lambda d: tuple(sorted(d.items()))

output = {}
for el in input:
    if drepr(el) not in seen:
        output.setdefault(el['id'], []).append(el)
        seen.add(drepr(el))

You can speed it up a little by using a collections.defaultdict object as that'll materialize lists without having to look up a method and push a stack frame to call it:

from collections import defaultdict

seen = set()
drepr = lambda d: tuple(sorted(d.items()))

output = defaultdict(list)

for el in input:
    if drepr(el) not in seen:
        output[el['id']].append(el)
        seen.add(drepr(el))

Demo:

>>> input = [{'id': 1, 'value1': 'value1', 'value2': 'value2'},{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value3': 'value4'}]
>>> seen = set()
>>> drepr = lambda d: tuple(sorted(d.items()))
>>> output = {}
>>> for el in input:
...     if drepr(el) not in seen:
...         output.setdefault(el['id'], []).append(el)
...         seen.add(drepr(el))
... 
>>> from pprint import pprint
>>> pprint(output)
{1: [{'id': 1, 'value1': 'value1', 'value2': 'value2'}],
 2: [{'id': 2, 'value1': 'value1', 'value2': 'value2'},
     {'id': 2, 'value1': 'value1', 'value3': 'value4'}]}

Upvotes: 2

Prem Anand
Prem Anand

Reputation: 2537

Create a new output list with the duplicate dicts and then make each element in the new list a unique set of dicts

input=[{'id': 1, 'value1': 'value1', 'value2': 'value2'}, {'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'value3': 'value4', 'id': 2, 'value1': 'value1'}, {'id': 2, 'value1': 'value1', 'value2': 'value2'}]

output={}
for el in input: output.setdefault(el['id'], []).append(el)
for key in output:
    output[key]=map(dict, set(tuple(sorted(d.items())) for d in output[key]))

print output

[{'id': 2, 'value1': 'value1', 'value2': 'value2'}, {'value3': 'value4', 'id': 2, 'value1': 'value1'}]

Upvotes: 0

Related Questions