Wells

Reputation: 10969

Slow Python code seems like a suitable fit for itertools: how to optimize?

I have this:

entity_key = 'pid'
data = [ { ... }, { ... } ]
entities = list(set([ row[entity_key] for row in data ]))
parsed = []
total_keys = ['a','b','c']
for entity_id in entities:
    entity_rows = [ row for row in data if row[entity_key] == entity_id ]
    totals = { key: sum(filter(None, [ row.get(key) for row in entity_rows ])) for key in total_keys }
    totals[entity_key] = entity_id
    parsed.append(totals)
return parsed

In my scenario, data is about 30,000 items, so it's fairly large.

Each item is a dict; each dict contains the identifier pid and a numeric value for each key defined in total_keys, e.g. { 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 }

As you can see, the code returns one row per unique pid, with the columns listed in total_keys summed. There may be about 800-1000 unique pid values, so parsed ends up being about 800-1000 items.

It's slow. I've tried to rewrite this using itertools.groupby, but it doesn't seem like the best fit. Is there some magic I'm missing?

Upvotes: 1

Views: 181

Answers (3)

Padraic Cunningham

Reputation: 180401

Make a single dict using the pids as the outer keys:

entity_key = 'pid'

data = [ { 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 }, { 'pid': 5012, 'a': 3, 'b': 20, 'c': 33 },
         { 'pid': 5011, 'a': 3, 'b': 20, 'c': 33 }, { 'pid': 5012, 'a': 3, 'b': 20, 'c': 33 } ]


from collections import defaultdict

totals = ["a", "b", "c"]

dfd = defaultdict(lambda: {"a": 0, "b": 0, "c": 0})  # fresh zeroed totals for each new pid
for d in data:
    for k in d.keys() & totals:  # intersect with just the keys we want to sum
        dfd[d["pid"]][k] += d[k]

The output will be all your pids grouped, with any a, b, or c key values summed:

defaultdict(<function <lambda> at 0x7f2cf93ed2f0>,
 {5011: {'a': 6, 'c': 66, 'b': 40}, 5012: {'a': 6, 'c': 66, 'b': 40}})

For Python 2 you need to use d.viewkeys() & totals instead of d.keys() & totals.
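If you need the same shape as the original parsed list (one flat dict per pid), a minimal sketch of the conversion, assuming dfd was built as above:

parsed = [dict(totals, pid=pid) for pid, totals in dfd.items()]
# e.g. [{'a': 6, 'b': 40, 'c': 66, 'pid': 5011}, {'a': 6, 'b': 40, 'c': 66, 'pid': 5012}]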

If your data were already grouped by pid, you could yield one group at a time:

from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def yield_d(data, group_key, keys):
    # data must already be grouped by group_key, since groupby only merges consecutive rows
    for k, v in groupby(data, key=itemgetter(group_key)):
        d = defaultdict(lambda: dict.fromkeys(keys, 0))
        for dct in v:
            for _k in dct.keys() & keys:
                d[k][_k] += dct[_k]
        yield d
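As a rough usage sketch (the sorted call and variable names are just for illustration), you would sort by pid first, since groupby only merges consecutive rows with the same key:

grouped = sorted(data, key=itemgetter("pid"))  # groupby needs consecutive runs per pid
for group in yield_d(grouped, "pid", ["a", "b", "c"]):
    print(dict(group))
# {5011: {'a': 6, 'b': 40, 'c': 66}}
# {5012: {'a': 6, 'b': 40, 'c': 66}}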

Upvotes: 2

recursive

Reputation: 86074

You've got an O(n^2) algorithm because the list comprehension inside the loop re-scans all of data for every entity_id. If you build an indexed data structure first, you can improve performance a lot.

entity_key = 'pid'
data = [ { ... }, { ... } ]
totals_keys = ['a','b','c']
parsed = []

indexed = {}
for row in data: # construct a map of data rows, indexed by id
    entity_id = row[entity_key]
    indexed.setdefault(entity_id, []) # start with an empty list
    indexed[entity_id].append(row)

for entity_id in indexed:
    entity_rows = indexed[entity_id] # fast lookup of matching ids
    totals = { key: sum(row[key] for row in entity_rows if key in row) for key in totals_keys }
    totals[entity_key] = entity_id
    parsed.append(totals)

return parsed
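The index-building step can also be written with collections.defaultdict, which avoids the repeated setdefault call; a small sketch, equivalent to the loop above:

from collections import defaultdict

indexed = defaultdict(list)
for row in data:
    indexed[row[entity_key]].append(row)  # one lookup per row, lists created on demand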

Upvotes: 2

Severin Pappadeux

Reputation: 20080

Did you try Pandas?

If you have pid as a column, it looks like a good match for:

import pandas as pd

df = pd.DataFrame(data)  # data is the list of dicts from the question

df.groupby(['pid']).sum()
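If you then need the original list-of-dicts shape back, a rough sketch (assuming data is the list of dicts from the question; as_index=False keeps pid as a column rather than the index):

import pandas as pd

parsed = pd.DataFrame(data).groupby('pid', as_index=False).sum().to_dict('records')
# [{'pid': 5011, 'a': 6, 'b': 40, 'c': 66}, ...]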

Upvotes: 1
