Kyle Getrost

Reputation: 737

Group by and aggregate the values of a list of dictionaries in Python

I'm trying to write an elegant function that groups a list of dictionaries by a key and aggregates (sums) the values of like keys.

Example:

my_dataset = [  
    {
        'date': datetime.date(2013, 1, 1),
        'id': 99,
        'value1': 10,
        'value2': 10
    },
    {
        'date': datetime.date(2013, 1, 1),
        'id': 98,
        'value1': 10,
        'value2': 10
    },
    {
        'date': datetime.date(2013, 1, 2),
        'id': 99,
        'value1': 10,
        'value2': 10
    }
]

group_and_sum_dataset(my_dataset, 'date', ['value1', 'value2'])

"""
Should return:
[
    {
        'date': datetime.date(2013, 1, 1),
        'value1': 20,
        'value2': 20
    },
    {
        'date': datetime.date(2013, 1, 2),
        'value1': 10,
        'value2': 10
    }
]
"""

I've tried doing this using itertools for the groupby and summing each like-key value pair, but am missing something here. Here's what my function currently looks like:

def group_and_sum_dataset(dataset, group_by_key, sum_value_keys):
    keyfunc = operator.itemgetter(group_by_key)
    dataset.sort(key=keyfunc)
    new_dataset = []
    for key, index in itertools.groupby(dataset, keyfunc):
        d = {group_by_key: key}
        d.update({k:sum([item[k] for item in index]) for k in sum_value_keys})
        new_dataset.append(d)
    return new_dataset

Upvotes: 25

Views: 39916

Answers (3)

pylang

Reputation: 44465

Here's an approach using more_itertools where you simply focus on how to construct output.

Given

import datetime
import collections as ct

import more_itertools as mit


dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10}
]

Code

# Step 1: Build helper functions    
kfunc = lambda d: d["date"]
vfunc = lambda d: {k:v for k, v in d.items() if k.startswith("val")}
rfunc = lambda lst: sum((ct.Counter(d) for d in lst), ct.Counter())

# Step 2: Build a dict    
reduced = mit.map_reduce(dataset, keyfunc=kfunc, valuefunc=vfunc, reducefunc=rfunc)
reduced

Output

defaultdict(None,
            {datetime.date(2013, 1, 1): Counter({'value1': 20, 'value2': 20}),
             datetime.date(2013, 1, 2): Counter({'value1': 10, 'value2': 10})})

The items are grouped by date and pertinent values are reduced as Counters.


Details

Steps

  1. Build helper functions to customize construction of keys, values and reduced values in the final defaultdict. Here we want to:
    • group by date (kfunc)
    • build dicts keeping only the "value*" entries (vfunc)
    • aggregate the dicts (rfunc) by converting them to collections.Counters and summing them. See an equivalent rfunc below+.
  2. Pass the helper functions to more_itertools.map_reduce.
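To make the steps above concrete, here is a rough standard-library sketch of the same group/transform/reduce pipeline. The name map_reduce_sketch is hypothetical (it is not part of more_itertools), and it returns a plain dict rather than the defaultdict that mit.map_reduce gives back.

```python
import collections as ct
import datetime


def map_reduce_sketch(iterable, keyfunc, valuefunc, reducefunc):
    """Group transformed items by key, then reduce each group (illustrative only)."""
    groups = ct.defaultdict(list)
    for item in iterable:
        # Map: file each transformed item under its group key.
        groups[keyfunc(item)].append(valuefunc(item))
    # Reduce: collapse each group's list of values into a single result.
    return {key: reducefunc(values) for key, values in groups.items()}


dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10},
]

kfunc = lambda d: d["date"]
vfunc = lambda d: {k: v for k, v in d.items() if k.startswith("val")}
rfunc = lambda lst: sum((ct.Counter(d) for d in lst), ct.Counter())

result = map_reduce_sketch(dataset, kfunc, vfunc, rfunc)
print(result)
```

The intermediate lists are what rfunc receives, which is why it is written to take a list of dicts.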

Simple Groupby

... say in that example you wanted to group by id and date?

No problem.

>>> kfunc2 = lambda d: (d["date"], d["id"])
>>> mit.map_reduce(dataset, keyfunc=kfunc2, valuefunc=vfunc, reducefunc=rfunc)
defaultdict(None,
            {(datetime.date(2013, 1, 1),
              99): Counter({'value1': 10, 'value2': 10}),
             (datetime.date(2013, 1, 1),
              98): Counter({'value1': 10, 'value2': 10}),
             (datetime.date(2013, 1, 2),
              99): Counter({'value1': 10, 'value2': 10})})

Customized Output

While the resulting data structure clearly and concisely presents the outcome, the OP's expected output can be rebuilt as a simple list of dicts:

>>> [{**dict(date=k), **v} for k, v in reduced.items()]
[{'date': datetime.date(2013, 1, 1), 'value1': 20, 'value2': 20},
 {'date': datetime.date(2013, 1, 2), 'value1': 10, 'value2': 10}]

For more on map_reduce, see the docs. Install via > pip install more_itertools.

+An equivalent reducing function:

import typing


def rfunc(lst: typing.List[dict]) -> ct.Counter:
    """Return reduced mappings from map-reduce values."""
    c = ct.Counter()
    for d in lst:
        c += ct.Counter(d)
    return c
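One caveat worth knowing when reducing with Counter addition: + and += keep only strictly positive totals, while Counter.update() keeps zero and negative ones. With the all-positive sample data this makes no difference, but the small sketch below (using made-up rows) shows where the two diverge.

```python
import collections as ct

# Hypothetical rows where one key sums to zero.
rows = [{"value1": 10, "value2": 5}, {"value1": -10, "value2": 5}]

# Adding Counters with + (or +=) discards totals that are not positive.
added = sum((ct.Counter(d) for d in rows), ct.Counter())

# Counter.update() adds in place and keeps zero and negative totals.
updated = ct.Counter()
for d in rows:
    updated.update(d)

print(added)    # Counter({'value2': 10})
print(updated)  # Counter({'value2': 10, 'value1': 0})
```

If the summed values can be zero or negative, update() is probably the safer choice for the reducing function.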

Upvotes: 3

Kyle Getrost

Reputation: 737

Thanks, I forgot about Counter. I still wanted to maintain the output format and sorting of my returned dataset, so here's what my final function looks like:

from collections import Counter, defaultdict

def group_and_sum_dataset(dataset, group_by_key, sum_value_keys):

    container = defaultdict(Counter)

    for item in dataset:
        key = item[group_by_key]
        values = {k:item[k] for k in sum_value_keys}
        container[key].update(values)

    # list() is needed on Python 3, where items() returns a view, not a list
    new_dataset = [
        dict([(group_by_key, item[0])] + list(item[1].items()))
            for item in container.items()
    ]
    new_dataset.sort(key=lambda item: item[group_by_key])

    return new_dataset

Upvotes: 5

Ashwini Chaudhary

Reputation: 250881

You can use collections.Counter and collections.defaultdict.

Using a dict, this can be done in O(N) time, whereas sorting requires O(N log N).

from collections import defaultdict, Counter
def solve(dataset, group_by_key, sum_value_keys):
    dic = defaultdict(Counter)
    for item in dataset:
        key = item[group_by_key]
        vals = {k:item[k] for k in sum_value_keys}
        dic[key].update(vals)
    return dic

>>> d = solve(my_dataset, 'date', ['value1', 'value2'])
>>> d
defaultdict(<class 'collections.Counter'>,
{
 datetime.date(2013, 1, 2): Counter({'value2': 10, 'value1': 10}),
 datetime.date(2013, 1, 1): Counter({'value2': 20, 'value1': 20})
})

The advantage of Counter is that it automatically sums the values of matching keys:

Example:

>>> c = Counter(**{'value1': 10, 'value2': 5})
>>> c.update({'value1': 7, 'value2': 3})
>>> c
Counter({'value1': 17, 'value2': 8})
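For comparison, here is a sketch of the sort-based route with itertools.groupby. The name group_and_sum_sorted is made up for illustration. The detail to watch is that each group groupby yields is a one-shot iterator, so summing a second key over it sees no rows unless the group is copied to a list first.

```python
import datetime
import itertools
import operator


def group_and_sum_sorted(dataset, group_by_key, sum_value_keys):
    """Sort-then-groupby variant; O(N log N) because of the sort."""
    keyfunc = operator.itemgetter(group_by_key)
    result = []
    # sorted() leaves the caller's list untouched, unlike list.sort().
    for key, group in itertools.groupby(sorted(dataset, key=keyfunc), keyfunc):
        # Copy the one-shot group iterator so it can be traversed
        # once per summed key.
        rows = list(group)
        summed = {k: sum(row[k] for row in rows) for k in sum_value_keys}
        result.append({group_by_key: key, **summed})
    return result


my_dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10},
]

grouped = group_and_sum_sorted(my_dataset, "date", ["value1", "value2"])
print(grouped)
```

This returns a list of dicts already sorted by the grouping key, at the cost of the extra sort.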

Upvotes: 30
