Woody Pride

Reputation: 13965

Collapsing Data using GroupBy Pandas

I have a pandas data frame that represents time series data. I have a column called DTDate (which is a date time date) and a column called line_code (which is the unit of observation - it happens to be a production line in a factory). I have many columns of data, but for the sake of this question lets imagine there are only three:

workers - the number of workers on the production line.
item - the name of the item being produced on the production line.
output - the amount of output for the item on the production line.

Some days each line only produces one item, sometimes multiple items. Therefore sometimes there is one observation per DTDate/line_code, and sometimes there are multiple observations. I need to collapse the data set into a single observation per DTDate/line_code.

Here is the rub - we do not yet know how we want to aggregate the data, so for now I just need to achieve a structure that lets us try multiple aggregation methods. When the line only produces one item, I simply need to preserve the data row as it is. When the line produces more than one item on a given DTDate, I want to collapse the observations into a single observation according to the following:

workers: if the number of workers is equal across the DTDate/line_code observations, a single value of workers is carried forward to the collapsed set. If the number of workers is not equal, create a list object containing all the values of workers within the DTDate/line_code observations.
item: a list object of the items is carried forward to the collapsed set.
output: a list object of the output values is carried forward to the collapsed set.

By having list items in the collapsed set, I am leaving the structure flexible enough to allow myself to then experiment with different aggregation methods for each column as and when I am instructed to that effect.
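To make this concrete, here is a small example of the raw data and the collapsed structure I am aiming for (the dates, items, and numbers are invented for illustration):

```python
import pandas as pd

# Invented sample rows: on 2014-01-01 line A runs two items,
# line B runs one.
mydata = [
    ('2014-01-01', 'A', 5, 'widget', 100),
    ('2014-01-01', 'A', 5, 'gadget', 40),
    ('2014-01-01', 'B', 3, 'widget', 80),
]
DF = pd.DataFrame(mydata, columns=['DTDate', 'line_code',
                                   'workers', 'item', 'output'])

# Desired collapsed set, one row per DTDate/line_code:
#   DTDate      line_code  workers  item                  output
#   2014-01-01  A          5        ['widget', 'gadget']  [100, 40]
#   2014-01-01  B          3        widget                80
```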

So far I have grouped the data as follows:

import pandas as pd
import numpy as np
from pandas import DataFrame
DF = DataFrame(mydata, columns = ['DTDate', 'line_code', \
                                  'workers', 'item', 'output'])

DFGrouped = DF.groupby(['DTDate', 'line_code'])

Now I realise that what I am wanting to do is the following:

DFAggregated = DFGrouped.agg({'workers': myfunc1,
                              'item': myfunc2,
                              'output': myfunc2})

where: myfunc1 checks whether all values in the group for the specified column are equal, returning a single value if they are or a list of all the values if they are not.

myfunc2 returns a list of all the values in the group within the specified column.

My issue is that I have no idea how to write those functions, largely because I am not clear on how to iterate through the group-specific index/rows. I have read the python documentation on groupby etc. but found it not very useful. I realise I should be posting more code of what I've tried, but I am finding it hard to even get off the ground here. Any pointers would be greatly appreciated.

(now extending to give illustrative function code)

Incidentally, I would expect myfunc1 and myfunc2 to look something like this:

def myfunc1(ColName):
    if len(set([DFGroup[ColName][x] for x in DFGroup.index])) == 1:
        return DFGroup[ColName].max()
    else:
        return [DFGroup[ColName][x] for x in DFGroup.index]

def myfunc2(ColName):
    return [DFGroup[ColName][x] for x in DFGroup.index]

as you can see I am not sure how to refer to the group index etc.

Upvotes: 0

Views: 4087

Answers (1)

BrenBarn

Reputation: 251408

Each aggregating function (the functions you pass to agg) is passed the column it aggregates over, as a Series. So your myfunc2 is just lambda x: list(x), and your myfunc1 would be:

def collapse(x):
    uniq = x.unique()
    if len(uniq) == 1:
        return uniq[0]
    else:
        return list(uniq)
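For example, with made-up data in the shape your question describes:

```python
import pandas as pd

def collapse(x):
    # x is the Series of values for one column within one group
    uniq = x.unique()
    if len(uniq) == 1:
        return uniq[0]
    else:
        return list(uniq)

# Invented sample data: line A runs two items, line B runs one
DF = pd.DataFrame({
    'DTDate':    ['2014-01-01', '2014-01-01', '2014-01-01'],
    'line_code': ['A', 'A', 'B'],
    'workers':   [5, 5, 3],
    'item':      ['widget', 'gadget', 'widget'],
    'output':    [100, 40, 80],
})

result = DF.groupby(['DTDate', 'line_code']).agg(
    {'workers': collapse, 'item': collapse, 'output': collapse})
# line A collapses item/output to lists; line B keeps scalars
```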

However, you may find the results somewhat awkward to work with: a column where some of the values are individual scalars and others are lists is hard to process uniformly. At the least, I would think you might want to just always return a list (that is, forget myfunc1 and always use myfunc2).

Also, you might instead want to look at using apply, which lets you return an entire DataFrame. This way you could, instead of collapsing the items into a list, actually return a new grouped table with one row for each unique value in the source column.
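A sketch of that approach (sample data invented; the drop_duplicates step is just a stand-in for whatever per-group reshaping you actually need):

```python
import pandas as pd

# Invented data: line A records 'widget' twice on the same day
DF = pd.DataFrame({
    'DTDate':    ['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-01'],
    'line_code': ['A', 'A', 'A', 'B'],
    'workers':   [5, 5, 5, 3],
    'item':      ['widget', 'widget', 'gadget', 'widget'],
    'output':    [60, 40, 40, 80],
})

# apply may return a DataFrame per group: here, one row per
# unique item within each DTDate/line_code group
def one_row_per_item(g):
    return g.drop_duplicates(subset='item')

tidy = DF.groupby(['DTDate', 'line_code'],
                  group_keys=False).apply(one_row_per_item)
```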

Upvotes: 2
