Reputation: 13965
I have a pandas DataFrame that represents time series data. I have a column called DTDate (which is a datetime date) and a column called line_code (which is the unit of observation - it happens to be a production line in a factory). I have many columns of data, but for the sake of this question let's imagine there are only three:
workers - the number of workers on the production line.
item - the name of the item being produced on the production line.
output - the amount of output for the item on the production line.
Some days each line only produces one item, sometimes multiple items. Therefore sometimes there is one observation per DTDate/line_code, and sometimes there are multiple observations. I need to collapse the data set into a single observation per DTDate/line_code.
Here is the rub - we do not yet know how we want to aggregate the data, and as such currently I just need to achieve a structure where we can try multiple aggregation methods. When the line only produces one item, I simply need to preserve the data row as it is. When the line produces more than one item on a given DTDate I want to collapse the observations into a single observation according to the following:
workers: if the number of workers is equal across the DTDate/line_code observations, then the single value of workers is carried forward to the collapsed set. If the number of workers is not equal, then create a list object containing all the values of workers within the DTDate/line_code observations.
item: a list object of the items is carried forward to the collapsed set.
output: a list object of the output is carried forward to the collapsed set.
By having list items in the collapsed set, I am leaving the structure flexible enough to allow myself to then experiment with different aggregation methods for each column as and when I am instructed to that effect.
So far I have grouped the data as follows:
import pandas as pd
import numpy as np
from pandas import DataFrame
DF = DataFrame(mydata, columns = ['DTDate', 'line_code', \
'workers', 'item', 'output'])
DFGrouped = DF.groupby(['DTDate', 'line_code'])
Now I realise that what I am wanting to do is the following:
DFAggregated = DFGrouped.agg({'workers' : myfunc1, \
                              'item' : myfunc2, \
                              'output' : myfunc2})
where: myfunc1 evaluates whether all values in the group of the specified column are equal, and returns a single value if they are or a list of each value if they are not.
myfunc2 returns a list of all the values in the group within the specified column.
My issue is that I have no idea how to write those functions, largely because I am not clear on how to iterate through the group-specific index/rows. I have read the python documentation on groupby etc. but found it not very useful. I realise I should be posting more code of what I've tried, but I am finding it hard to even get off the ground here. Any pointers would be greatly appreciated.
(now extending to give illustrative function code)
Incidentally, I would expect myfunc1 and myfunc2 to look something like this:
def myfunc1(ColName):
    if len(set([DFGroup[ColName][x] for x in DFGroup.index])) == 1:
        return DFGroup[ColName].max()
    else:
        return [DFGroup[ColName][x] for x in DFGroup.index]

def myfunc2(ColName):
    return [DFGroup[ColName][x] for x in DFGroup.index]
As you can see, I am not sure how to refer to the group index etc.
Upvotes: 0
Views: 4087
Reputation: 251408
Each aggregating function (the functions you pass to agg) is passed the column it aggregates over, as a Series. So your myfunc2 is just lambda x: list(x.unique()). Your myfunc1 would be:
def collapse(x):
    uniq = x.unique()
    if len(uniq) == 1:
        return uniq[0]
    else:
        return list(uniq)
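Putting it together, a quick sketch with invented toy data (column names follow your question; the values are made up purely for illustration):

```python
import pandas as pd

def collapse(x):
    # one unique value in the group -> scalar; otherwise a list of the unique values
    uniq = x.unique()
    if len(uniq) == 1:
        return uniq[0]
    else:
        return list(uniq)

df = pd.DataFrame({
    'DTDate':    ['2014-01-01', '2014-01-01', '2014-01-02'],
    'line_code': ['A', 'A', 'A'],
    'workers':   [5, 5, 6],
    'item':      ['widget', 'gadget', 'widget'],
    'output':    [100, 80, 90],
})

collapsed = df.groupby(['DTDate', 'line_code']).agg({
    'workers': collapse,
    'item':    lambda x: list(x.unique()),
    'output':  lambda x: list(x.unique()),
}).reset_index()
```

Here the first group has equal worker counts, so 'workers' collapses to the scalar 5, while 'item' and 'output' become the lists ['widget', 'gadget'] and [100, 80].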
However, you may find the results somewhat awkward to work with. At the least, I would think you might want to just always return a list (that is, forget myfunc1 and always use myfunc2). You will find it awkward to deal with a column where some of the values are individual scalars and others are lists.
Also, you might instead want to look at using apply, which lets you return an entire DataFrame. This way you could, instead of collapsing the items into a list, actually return a new grouped table with one row for each unique value in the source column.
Upvotes: 2