Reputation: 221
From a Pandas newbie: I have data that looks essentially like this -
import pandas as pd
data1 = pd.DataFrame({'Dir': ['E','E','W','W','E','W','W','E'],
                      'Bool': ['Y','N','Y','N','Y','N','Y','N'],
                      'Data': [4,5,6,7,8,9,10,11]},
                     index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000',
                                             '1/2/2001','1/3/2001','1/3/2001',
                                             '12/30/2000','12/30/2000']))
data1
Out[1]:
Bool Data Dir
2000-12-30 Y 4 E
2000-12-30 N 5 E
2000-12-30 Y 6 W
2001-01-02 N 7 W
2001-01-03 Y 8 E
2001-01-03 N 9 W
2000-12-30 Y 10 W
2000-12-30 N 11 E
And I want to group it by multiple levels and then take a cumsum():
running_sum = data1.groupby(['Bool','Dir']).cumsum()  # <- doesn't work
with output that would look something like:
Bool Dir Date running_sum
N E 2000-12-30 16
W 2001-01-02 7
2001-01-03 16
Y E 2000-12-30 4
2001-01-03 12
W 2000-12-30 16
My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.
Thanks for any help you can give.
Upvotes: 13
Views: 18222
Reputation: 2253
As the other answer points out, you're trying to collapse identical dates into single rows, whereas cumsum returns a series of the same length as the original DataFrame. Stated differently, you actually want to group by [Bool, Dir, Date], calculate a sum within each group, and THEN take a cumsum over rows grouped by [Bool, Dir]. The other answer is a perfectly valid solution to your specific question; here's a one-liner variation:
data1.index.name = 'Date'  # the 'Date' group key resolves to the (now named) index level
data1.groupby(['Bool', 'Dir', 'Date']).sum().groupby(level=[0, 1]).cumsum()
This returns output exactly in the requested format.
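Here's a fully runnable version of that one-liner. Note one assumption: the DatetimeIndex has to carry the name 'Date' for the group key to resolve, so the sketch below names it with rename_axis before grouping.

```python
import pandas as pd

data1 = pd.DataFrame({'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
                      'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
                      'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
                     index=pd.DatetimeIndex(['12/30/2000', '12/30/2000', '12/30/2000',
                                             '1/2/2001', '1/3/2001', '1/3/2001',
                                             '12/30/2000', '12/30/2000']))

# Name the index so 'Date' is usable as a group key, aggregate per
# (Bool, Dir, Date), then cumsum within each (Bool, Dir) group.
running_sum = (data1.rename_axis('Date')
                    .groupby(['Bool', 'Dir', 'Date']).sum()
                    .groupby(level=[0, 1]).cumsum())
print(running_sum)
```

The result is indexed by (Bool, Dir, Date), with one row per distinct date in each group, matching the requested output.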
For those looking for a simple cumsum on a Pandas group, you can use:
data1.groupby(['Bool', 'Dir']).apply(lambda x: x['Data'].cumsum())
The cumulative sum is calculated within each group. Here's what the output looks like:
Bool Dir
N E 2000-12-30 5
2000-12-30 16
W 2001-01-02 7
2001-01-03 16
Y E 2000-12-30 4
2001-01-03 12
W 2000-12-30 6
2000-12-30 16
Name: Data, dtype: int64
Note the repeated dates: this is a strict cumulative sum over the rows of each group identified by the Bool and Dir columns.
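If you want that same per-group running total aligned to the original rows, selecting the column before calling cumsum is a slightly simpler sketch (it keeps the original DatetimeIndex instead of building a hierarchical one):

```python
import pandas as pd

data1 = pd.DataFrame({'Dir': ['E', 'E', 'W', 'W', 'E', 'W', 'W', 'E'],
                      'Bool': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N'],
                      'Data': [4, 5, 6, 7, 8, 9, 10, 11]},
                     index=pd.DatetimeIndex(['12/30/2000', '12/30/2000', '12/30/2000',
                                             '1/2/2001', '1/3/2001', '1/3/2001',
                                             '12/30/2000', '12/30/2000']))

# Same group-wise running sum, but the result keeps the original
# row order and index, so it can be assigned back as a new column.
data1['running'] = data1.groupby(['Bool', 'Dir'])['Data'].cumsum()
print(data1)
```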
Upvotes: 9
Reputation: 17550
Try this:
data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"]) # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()
The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc.) produces a DataFrame smaller than the one you started with, because the function aggregates values based on your group keys. However, cumsum is not an aggregation function. It will return a DataFrame that is the same size as the one it's called on. So unless your input DataFrame is in a format where the output can be the same size after calling cumsum, it will throw an error. That's why I called sum first, which returns a DataFrame in the correct input format.
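The aggregation-vs-transformation distinction can be seen concretely in a minimal sketch (a toy frame, not the question's data):

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

# sum() is an aggregation: one row per group, so the frame shrinks.
agg = df.groupby('key').sum()

# cumsum() is a transformation: one value per input row, same length out.
trans = df.groupby('key')['val'].cumsum()

print(len(df), len(agg), len(trans))  # 3 2 3
```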
Sorry if I haven't explained this well enough. Maybe someone else could help me out?
Upvotes: 15