Reputation: 858
How do I count the number of multicolumn (thing, cond=1) event occurrences prior to every (thing, cond=any) event?
(These could be winning games of poker by player, episodes of depression by patient, or so on.) For example, row index == 3, below, contains the pair (thing, cond) = (c, 2) and shows the number of prior (c, 1) occurrences, which is correctly (but manually) shown in the priors column as 0. I'm interested in producing a synthetic column with the count of prior (thing, 1) events for every (thing, cond) pair in my data. My data are monotonically increasing in time. The natural index in the silly DataFrame can be taken as logical ticks, if it helps. (<Narrator>: It really doesn't.)
For convenience, below is the code for my test DataFrame and the manually generated priors column, which I cannot get pandas to usefully generate, no matter which combinations of groupby, cumsum, shift, where, etc. I try. I have googled and racked my brain for days; no SO answers seem to fit the bill. The key to reading the priors column is that its entries say things like, "Before this (a,1) or (a,2) event, there have been 2 (a,1) events."
[In]:
import pandas as pd
silly = pd.DataFrame({'thing': ['a', 'b', 'a', 'c', 'b', 'c', 'c', 'a', 'a', 'b', 'c', 'a'],
                      'cond':  [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]})
silly['priors'] = pd.Series([0, 0, 1, 0, 0, 0, 0, 2, 2, 1, 1, 3])
silly
[Out]:
silly
thing cond priors
0 a 1 0
1 b 2 0
2 a 1 1
3 c 2 0
4 b 1 0
5 c 2 0
6 c 1 0
7 a 2 2
8 a 1 2
9 b 2 1
10 c 1 1
11 a 2 3
The closest I've come is:
[In]:
silly['priors_inc'] = silly[['thing', 'cond']].where(silly['cond'] == 1).groupby('thing').cumsum() - 1
[Out]:
silly
thing cond priors priors_inc
0 a 1 0 0.0
1 b 2 0 NaN
2 a 1 1 1.0
3 c 2 0 NaN
4 b 1 0 0.0
5 c 2 0 NaN
6 c 1 0 0.0
7 a 2 2 NaN
8 a 1 2 2.0
9 b 2 1 NaN
10 c 1 1 1.0
11 a 2 3 NaN
Note that the values that are present in the incomplete priors_inc column are correct, but not all of the desired data are there.
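(A minimal sketch of where the gaps come from, assuming the silly frame defined above: where() blanks whole rows, including the thing grouping key, wherever cond != 1, so the subsequent groupby('thing') silently drops those rows and they never receive a value.)

import pandas as pd

silly = pd.DataFrame({'thing': ['a', 'b', 'a', 'c', 'b', 'c', 'c', 'a', 'a', 'b', 'c', 'a'],
                      'cond':  [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]})

# where() masks entire rows, including the 'thing' key, wherever cond != 1
masked = silly[['thing', 'cond']].where(silly['cond'] == 1)
print(masked.head(4))
#   thing  cond
# 0     a   1.0
# 1   NaN   NaN   <- the group key is gone, so groupby('thing') drops this row
# 2     a   1.0
# 3   NaN   NaN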
Please, if at all possible, withhold any "Pythonic" answers. While my real data are small compared to most ML problems, I want to learn pandas the right way, not the toy data way with Python loops or itertools chicanery that I've seen too much of already. Thank you in advance! (And I apologize for the wall of text!)
Upvotes: 1
Views: 459
Reputation: 402333
You need the cumulative count of cond == 1 within each thing group, shifted by 1 so each row counts only prior events. You can do this using groupby, cumsum, and shift:
(df.cond.eq(1)                            # flag the (thing, 1) events
 .groupby(df.thing)
 .apply(lambda x: x.cumsum().shift())     # running count per thing, shifted to exclude the current row
 .fillna(0, downcast='infer'))            # the first row of each group has no prior events
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
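To make the lambda concrete, here is a minimal sketch of the same three steps on just the thing == 'a' rows of the question's silly frame (the a Series below simply copies those indices and cond values):

import pandas as pd

# the thing == 'a' slice of silly: original indices 0, 2, 7, 8, 11
a = pd.Series([1, 1, 2, 1, 2], index=[0, 2, 7, 8, 11], name='cond')

flagged = a.eq(1)           # True, True, False, True, False
running = flagged.cumsum()  # 1, 2, 2, 3, 3   (count of (a, 1) events so far, inclusive)
prior = running.shift()     # NaN, 1, 2, 2, 3 (exclude the current row)
print(prior.fillna(0).astype(int).tolist())  # [0, 1, 2, 2, 3], matching priors for thing 'a'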
Another option that avoids the apply is to chain two groupby calls: the first performs the cumsum, the second does the shifting.
(df.cond.eq(1)
.groupby(df.thing)
.cumsum()
.groupby(df.thing)
.shift()
.fillna(0, downcast='infer'))
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
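As a usage sketch, here is the chained version applied back to the question's silly frame (df swapped for silly, fillna(0).astype(int) used in place of the downcast keyword, and priors_calc is just a scratch column name for the comparison). Both groupby results stay aligned to the original index, so plain column assignment works:

import pandas as pd

silly = pd.DataFrame({'thing': ['a', 'b', 'a', 'c', 'b', 'c', 'c', 'a', 'a', 'b', 'c', 'a'],
                      'cond':  [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]})
silly['priors'] = pd.Series([0, 0, 1, 0, 0, 0, 0, 2, 2, 1, 1, 3])

# running count of cond == 1 per thing, shifted so each row counts only prior events
silly['priors_calc'] = (silly.cond.eq(1)
                        .groupby(silly.thing)
                        .cumsum()
                        .groupby(silly.thing)
                        .shift()
                        .fillna(0)
                        .astype(int))

print((silly['priors_calc'] == silly['priors']).all())  # True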
Upvotes: 2