Mike

Reputation: 858

Increasing count where a condition is met within pandas GroupBy

How do I count the number of multicolumn (thing, cond=1) event occurrences prior to every (thing, cond=any) event?


(These could be winning games of poker by player, episodes of depression by patient, and so on.) For example, row index == 3 below contains the pair (thing, cond) = (c, 2), and the number of prior (c, 1) occurrences is correctly (but manually) shown in the priors column as 0. I want to produce a synthetic column with the count of prior (thing, 1) events for every (thing, cond) pair in my data. My data are monotonically increasing in time, and the natural index in the silly DataFrame can be taken as logical ticks, if it helps. (<Narrator>: It really doesn't.)

For convenience, below is the code for my test DataFrame and the manually generated priors column, which I cannot get pandas to usefully generate, no matter which combination of groupby, cumsum, shift, where, etc. I try. I have googled and racked my brain for days; no SO answers seem to fit the bill. The key to reading the priors column is that its entries say things like, "Before this (a,1) or (a,2) event, there have been 2 (a,1) events."

[In]:
import pandas as pd
silly = pd.DataFrame({'thing': ['a','b','a','c','b','c','c','a','a','b','c','a'], "cond": [1,2,1,2,1,2,1,2,1,2,1,2]})
silly['priors'] = pd.Series([0,0,1,0,0,0,0,2,2,1,1,3])
silly

[Out]:
silly
   thing  cond  priors
0      a     1       0
1      b     2       0
2      a     1       1
3      c     2       0
4      b     1       0
5      c     2       0
6      c     1       0
7      a     2       2
8      a     1       2
9      b     2       1
10     c     1       1
11     a     2       3

The closest I've come is:

[In]:
silly['priors_inc'] = silly[['thing', 'cond']].where(silly['cond'] == 1).groupby('thing').cumsum() - 1

[Out]:

silly
   thing  cond  priors  priors_inc
0      a     1       0         0.0
1      b     2       0         NaN
2      a     1       1         1.0
3      c     2       0         NaN
4      b     1       0         0.0
5      c     2       0         NaN
6      c     1       0         0.0
7      a     2       2         NaN
8      a     1       2         2.0
9      b     2       1         NaN
10     c     1       1         1.0
11     a     2       3         NaN

Note that the values that are present in the incomplete priors_inc column are correct, but not all of the desired data are there: every cond == 2 row is NaN.

Please, if at all possible, withhold any "Pythonic" answers. While my real data are small compared to most ML problems, I want to learn pandas the right way, not the toy data way with Python loops or itertools chicanery that I've seen too much of already. Thank you in advance! (And I apologize for the wall of text!)

Upvotes: 1

Views: 459

Answers (1)

cs95

Reputation: 402333

You need to

  1. Cumulatively count where each "cond" is 1.
  2. Do this for each "thing".
  3. Shift the counts by 1, so each row counts only strictly prior events.

You can do this using groupby, cumsum and shift:

(df.cond.eq(1)
   .groupby(df.thing)
   .apply(lambda x: x.cumsum().shift())
   .fillna(0, downcast='infer'))

0     0
1     0
2     1
3     0
4     0
5     0
6     0
7     2
8     2
9     1
10    1
11    3
Name: cond, dtype: int64
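To check this against the hand-built priors column, you can assign the result back to the frame (priors_calc is my own column name, and on pandas >= 2.0 you need group_keys=False so that apply keeps the original index):

```python
import pandas as pd

# Rebuild the question's DataFrame, including the hand-made priors column.
silly = pd.DataFrame({
    'thing': ['a','b','a','c','b','c','c','a','a','b','c','a'],
    'cond':  [1,2,1,2,1,2,1,2,1,2,1,2],
})
silly['priors'] = [0, 0, 1, 0, 0, 0, 0, 2, 2, 1, 1, 3]

# group_keys=False keeps the original index through apply on recent pandas.
silly['priors_calc'] = (silly.cond.eq(1)
                        .groupby(silly.thing, group_keys=False)
                        .apply(lambda x: x.cumsum().shift())
                        .fillna(0)
                        .astype(int))

print(silly['priors_calc'].equals(silly['priors']))  # True
```

(`.fillna(0).astype(int)` replaces the `downcast='infer'` argument used above, which is deprecated in recent pandas.)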

Another option to avoid the apply is to chain two groupby calls: one performs the cumsum, the other does the shifting.

(df.cond.eq(1)
   .groupby(df.thing)
   .cumsum()
   .groupby(df.thing)
   .shift()
   .fillna(0, downcast='infer'))

0     0
1     0
2     1
3     0
4     0
5     0
6     0
7     2
8     2
9     1
10    1
11    3
Name: cond, dtype: int64
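A note on step 3: the shift itself must be per-group. A plain .shift() on the whole cumulative series would leak one thing's running count into the next thing's first row. A minimal sketch of the pitfall:

```python
import pandas as pd

df = pd.DataFrame({
    'thing': ['a','b','a','c','b','c','c','a','a','b','c','a'],
    'cond':  [1,2,1,2,1,2,1,2,1,2,1,2],
})

# Running count of cond == 1 events within each thing.
counts = df.cond.eq(1).groupby(df.thing).cumsum()

# Grouped shift: each thing's first row correctly becomes NaN -> 0.
good = counts.groupby(df.thing).shift().fillna(0).astype(int)

# Ungrouped shift: row 1 (the first 'b') wrongly inherits a's running count.
bad = counts.shift().fillna(0).astype(int)

print(good.iloc[1], bad.iloc[1])  # 0 1
```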

Upvotes: 2
