itjcms18
itjcms18

Reputation: 4333

How to calculate a rolling count of a categorical variable in pandas

I'm attempting to do a rolling count on a dataframe. The problem that I am having is specifying the condition since it is a string, not an integer. The dataframe below is a snippet, along with a snippet of a dictionary.

    GameID  Event
0   100     NaN
1   100     NaN
2   100     Ben
3   100     NaN
4   100     Steve
5   100     Ben
6   100     NaN
7   100     Steve
8   100     NaN
9   100     NaN
10  101     NaN   
11  101     NaN   
12  101     Joe   
13  101     NaN   
14  101     Will  
15  101     Joe   
16  101     NaN   
17  101     Will 
18  101     NaN    
19  101     NaN   



gamedic = {'100':['Ben','Steve'], '101':['Joe','Will']}

Ultimately, I would want the dataframe to look like the following. I named the columns Ben and Steve for this example but in reality they will be First and Second, corresponding to their place in the dictionary.

    GameID  Event  Ben  Steve
0   100     NaN     0    0
1   100     NaN     0    0
2   100     Ben     0    0
3   100     NaN     1    0
4   100     Steve   1    0
5   100     Ben     1    1
6   100     NaN     2    1
7   100     Steve   2    1 
8   100     NaN     2    2
9   100     NaN     2    2
10  101     NaN     0    0
11  101     NaN     0    0
12  101     Joe     0    0
13  101     NaN     1    0
14  101     Will    1    0
15  101     Joe     1    1
16  101     NaN     2    1
17  101     Will    2    1 
18  101     NaN     2    2
19  101     NaN     2    2


pd.rolling_count(df.Event, 1000,0).shift(1)
ValueError: could not convert string to float: Steve

I'm not sure if this is a complicated problem or if I'm missing something obvious in pandas. The whole string concept makes it tough for me to even get going.

Upvotes: 2

Views: 2059

Answers (2)

maxymoo
maxymoo

Reputation: 36545

First you want to use your dictionary to get a column containing just "first" and "second". I cant think of a clever way to do this so let's just iterate over the rows:

import numpy as np
df['Winner'] = np.nan
for i,row in df.iterrows():
   if row.Event == gamedic[row.GameID][0]:
      df['Winner'].ix[i] = 'First'
   if row.Event == gamedic[row.GameID][1]:
      df['Winner'].ix[i] = 'Second'

You can use pd.get_dummies to convert a string column (representing a categorical variable) to indicator variables; in your case this will give you

pd.get_dummies(df.Winner)
Out[46]: 
    First  Second
0       0       0
1       0       0
2       1       0
3       0       0
4       0       1
5       1       0
6       0       0
7       0       1
8       0       0
9       0       0
10      0       0
11      0       0
12      1       0
13      0       0
14      0       1
15      1       0
16      0       0
17      0       1
18      0       0
19      0       0

You can add these onto your original dataframe with pd.concat:

df = pd.concat([df,pd.get_dummies(df.Winner)],axis=1)

Then you can get your cumulative sums with groupby.cumsum as in @Brian's answer

df.groupby('GameID').cumsum()
Out[60]: 
    First  Second
0       0       0
1       0       0
2       1       0
3       1       0
4       1       1
5       2       1
6       2       1
7       2       2
8       2       2
9       2       2
10      0       0
11      0       0
12      1       0
13      1       0
14      1       1
15      2       1
16      2       1
17      2       2
18      2       2
19      2       2

Upvotes: 1

Brian Pendleton
Brian Pendleton

Reputation: 829

Is this what you're looking for?

df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
                  columns=['A'])
df
   A
0  a
1  a
2  a
3  b
4  b
5  a

df.groupby('A').cumcount()

0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html

Upvotes: 0

Related Questions