Pandas rolling function on categorical variables

Question

I have a pandas dataframe like this

I'm trying to group the data by group, then apply a custom function to the past 5 rows of each group. The custom function looks like this:

def unalikeability(data):

    num_observations = data.shape[0]
    counts = data.value_counts()

    return 1 - ((counts / num_observations)**2).sum()
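As a quick sanity check of what this function computes, here is a minimal sketch that applies it to a single window. The sample values are made up for illustration and are not the data from the question.

```python
import pandas as pd

def unalikeability(data):
    num_observations = data.shape[0]
    counts = data.value_counts()
    return 1 - ((counts / num_observations) ** 2).sum()

# Hypothetical window of five observations: proportions 2/5, 1/5 and 2/5
window = pd.Series([0, 0, 1, 2, 2])

# 1 - (0.16 + 0.04 + 0.16) = 0.64
score = unalikeability(window)
print(round(score, 2))
```

A window holding a single repeated value scores 0, and the score grows as the values become more evenly spread across categories.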

Desired output:

group unalikeability
1     result calculated by the function
1
1
1
2
2
2
2

I can get the past 5 rows using groupby().rolling(), but the rolling object in pandas has neither the shape attribute nor the value_counts method that a DataFrame has. I tried creating a DataFrame from the rolling object, but this isn't allowed either.

mozway · Accepted Answer

You can pass your function to the rolling object's apply method, which calls it on each window. Use min_periods to choose whether the output is computed only on full windows (5 values) or on windows of any size:

def unalikeability(data):

    num_observations = data.shape[0]
    counts = data.value_counts()

    return 1 - ((counts / num_observations)**2).sum()

# compute the score only for full windows of 5 rows (NaN otherwise)
df['out1'] = (df.groupby('group')
                .rolling(5)['cat']
                .apply(unalikeability)
                .droplevel('group')  # restore the original index for alignment
              )

# compute the score for incomplete windows too (at least 1 row)
df['out2'] = (df.groupby('group')
                .rolling(5, min_periods=1)['cat']
                .apply(unalikeability)
                .droplevel('group')  # restore the original index for alignment
              )

Output:

   group  cat  out1      out2
0      1    0   NaN  0.000000
1      2    0   NaN  0.000000
2      1    0   NaN  0.000000
3      1    1   NaN  0.444444
4      2    0   NaN  0.000000
5      2    1   NaN  0.444444
6      1    2   NaN  0.625000
7      1    2  0.64  0.640000
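
The example above uses categories that are already stored as integers. Rolling windows work on numeric data, so a column of string labels is likely to raise an error. One possible workaround, which is an assumption and not part of the answer above, is to encode the labels as integer codes with pd.factorize first. The frame below is rebuilt from the output table, with hypothetical labels "a", "b", "c" standing in for 0, 1, 2, and cat_code is an illustrative column name.

```python
import pandas as pd

def unalikeability(data):
    # rolling().apply() passes each window as a Series by default
    num_observations = data.shape[0]
    counts = data.value_counts()
    return 1 - ((counts / num_observations) ** 2).sum()

# Frame rebuilt from the output table; the string labels are hypothetical
df = pd.DataFrame({
    "group": [1, 2, 1, 1, 2, 2, 1, 1],
    "cat": ["a", "a", "a", "b", "a", "b", "c", "c"],
})

# Encode each distinct label as an integer code. The score depends only
# on label frequencies, so the particular codes do not affect the result.
df["cat_code"] = pd.factorize(df["cat"])[0]

# Select the numeric column before rolling, then score each window
df["out2"] = (
    df.groupby("group")["cat_code"]
      .rolling(5, min_periods=1)
      .apply(unalikeability)
      .droplevel("group")
)
print(df)
```

Note that pd.factorize encodes missing values as -1 by default, so they would be counted as a category of their own unless they are dropped or masked beforehand.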
