Get the cumulative most frequent status for specific column in pandas dataframe

Question

I have a dataframe:

# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7,8])
df['ID'] = [1,1,1,1,2,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=8, freq="M")
df['status'] = ['b','a','b','c','a','d','d','b']

# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)

df 

ID  election_date   status
0   2   2010-08-31    b
1   2   2010-07-31    d
2   2   2010-06-30    d
3   2   2010-05-31    a
4   1   2010-04-30    c
5   1   2010-03-31    b
6   1   2010-02-28    a
7   1   2010-01-31    b

I would like to get the cumulative most frequent status for column status for each ID. This is what I would expect:

    ID  election_date   status  cum_most_freq_status
0   2   2010-08-31        b          d
1   2   2010-07-31        d          d
2   2   2010-06-30        d          a
3   2   2010-05-31        a          NaN
4   1   2010-04-30        c          b
5   1   2010-03-31        b          a 
6   1   2010-02-28        a          b
7   1   2010-01-31        b          NaN

Interpretation:

for 2010-01-31 the value is NaN because there was no status value in the past. The same works for 2010-05-31.
for 2010-03-31 the most frequent status in the past was a and b. Therefore we take the most recent value, which was a.

How would you do it?

perl · Accepted Answer

You can first make a DataFrame with ID and election_date as its index, and one-hot-encoded status values, then calculate cumsum.

We want to pick the most recent status if there is a tie in counts, so I'm adding a small number (less than 1) to cumsum for the current status, so when we apply idxmax it will pick up the most recent status in case there's a tie.

After finding the most frequent cumulative status with idxmax we can merge with the original DataFrame:

# make one-hot-encoded status dataframe
z = (df
         .groupby(['ID', 'election_date', 'status'])
         .size().unstack().fillna(0))

# break ties to choose most recent
z = z.groupby(level=0).cumsum() + (z * 1e-4)

# shift by 1 row, since we only count previous status occurrences
z = z.groupby(level=0).shift()

# merge
df.merge(z.idxmax(axis=1).to_frame('cum_most_freq_status').reset_index())

Output:

   ID election_date status cum_most_freq_status
0   2    2010-08-31      b                    d
1   2    2010-07-31      d                    d
2   2    2010-06-30      d                    a
3   2    2010-05-31      a                  NaN
4   1    2010-04-30      c                    b
5   1    2010-03-31      b                    a
6   1    2010-02-28      a                    b
7   1    2010-01-31      b                  NaN

Get the cumulative most frequent status for specific column in pandas dataframe

Answers (1)

Related Questions