Vero

Reputation: 459

Compare values in dataframe column by fractions

I have a dataset (shown as an image in the original post).

I need to compare the index of the last month (within each id group) with the two previous months and check whether it is the same.

This is the code I'm trying:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'month': [202001, 202002, 202003, 202004, 202001, 202002, 202003, 202004, 202001, 202002, 202003, 202004],
                   'index': [3, 3, 3, 3, 4, 4, 5, 5, 2, 3, 3, 3]})

df['check']=np.where(df.sort_values(['id', 'month'])
                             .groupby('id')['index']
                             .apply(lambda x: x.shift(3))
                             .transform('nunique')>1,1,0)

It returns an error: ValueError: transforms cannot produce aggregated results

Without "apply" the code works. What am I doing wrong?

The desired output is something like this (shown as an image in the original post).

Thank you for any suggestions.

Upvotes: 2

Views: 91

Answers (2)

anky

Reputation: 75100

You can use groupby.shift to get the value from two months earlier within each id, subtract it from index, and then clip the upper limit to 1:

df['check'] = df['index'].sub(df.groupby("id")['index'].shift(2)).clip(upper=1)

print(df)

    id   month  index  check
0    1  202001      3    NaN
1    1  202002      3    NaN
2    1  202003      3    0.0
3    1  202004      3    0.0
4    2  202001      4    NaN
5    2  202002      4    NaN
6    2  202003      5    1.0
7    2  202004      5    1.0
8    3  202001      2    NaN
9    3  202002      3    NaN
10   3  202003      3    1.0
11   3  202004      3    0.0
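
Two caveats with the subtraction: a decrease in index gives a negative number that clip(upper=1) leaves as is, and a pattern such as 3, 4, 3 is reported as unchanged because only the month two steps back is compared. If the intent is that all three months (the current one and the two before it) must agree, a rolling window per id may be closer to what is needed. This is a minimal sketch under that assumption; the sample data and the name distinct are illustrative and not from the question:

```python
import pandas as pd

# Illustrative data: id 2 decreases (5, 5, 4) and id 3 changes and returns (3, 4, 3).
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'month': [202001, 202002, 202003] * 3,
                   'index': [3, 3, 3, 5, 5, 4, 3, 4, 3]})
df = df.sort_values(['id', 'month'])

# Number of distinct values in each 3-month window per id.
# Windows with fewer than 3 months give NaN.
distinct = (df.groupby('id')['index']
              .rolling(3)
              .apply(lambda window: window.nunique())
              .reset_index(level=0, drop=True))  # drop the id level to restore the original row index

# 1.0 if the window holds more than one value, 0.0 if all three agree, NaN if the window is incomplete.
df['check'] = distinct.gt(1).astype(float).where(distinct.notna())

print(df)
```

Here the first two months of every id stay NaN, id 1 gets 0.0 in its last month, and ids 2 and 3 both get 1.0.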

Upvotes: 2

Ynjxsjmh

Reputation: 30050

You can compare each group with its shifted copy directly inside transform:

df['check'] = df.sort_values(['id', 'month']).groupby('id')['index'].transform(lambda group: group != group.shift(2)).astype(int)
print(df)
    id   month  index  check
0    1  202001      3      1
1    1  202002      3      1
2    1  202003      3      0
3    1  202004      3      0
4    2  202001      4      1
5    2  202002      4      1
6    2  202003      5      1
7    2  202004      5      1
8    3  202001      2      1
9    3  202002      3      1
10   3  202003      3      1
11   3  202004      3      0

Then change the first two rows of each group in the check column to '':

df['check'] = df.sort_values(['id', 'month']).groupby('id')['check'].transform(lambda group: ['', '']+list(group[2:]))
print(df)

    id   month  index check
0    1  202001      3      
1    1  202002      3      
2    1  202003      3     0
3    1  202004      3     0
4    2  202001      4      
5    2  202002      4      
6    2  202003      5     1
7    2  202004      5     1
8    3  202001      2      
9    3  202002      3      
10   3  202003      3     1
11   3  202004      3     0
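
As for why the original attempt fails, the likely cause is that .apply(lambda x: x.shift(3)) returns an ordinary Series, so the following .transform('nunique') no longer runs per group. On a plain Series, 'nunique' collapses everything to a single number, and transform refuses a result that is not the same shape as its input. A small sketch to reproduce this (the names shifted, message and per_group are illustrative; the exact error text can differ between pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'month': [202001, 202002, 202003, 202004] * 3,
                   'index': [3, 3, 3, 3, 4, 4, 5, 5, 2, 3, 3, 3]})

# After apply, the grouping is gone: this is a plain Series.
shifted = df.groupby('id')['index'].apply(lambda x: x.shift(3))
print(type(shifted).__name__)

# transform('nunique') on a plain Series would reduce it to one number, so pandas raises.
message = None
try:
    shifted.transform('nunique')
except ValueError as err:
    message = str(err)
print('ValueError:', message)

# Called on the grouped object instead, the per-id count is broadcast back to every row.
per_group = df.groupby('id')['index'].transform('nunique')
print(per_group.tolist())
```

That is also why the code works without apply: transform is then called on the groupby object rather than on a Series.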

Upvotes: 1
