wjie08
wjie08

Reputation: 445

Comparing first and last row of groupby and creating new value

I have a dataframe with multiple values and would like to groupby the 'email' column, retrieve the first row and last row, and compare to see if there is a change in the status of the category column. For example, if the category is from MGR to MGR, then there is no change. If the category changes from EMP to MGR, then it reflects a change in status.

date                 email               category
13-04-2018            [email protected]     MGR
13-04-2018            [email protected]      EMP
18-04-2018            [email protected]     EMP
20-04-2018            [email protected]      MGR
11-01-2019            [email protected]     MGR
15-10-2019            [email protected]     MGR
16-11-2019            [email protected]     MGR
31-01-2020            [email protected]      EMP
02-05-2020            [email protected]      MGR
05-08-2020            [email protected]      MGR
14-02-2021            [email protected]      MGR
15-02-2021            [email protected]      MGR

Would like to get the following results

date                 email               category    status
13-04-2018            [email protected]     MGR        no change
15-10-2019            [email protected]     MGR        no change
13-04-2018            [email protected]      EMP        change
15-02-2021            [email protected]      MGR        change
18-04-2018            [email protected]     EMP        change 
16-11-2019            [email protected]     MGR        change 
20-04-2018            [email protected]      MGR        no change
05-08-2020            [email protected]      MGR        no change
31-01-2020            [email protected]      EMP        change 
14-02-2021            [email protected]      MGR        change

I've tried the following code, but it seems to only retrieve the first and last rows based on the groupby. Is there some method to compare the values between the first and last row?

#get the first and last row of the groupby
df2 = df.groupby('email', as_index=False).nth([0,-1])

appreciate any form of help, thank you.

Upvotes: 0

Views: 795

Answers (2)

Pygirl
Pygirl

Reputation: 13349

try:

This will check whether groupby elements have the consecutive(same values or not) if there is any change it will set the flag.

fl = lambda s: s.iloc[[0,-1]]
res = df.groupby('email')['category'].apply(lambda x: (fl(x).shift(1).ne(fl(x)) & (fl(x).nunique()>1)))
res.index = res.index.droplevel()
df['status'] = res
df.dropna(inplace=True)
df['status'] = np.where(df.status,  'Change', 'No Change')

df.sort_values(by='email'):

date email category status
0 13-04-2018 [email protected] MGR No Change
5 15-10-2019 [email protected] MGR No Change
2 18-04-2018 [email protected] EMP Change
6 16-11-2019 [email protected] MGR Change
1 13-04-2018 [email protected] EMP Change
11 15-02-2021 [email protected] MGR Change
3 20-04-2018 [email protected] MGR No Change
9 05-08-2020 [email protected] MGR No Change
7 31-01-2020 [email protected] EMP Change
10 14-02-2021 [email protected] MGR Change

Upvotes: 0

Zebartin
Zebartin

Reputation: 124

Not sure if it is effecient enough, but it works well.

def check_status(group):
    selected = [False] * len(group)
    selected[0] = selected[-1] = True
    new_group = group[selected]
    new_group['status'] = 'change' if new_group.category.is_unique else 'no change'
    return new_group

print(df.groupby('email').apply(check_status).reset_index(drop=True))

Upvotes: 2

Related Questions