Reputation: 445
I have a dataframe with multiple values and would like to group by the 'email' column, retrieve the first and last row of each group, and compare them to see whether the category column changed. For example, if the category goes from MGR to MGR, there is no change; if it goes from EMP to MGR, that reflects a change in status.
date email category
13-04-2018 [email protected] MGR
13-04-2018 [email protected] EMP
18-04-2018 [email protected] EMP
20-04-2018 [email protected] MGR
11-01-2019 [email protected] MGR
15-10-2019 [email protected] MGR
16-11-2019 [email protected] MGR
31-01-2020 [email protected] EMP
02-05-2020 [email protected] MGR
05-08-2020 [email protected] MGR
14-02-2021 [email protected] MGR
15-02-2021 [email protected] MGR
I would like to get the following results:
date email category status
13-04-2018 [email protected] MGR no change
15-10-2019 [email protected] MGR no change
13-04-2018 [email protected] EMP change
15-02-2021 [email protected] MGR change
18-04-2018 [email protected] EMP change
16-11-2019 [email protected] MGR change
20-04-2018 [email protected] MGR no change
05-08-2020 [email protected] MGR no change
31-01-2020 [email protected] EMP change
14-02-2021 [email protected] MGR change
I've tried the following code, but it seems to only retrieve the first and last rows based on the groupby. Is there some method to compare the values between the first and last row?
#get the first and last row of the groupby
df2 = df.groupby('email', as_index=False).nth([0,-1])
I'd appreciate any help, thank you.
Upvotes: 0
Views: 795
Reputation: 13349
try:
This checks whether the first and last row of each group have the same category value; if the value changes, it sets a flag on both rows.
import numpy as np

# take the first and last row of a group
fl = lambda s: s.iloc[[0, -1]]

# flag both boundary rows of a group when the first and last category differ
res = df.groupby('email')['category'].apply(lambda x: fl(x).shift(1).ne(fl(x)) & (fl(x).nunique() > 1))
res.index = res.index.droplevel()

df['status'] = res
df.dropna(inplace=True)                # keep only the first/last row of each group
df['status'] = np.where(df.status, 'Change', 'No Change')
df.sort_values(by='email')
   | date | email | category | status |
---|------|-------|----------|--------|
0 | 13-04-2018 | [email protected] | MGR | No Change |
5 | 15-10-2019 | [email protected] | MGR | No Change |
2 | 18-04-2018 | [email protected] | EMP | Change |
6 | 16-11-2019 | [email protected] | MGR | Change |
1 | 13-04-2018 | [email protected] | EMP | Change |
11 | 15-02-2021 | [email protected] | MGR | Change |
3 | 20-04-2018 | [email protected] | MGR | No Change |
9 | 05-08-2020 | [email protected] | MGR | No Change |
7 | 31-01-2020 | [email protected] | EMP | Change |
10 | 14-02-2021 | [email protected] | MGR | Change |
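As a side note (my own sketch, not part of the code above): the same first-vs-last comparison can be done without apply by aggregating each group's first and last category and mapping the result back onto the boundary rows. This assumes df holds the question's email and category columns.

import numpy as np

# first and last category per email, compared directly
first_last = df.groupby('email')['category'].agg(['first', 'last'])
changed = first_last['first'].ne(first_last['last'])   # True where the category changed

# keep only the first and last row of each group and attach the flag
ends = df.groupby('email', as_index=False).nth([0, -1]).copy()
ends['status'] = np.where(ends['email'].map(changed), 'Change', 'No Change')
print(ends.sort_values('email'))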
Upvotes: 0
Reputation: 124
Not sure if it is efficient enough, but it works well.
def check_status(group):
    # select only the first and last row of the group
    selected = [False] * len(group)
    selected[0] = selected[-1] = True
    new_group = group[selected].copy()
    # if the two categories differ, the status changed
    new_group['status'] = 'change' if new_group.category.is_unique else 'no change'
    return new_group

print(df.groupby('email').apply(check_status).reset_index(drop=True))
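For a quick sanity check, here is a hypothetical two-row mini example (made up for illustration, not taken from the question's data) showing the labeling:

import pandas as pd

# hypothetical mini example: one email whose category goes EMP -> MGR
mini = pd.DataFrame({'date': ['01-01-2020', '01-02-2020'],
                     'email': ['[email protected]', '[email protected]'],
                     'category': ['EMP', 'MGR']})

print(mini.groupby('email').apply(check_status).reset_index(drop=True))
# both rows come back with status 'change' because the first and last categories differ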
Upvotes: 2