user1700890

Reputation: 7732

Detect row change by group and bring result back to original data frame

Here is my example. I group by col1, order by col3, and detect changes in col2 from one row to the next.

import pandas as pd
import datetime

my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 2, 3, 2, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 5, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 4, 1)]})

my_df_temp = my_df.sort_values(by=['col3']).groupby('col1')['col2'].apply(
    lambda x: x != x.shift(1)
).reset_index(name='col2_change')

Now I would like to bring the result back to my_df, i.e. I would like my_df to have a col2_change column.

Simple assignment will not work, because the rows of my_df_temp are no longer in the same order as my_df:

my_df['col2_change'] = my_df_temp.col2_change.values

One way I can do it is by ordering my_df by two columns col1 and col3 and then simply assigning, but it looks a bit laborious. Is there an easier way to do it?

Upvotes: 1

Views: 57

Answers (2)

Panda Kim

Reputation: 13257

Code

As everyone seems to agree, you don't need apply. In my opinion you don't need transform either, because pandas aligns Series on their index anyway, both in comparison methods such as ne and when assigning a column.

my_df['col2_change'] = (
    my_df.sort_values('col3')
         .groupby('col1')['col2'].shift()
         .ne(my_df['col2'])
)
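
To illustrate the index alignment this relies on, here is a minimal sketch on a hypothetical toy frame (the names key, val and order are made up for illustration and are not from the question). It assumes a default integer index with unique labels.

```python
import pandas as pd

# Hypothetical toy data, deliberately not sorted by "order".
df = pd.DataFrame({"key": ["a", "a", "b", "a"],
                   "val": [1, 1, 7, 1],
                   "order": [3, 1, 2, 4]})

# Previous value within each group, following the "order" column.
# The result keeps the original row labels, just in sorted order.
prev = df.sort_values("order").groupby("key")["val"].shift()
print(prev.index.tolist())

# Series.ne aligns both operands on their labels before comparing,
# and column assignment aligns on labels again.
# The first row of each group has no predecessor (NaN), so it counts as changed.
df["val_change"] = prev.ne(df["val"])
print(df)
```

Note that this works with the ne method; as far as I know, the != operator refuses to compare two Series whose indexes are ordered differently, so the method form is the safer choice here.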

Upvotes: 1

Your method is actually quite good, but if you want a more concise method that also eliminates the overhead of resetting the index, you can do this:

import pandas as pd
import datetime

my_df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                      'col2': [2, 2, 3, 2, 5, 5, 5],
                      'col3': [datetime.date(2023, 2, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 5, 1),
                               datetime.date(2023, 4, 1),
                               datetime.date(2023, 3, 1),
                               datetime.date(2023, 2, 1),
                               datetime.date(2023, 4, 1)]})

my_df['col2_change'] = (
    my_df.sort_values(by='col3')
    .groupby('col1')['col2']
    .transform(lambda x: x != x.shift(1))
)

print(my_df)

which gives

  col1  col2        col3  col2_change
0    a     2  2023-02-01         True
1    a     2  2023-03-01        False
2    a     3  2023-05-01         True
3    a     2  2023-04-01        False
4    b     5  2023-03-01        False
5    b     5  2023-02-01         True
6    b     5  2023-04-01        False

transform broadcasts group-level operations back to the rows of the dataframe it was called on, so the result keeps the original index labels and can be assigned straight to my_df without reset_index.
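
As a rough sketch of why the direct assignment lines up, the example below uses a hypothetical toy frame (grp, val and when are illustrative names, not from the question) and contrasts label-based assignment with positional assignment through to_numpy().

```python
import pandas as pd

# Hypothetical toy frame, deliberately out of chronological order.
df = pd.DataFrame({"grp": ["x", "x", "x", "y"],
                   "val": [5, 6, 5, 9],
                   "when": [2, 3, 1, 4]})

changed = (
    df.sort_values("when")
      .groupby("grp")["val"]
      .transform(lambda s: s != s.shift())
)

# The result carries the labels of the sorted frame.
print(changed.index.tolist())

# Assigning the Series matches rows by label, so each flag lands on its row.
df["aligned"] = changed

# Assigning the raw array matches rows by position and scrambles the flags.
df["positional"] = changed.to_numpy()

print(df)
```

The aligned column is the one you want; the positional column only happens to be right when the frame is already in sorted order.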

Upvotes: 2
