roundsquare
roundsquare

Reputation: 221

How do I find rows in a pandas dataframe where values have changed?

I am in a situation where I get data on various items and sometimes some fields for an item changes. I'm looking for an easy way to detect when something has changed. Here's an example. Let's say have like this:

df = pd.DataFrame({'ID': [1, 1, 1, 2, 3, 2],
                   'SUBID': [1, 2, 1, 1, 1, 1],
                   'val1': [1, 4, 1, 7, 10, 8],
                   'val2': [2, 5, 2, 8, 11, 9],
                   'val3': [3, 6, 4, 9, 12, 9],
                   'date': ['20200413', '20200413', '20200414', '20200414',
                            '20200415', '20200416']})
ID  SUBID   val1    val2    val3    date
1   1       1       2       3       20200413
1   2       4       5       6       20200413
1   1       1       2       4       20200414
2   1       7       8       9       20200414
3   1       10      11      12      20200415
2   1       8       9       9       20200416

I'd like an operation that gives me something like this:

   ID  SUBID   col      date  prev  new
0   1      1  val3  20200414     3    4
1   2      1  val1  20200416     7    8
2   2      1  val2  20200416     8    9

Thanks in advance!

Upvotes: 1

Views: 93

Answers (2)

ansev
ansev

Reputation: 30940

Here is my approach with GroupBy.shift and DataFrame.melt

new_df = (df.melt(['ID','SUBID','date'], value_name='new')
            .assign(prev=lambda x: x.groupby(['ID','SUBID','variable'])['new']
                                    .shift())
            .loc[lambda x: ~x['new'].eq(x['prev']) & x['prev'].notna()]
            .sort_values(['ID','SUBID']))

print(new_df)
    ID  SUBID      date variable  new  prev
14   1      1  20200414     val3    4   3.0
5    2      1  20200416     val1    8   7.0
11   2      1  20200416     val2    9   8.0

Upvotes: 2

Ben.T
Ben.T

Reputation: 29635

IIUC, you can do it by first reshaping the dataframe with set_index, stack, then groupby some columns to agg date and the values columns with first and/or last, then to filter only the one that have changed you can query.

df_f = (df.set_index(['ID', 'SUBID', 'date'])
          .rename_axis(columns='col').stack()
          .reset_index('date', name='val')
          .groupby(['ID', 'SUBID', 'col'])
          .agg(date=('date','last'), prev=('val','first'), new=('val','last'))
          .query('prev != new').reset_index())

print (df_f)
   ID  SUBID   col      date  prev  new
0   1      1  val3  20200414     3    4
1   2      1  val1  20200416     7    8
2   2      1  val2  20200416     8    9

Upvotes: 2

Related Questions