Reputation: 1079
I have been working with a dataframe having the following format (the actual table has much more rows(id's) and columns(value_3, value_4 etc..)):
where for each id, the status
column has the value 'new' if this is the first entry for that id, and the value 'modified' if any of the value_1
, value_2
columns have changed compared to their previous value. I would like to create a log of any changes made in the table, in particular I would like the resulted format for the given data above to be something like this:
Ideally, I would like to avoid using loops, so could you please suggest any more efficient pythonic way to achieve the format above?
I have seen the answers posted for the question here: Determining when a column value changes in pandas dataframe
which partly do the job I want (using shift
or diff
) for identifying cases where there was a change, and I was wondering if this is the best way to build on for my case, or if there is a more efficient way to do that and speed up the process. Ideally, I would like something that can work for both numeric and non-numeric values in value_1
, value_2
, etc columns..
Code for creating the sample data of the first pic:
import pandas as pd
data = [[1,2,5,'new'], [1,1,5,'modified'], [1,0,5,'modified'],
[2,5,2,'new'], [2,5,3,'modified'], [2,5,4,'modified'] ]
df = pd.DataFrame(data, columns = ['id', 'value_1', 'value_2',
'status'])
df
Many thanks in advance for any suggestion/help!
Upvotes: 0
Views: 306
Reputation: 323226
We do need melt
first then groupby
after drop_duplicates
s = df.melt(['id','status']).drop_duplicates(['id','variable','value'])
s['new'] = s.groupby(['id','variable'])['value'].shift()
s #s.sort_values('id')
id status variable value new
0 1 new value_1 2 NaN
1 1 modified value_1 1 2.0
2 1 modified value_1 0 1.0
3 2 new value_1 5 NaN
6 1 new value_2 5 NaN
9 2 new value_2 2 NaN
10 2 modified value_2 3 2.0
11 2 modified value_2 4 3.0
Upvotes: 1