mc51

Reputation: 2277

Determine change in values in a grouped dataframe

Assume a dataset like this (which originally is read in from a .csv):

data = pd.DataFrame({'id': [1,2,3,1,2,3],
                     'time':['2017-01-01 12:00:00','2017-01-01 12:00:00','2017-01-01 12:00:00',
                          '2017-01-01 12:10:00','2017-01-01 12:10:00','2017-01-01 12:10:00'],
                     'value': [10,11,12,10,12,13]})

=>

    id  time                    value
0   1   2017-01-01 12:00:00     10
1   2   2017-01-01 12:00:00     11
2   3   2017-01-01 12:00:00     12
3   1   2017-01-01 12:10:00     10
4   2   2017-01-01 12:10:00     12
5   3   2017-01-01 12:10:00     13

Time is identical for all IDs in each observation period. The series goes on like that for many observations, i.e. every ten minutes.

I want the total number of changes in the value column per id between consecutive times. For example: for id=1 there is no change (result: 0); for id=2 there is one change (result: 1). Inspired by this post, I have tried taking differences: Determining when a column value changes in pandas dataframe

This is what I've come up with so far (not working as expected):

data = data.set_index(['id', 'time']) # MultiIndex 
grouped = data.groupby(level='id') 
data['diff'] = grouped['value'].diff()
data.loc[data['diff'].notnull(), 'diff'] = 1
data.loc[data['diff'].isnull(), 'diff'] = 0
grouped['diff'].sum()

However, this just counts every non-null diff per id (i.e. the number of observations after the first), not the number of actual value changes, because `notnull` marks zero diffs as changes too.
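The overcount can be seen directly; a minimal sketch of the attempt on the sample data (my own illustration, using the question's column names):

```python
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3],
                     'time': ['2017-01-01 12:00:00'] * 3 + ['2017-01-01 12:10:00'] * 3,
                     'value': [10, 11, 12, 10, 12, 13]})

diffs = data.groupby('id')['value'].diff()   # NaN on each group's first row, else the change
# notnull() is True for every row after a group's first, even when the diff is 0:
print(diffs.notnull().groupby(data['id']).sum())       # 1 for every id
# Comparing the diff against 0 counts only genuine changes:
print((diffs.fillna(0) != 0).groupby(data['id']).sum())  # 0, 1, 1
```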

Since my dataset is huge (and won't fit into memory), the solution should be as fast as possible. (This is why I use a MultiIndex on id + time; I expect a significant speedup because, ideally, the data no longer need to be shuffled.)

Moreover, I have come across dask dataframes, which are very similar to pandas DataFrames. A solution making use of them would be fantastic.
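For reference, the groupby/shift pattern from the answers below carries over to dask; a hedged sketch (the file pattern is hypothetical, and the in-group sort is my own precaution, since dask's shuffle does not guarantee row order within a group):

```python
import pandas as pd

def count_changes(s):
    """Number of changes between consecutive values in one id's series."""
    # The first row always differs from its NaN shift, hence the -1.
    return int((s != s.shift()).sum()) - 1

data = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3],
                     'time': pd.to_datetime(['2017-01-01 12:00:00'] * 3
                                            + ['2017-01-01 12:10:00'] * 3),
                     'value': [10, 11, 12, 10, 12, 13]})

# In-memory pandas version (sort by time so consecutive rows are consecutive observations):
print(data.sort_values('time').groupby('id')['value'].apply(count_changes))

# Out-of-core dask version (untested sketch; dask's groupby-apply needs an explicit
# `meta` describing the result, and triggers a shuffle):
# import dask.dataframe as dd
# ddf = dd.read_csv('data-*.csv', parse_dates=['time'])
# ddf.groupby('id')['value'].apply(count_changes, meta=('value', 'int64')).compute()
```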

Upvotes: 2

Views: 2276

Answers (3)

BENY

Reputation: 323226

data.groupby('id').value.agg(lambda x : (x.diff()!=0).sum()).add(-1)
id
1    0
2    1
3    1
Name: value, dtype: int64

Another option, using pct_change:

data.groupby('id').value.apply(lambda x : (x.pct_change()!=0).sum()).add(-1)
id
1    0
2    1
3    1
Name: value, dtype: int64

Upvotes: 3

cs95

Reputation: 402413

I think you're looking for a groupby and a comparison against shift:

data.groupby('id')['value'].agg(lambda x: (x != x.shift(-1)).sum() - 1) 

id
1    0
2    1
3    1
Name: value, dtype: int64

Upvotes: 4

Vaishali

Reputation: 38415

Do you want something like this?

data.groupby('id').value.apply(lambda x: len(set(x)) - 1)

You get

id
1    0
2    1
3    1

Edit: As @COLDSPEED mentioned, if the requirement is to also capture a change back to an earlier value, use

data.groupby('id').value.apply(lambda x: (x != x.shift()).sum() - 1)
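The difference matters when a value changes and later changes back; a quick illustration on a toy series (my own example, not from the question's data):

```python
import pandas as pd

s = pd.Series([10, 12, 10])        # 10 -> 12 -> 10: two changes, but only two distinct values
print(len(set(s)) - 1)             # 1 -- the distinct-value count misses the change back
print((s != s.shift()).sum() - 1)  # 2 -- the consecutive comparison counts both changes
```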

Upvotes: 5
