FunnyChef
FunnyChef

Reputation: 1946

How to calculate data changes over time using Python

For the following dataframe, I need calculate the change in 'count', for each set of date, location_id, uid and include the set in the results.

# Sample DataFrame

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-01','2021-01-01','2021-01-02', '2021-01-02','2021-01-02'], 
                   'location_id':[1001,2001,3001, 1001,2001,3001],
                   'uid': ['001', '003', '002','001', '004','002'],
                   'uid_count':[1, 2,3 ,2, 2, 4]})

    date    location_id uid count
0   2021-01-01  1001    001 1
1   2021-01-01  2001    003 2
2   2021-01-01  3001    002 3
3   2021-01-02  1001    001 2
4   2021-01-02  2001    004 2
5   2021-01-02  3001    002 4

My desired results would look like:

# Desired Results

date        location_id  uid
2021-01-01  1001         001    0
            2001         003    0
            3001         002    0
2021-01-02  1001         001    1
            2001         004    0
            3001         002    1

I thought I could do this via groupby by using the following, but the desired calculation isn't made:

# Current code:
df.groupby(['date','location_id','uid'],sort=False).apply(lambda x: (x['count'].values[-1] - x['count'].values[0]))

# Current results:
date        location_id  uid
2021-01-01  1001         001    0
            2001         003    0
            3001         002    0
2021-01-02  1001         001    0
            2001         004    0
            3001         002    0

How can I get the desired results?

Upvotes: 1

Views: 959

Answers (1)

Trenton McKinney
Trenton McKinney

Reputation: 62413

  • The following code works with the test dataframe, I'm not certain about a larger dataframe
  • .transform() is used to calculate the differences for consecutive occurrences of 'uid_count', for each uid, with the same index as df.
  • The issue with .groupby(['date','location_id','uid'], is that each group only contains a single value.
  • Remove 'uid_count' at the end, with .drop(columns='uid_count'), if desired.
import pandas as pd

# sort the dataframe
df = df.sort_values(['date', 'location_id', 'uid'])

# groupby and transform based on the difference in uid_count
uid_count_diff = df.groupby(['location_id', 'uid']).uid_count.transform(lambda x: x.diff()).fillna(0).astype(int)

# create a column in df
df['uid_count_diff'] = uid_count_diff

# set the index
df = df.set_index(['date', 'location_id', 'uid'])

# result
                            uid_count  uid_count_diff
date       location_id uid                           
2021-01-01 1001        001          1               0
           2001        003          2               0
           3001        002          3               0
2021-01-02 1001        001          2               1
           2001        004          2               0
           3001        002          4               1

Upvotes: 1

Related Questions