Reputation: 2269
If I have a dataframe that has columns 'count', 'w', 'c', 'd' and 'y' and sample data looks like this:
131 1 dir mob 2017
244 1 dir mob 2018
311 1 org mob 2017
332 1 org mob 2018
212 2 dir dsk 2017
311 2 dir dsk 2018
401 2 org mob 2017
283 2 org mob 2018
What is the way to get the percentage change of the 'count' value grouped by 'w','c','d' for 2018 versus 2017?
Upvotes: 1
Views: 2270
Reputation: 18916
Well, you could sort your data (as you already have) and then calculate the change as the difference between count and count.shift(-1), divided by the 2017 value.
We assign that result to a dataframe from which we drop the duplicated rows and the columns we are not interested in:
import io
import pandas as pd
data = '''\
count w c d y
131 1 dir mob 2017
244 1 dir mob 2018
311 1 org mob 2017
332 1 org mob 2018
212 2 dir dsk 2017
311 2 dir dsk 2018
401 2 org mob 2017
283 2 org mob 2018'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')  # pd.compat.StringIO was removed in newer pandas
df.sort_values(by=['w','c','d','y'], inplace=True) # <--- sort values
diff = (df['count'].shift(-1) - df['count'])/df['count'][::2] # <--- calculate % change
dfnew = df.drop_duplicates(('w','c','d')).drop(['y','count'],axis=1).assign(diff=diff)
print(dfnew)
Returns:
w c d diff
0 1 dir mob 0.862595
2 1 org mob 0.067524
4 2 dir dsk 0.466981
6 2 org mob -0.294264
You can add this check to verify that each (w, c, d) group occurs exactly twice (one row per year):
sum(df[['w','c','d']].duplicated()) == len(df)/2 # <-- This should return True
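An equivalent sanity check can be written with groupby — a minimal, self-contained sketch using the question's data, asserting that every (w, c, d) combination has exactly one row per year:

```python
import io
import pandas as pd

data = """count w c d y
131 1 dir mob 2017
244 1 dir mob 2018
311 1 org mob 2017
332 1 org mob 2018
212 2 dir dsk 2017
311 2 dir dsk 2018
401 2 org mob 2017
283 2 org mob 2018"""

df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Every (w, c, d) combination should appear exactly twice:
# one row for 2017 and one for 2018.
sizes = df.groupby(["w", "c", "d"]).size()
assert (sizes == 2).all()

# The one-liner above expresses the same condition: exactly half of the
# rows are duplicates of an earlier (w, c, d) key.
assert sum(df[["w", "c", "d"]].duplicated()) == len(df) / 2
```

The groupby version has the advantage of pointing at the offending group if the assertion fails, since sizes is indexed by (w, c, d).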
Upvotes: 5
Reputation: 51365
Here are 2 ways: the first (with agg) is faster than the second (with transform), but for the sake of having options....
A nice thing is that both can deal with cases in which you have more than just 2 years to compare between.
Method 1
You can sort your dataframe by y, then use agg() with pct_change():
df['pct'] = df.sort_values('y').groupby(['w', 'c', 'd']).agg({'count':'pct_change'})
>>> df
count w c d y pct
0 131 1 dir mob 2017 NaN
1 244 1 dir mob 2018 0.862595
2 311 1 org mob 2017 NaN
3 332 1 org mob 2018 0.067524
4 212 2 dir dsk 2017 NaN
5 311 2 dir dsk 2018 0.466981
6 401 2 org mob 2017 NaN
7 283 2 org mob 2018 -0.294264
Method 2
Similar, but using transform() instead of agg():
df['pct'] = df.sort_values('y').groupby(['w', 'c', 'd']).transform(lambda x: x.pct_change())['count']
>>> df
count w c d y pct
0 131 1 dir mob 2017 NaN
1 244 1 dir mob 2018 0.862595
2 311 1 org mob 2017 NaN
3 332 1 org mob 2018 0.067524
4 212 2 dir dsk 2017 NaN
5 311 2 dir dsk 2018 0.466981
6 401 2 org mob 2017 NaN
7 283 2 org mob 2018 -0.294264
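In recent pandas versions the lambda is not needed: pct_change() can be called on the grouped column directly, which gives the same result. A minimal sketch, assuming the question's data and a pandas version where GroupBy.pct_change is available:

```python
import io
import pandas as pd

data = """count w c d y
131 1 dir mob 2017
244 1 dir mob 2018
311 1 org mob 2017
332 1 org mob 2018
212 2 dir dsk 2017
311 2 dir dsk 2018
401 2 org mob 2017
283 2 org mob 2018"""

df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Sort by year so pct_change compares each row with the previous year
# inside its (w, c, d) group; the result keeps the original index,
# so the assignment aligns back onto df correctly.
df["pct"] = df.sort_values("y").groupby(["w", "c", "d"])["count"].pct_change()
print(df)
```

As with the two methods above, each 2017 row gets NaN (nothing to compare against) and each 2018 row gets the change relative to its group's 2017 value.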
Upvotes: 4