Mark Ginsburg

Reputation: 2269

Getting a YoY percentage from a pandas dataframe

If I have a dataframe with columns 'count', 'w', 'c', 'd' and 'y', and the sample data looks like this:

131    1    dir    mob    2017
244    1    dir    mob    2018
311    1    org    mob    2017
332    1    org    mob    2018
212    2    dir    dsk    2017
311    2    dir    dsk    2018
401    2    org    mob    2017
283    2    org    mob    2018

What is the way to get the percentage change of the 'count' value, grouped by 'w', 'c' and 'd', for 2018 versus 2017?

Upvotes: 1

Views: 2270

Answers (2)

Anton vBR

Reputation: 18916

Well, you could sort your data (as it already is sorted) and then calculate the diff as the difference between count and count.shift().

We then assign that difference to the dataframe, after dropping the duplicated rows and the columns we are not interested in:

import io

import pandas as pd

data = '''\
count  w      c      d       y
131    1    dir    mob    2017
244    1    dir    mob    2018
311    1    org    mob    2017
332    1    org    mob    2018
212    2    dir    dsk    2017
311    2    dir    dsk    2018
401    2    org    mob    2017
283    2    org    mob    2018'''

df = pd.read_csv(io.StringIO(data), sep=r'\s+')  # pd.compat.StringIO was removed in newer pandas

df.sort_values(by=['w','c','d','y'], inplace=True) # <--- sort values 

diff = (df['count'].shift(-1) - df['count'])/df['count'][::2] # <--- % change, computed on every other (2017) row

dfnew = df.drop_duplicates(('w','c','d')).drop(['y','count'],axis=1).assign(diff=diff)

print(dfnew)

Returns:

   w    c    d      diff
0  1  dir  mob  0.862595
2  1  org  mob  0.067524
4  2  dir  dsk  0.466981
6  2  org  mob -0.294264

You can add this check to verify that every (w, c, d) combination appears exactly twice (one row per year):

sum(df[['w','c','d']].duplicated()) == len(df)/2  # <-- This should return True
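
The shift-based diff above relies on the 2017 and 2018 rows alternating within each sorted group. As an order-independent sanity check (this pivot approach is a sketch, not part of the answer above), you can put each year in its own column with pivot_table and compute the ratio directly:

```python
import io

import pandas as pd

data = '''\
count  w      c      d       y
131    1    dir    mob    2017
244    1    dir    mob    2018
311    1    org    mob    2017
332    1    org    mob    2018
212    2    dir    dsk    2017
311    2    dir    dsk    2018
401    2    org    mob    2017
283    2    org    mob    2018'''

df = pd.read_csv(io.StringIO(data), sep=r'\s+')

# One row per (w, c, d), one column per year; no dependence on row order
wide = df.pivot_table(index=['w', 'c', 'd'], columns='y', values='count')
wide['diff'] = (wide[2018] - wide[2017]) / wide[2017]
print(wide)
```

This reproduces the same four ratios, and it raises a KeyError if a year column is missing entirely rather than silently pairing the wrong rows.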

Upvotes: 5

sacuL

Reputation: 51365

Here are two ways: the first (with agg) is faster than the second (with transform), but it's good to have options.

A nice property of both is that they handle more than just two years of data.

Method 1

You can sort your dataframe by y, then use agg() with pct_change():

df['pct'] = df.sort_values('y').groupby(['w', 'c', 'd']).agg({'count':'pct_change'})  # newer pandas rejects non-aggregating functions in agg(); use .groupby([...])['count'].pct_change() there

>>> df
   count  w    c    d     y       pct
0    131  1  dir  mob  2017       NaN
1    244  1  dir  mob  2018  0.862595
2    311  1  org  mob  2017       NaN
3    332  1  org  mob  2018  0.067524
4    212  2  dir  dsk  2017       NaN
5    311  2  dir  dsk  2018  0.466981
6    401  2  org  mob  2017       NaN
7    283  2  org  mob  2018 -0.294264

Method 2

Similar, but using transform() instead of agg:

df['pct'] = df.sort_values('y').groupby(['w', 'c', 'd']).transform(lambda x: x.pct_change())['count']

>>> df
   count  w    c    d     y       pct
0    131  1  dir  mob  2017       NaN
1    244  1  dir  mob  2018  0.862595
2    311  1  org  mob  2017       NaN
3    332  1  org  mob  2018  0.067524
4    212  2  dir  dsk  2017       NaN
5    311  2  dir  dsk  2018  0.466981
6    401  2  org  mob  2017       NaN
7    283  2  org  mob  2018 -0.294264
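
As noted above, both methods extend beyond two years: pct_change chains across consecutive years within each group. A quick sketch with a hypothetical 2019 row added to two of the groups (the extra counts are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical three-year sample; the 2019 counts are invented for this sketch
data = '''\
count  w    c    d     y
131    1  dir  mob  2017
244    1  dir  mob  2018
250    1  dir  mob  2019
212    2  dir  dsk  2017
311    2  dir  dsk  2018
280    2  dir  dsk  2019'''

df = pd.read_csv(io.StringIO(data), sep=r'\s+')

# Within each (w, c, d) group, each row's pct is relative to the previous year
df['pct'] = df.sort_values('y').groupby(['w', 'c', 'd'])['count'].pct_change()
print(df)
```

Each group's first year gets NaN, and every later year is compared against the year immediately before it, however many years there are.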

Upvotes: 4
