Matt W.
Matt W.

Reputation: 3722

calculating unique percent change in dataframe series groupby

I'm trying to calculate a version of percent change where the formula is (b-a)/(b+a) (if both a and b are 0, then it should return 0)

I need to groupby a dataframe, sort it, and then apply this function to each group.

I'm running into errors relating to multi-indexes amongst other things.

Data:

import random
import pandas as pd

random.seed(42)
group = [random.choice(['a', 'b', 'c', 'd', 'e']) for _ in range(300)]
group2 = [random.choice(['a', 'b', 'c', 'd', 'e']) for _ in range(300)]
x = [random.randint(1, 1000) for _ in range(300)]
number = [random.randint(1, 1000) for _ in range(300)]

df = pd.DataFrame({'group':group, 'group2':group2, 'number':number, 'x':x})
df.sort_values('number').groupby(['group', 'group2'])['x'].groups.keys()

# lets look at a specific group:
df.sort_values('number').groupby(['group', 'group2'])['x'].get_group(('a', 'a'))
206    349
1      226
114    965
48     771
228    662
157    471
128    701
201    500

I need to group by the two groups, sort by the number, and then I want the difference by

(b-a)/(a+b)

I've tried creating a solution thats hacky and doesn't really work when applied against a grouped dataframe. One other thing, I did pd.concat() to create a dataframe because the index was coming back as 0+ when in reality the grouped dataframe had different indexes. I did this to retain the index. I know there must be better ways.

def relative_diff(x):
    df = pd.concat([x.shift(1), x], axis=1).assign(newcol=np.nan)
    for i, a, b in zip(range(len(df)), df.iloc[:,0], df.iloc[:,1]):
        if a == 0 and b == 0:
            df.iloc[i, 2] = 0
        else:
            df.iloc[i, 2] = ((b-a)/(b+a))
    return df.iloc[:,2]

when I apply this against a specific group I get exactly what I'm looking for:

example = df.sort_values('number').groupby(['group', 'group2'])['x'].get_group(('a', 'a'))

relative_diff(example)

206         NaN
1     -0.213913
114    0.620487
48    -0.111751
228   -0.076064
157   -0.168579
128    0.196246
201   -0.167361

I'm now trying to .apply() this function to the grouped dataframe:

df.sort_values('number').groupby(['group', 'group2'])['x'].apply(relative_diff)

I get the cannot handle a non-unique multi-index! error, and I'm stuck.

EDIT:

After reviewing WeNYoBen's answer, it works perfectly for the above dataset, but for whatever reason when I look at the first group of my actual data, I'm seeing:

144433    11
535075     6
725163     5
211134     3
89080      3
236916     1
593010     1
648680     2
56196      1
404572     2
724103     3
663501     7

its giving me the error Buffer has wrong number of dimensions (expected 1, got 0)

Upvotes: 1

Views: 45

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150735

IIUC, you just want to calculate the shifted series first, then can perform normal operations:

df = df.sort_values('number')

shifted = df.groupby(['group', 'group2'])['x'].shift()
(shifted - df['x']).div(shifted + df['x']).fillna(0)

Output:

134    0.000000
7      0.050592
103    0.000000
204    0.000000
98     0.000000
         ...   
21     0.055165
229   -0.204819
185    0.372709
263    0.116757
110   -0.161116
Length: 300, dtype: float64

Upvotes: 0

BENY
BENY

Reputation: 323226

We can do

pd.concat([relative_diff(y) for x , y in df.sort_values('number').groupby(['group', 'group2'])['x']])

Upvotes: 1

Related Questions