Reputation: 3722
I'm trying to calculate a version of percent change where the formula is (b-a)/(b+a)
(if both a and b are 0, then it should return 0)
I need to groupby a dataframe, sort it, and then apply this function to each group.
I'm running into errors relating to multi-indexes amongst other things.
import random
import pandas as pd
random.seed(42)
group = [random.choice(['a', 'b', 'c', 'd', 'e']) for _ in range(300)]
group2 = [random.choice(['a', 'b', 'c', 'd', 'e']) for _ in range(300)]
x = [random.randint(1, 1000) for _ in range(300)]
number = [random.randint(1, 1000) for _ in range(300)]
df = pd.DataFrame({'group':group, 'group2':group2, 'number':number, 'x':x})
df.sort_values('number').groupby(['group', 'group2'])['x'].groups.keys()
# lets look at a specific group:
df.sort_values('number').groupby(['group', 'group2'])['x'].get_group(('a', 'a'))
206 349
1 226
114 965
48 771
228 662
157 471
128 701
201 500
I need to group by the two groups, sort by the number, and then I want the difference by
(b-a)/(a+b)
I've tried creating a solution thats hacky and doesn't really work when applied against a grouped dataframe. One other thing, I did pd.concat()
to create a dataframe because the index was coming back as 0+ when in reality the grouped dataframe had different indexes. I did this to retain the index. I know there must be better ways.
def relative_diff(x):
df = pd.concat([x.shift(1), x], axis=1).assign(newcol=np.nan)
for i, a, b in zip(range(len(df)), df.iloc[:,0], df.iloc[:,1]):
if a == 0 and b == 0:
df.iloc[i, 2] = 0
else:
df.iloc[i, 2] = ((b-a)/(b+a))
return df.iloc[:,2]
when I apply this against a specific group I get exactly what I'm looking for:
example = df.sort_values('number').groupby(['group', 'group2'])['x'].get_group(('a', 'a'))
relative_diff(example)
206 NaN
1 -0.213913
114 0.620487
48 -0.111751
228 -0.076064
157 -0.168579
128 0.196246
201 -0.167361
I'm now trying to .apply()
this function to the grouped dataframe:
df.sort_values('number').groupby(['group', 'group2'])['x'].apply(relative_diff)
I get the cannot handle a non-unique multi-index!
error, and I'm stuck.
EDIT:
After reviewing WeNYoBen's answer, it works perfectly for the above dataset, but for whatever reason when I look at the first group of my actual data, I'm seeing:
144433 11
535075 6
725163 5
211134 3
89080 3
236916 1
593010 1
648680 2
56196 1
404572 2
724103 3
663501 7
its giving me the error Buffer has wrong number of dimensions (expected 1, got 0)
Upvotes: 1
Views: 45
Reputation: 150735
IIUC, you just want to calculate the shifted series first, then can perform normal operations:
df = df.sort_values('number')
shifted = df.groupby(['group', 'group2'])['x'].shift()
(shifted - df['x']).div(shifted + df['x']).fillna(0)
Output:
134 0.000000
7 0.050592
103 0.000000
204 0.000000
98 0.000000
...
21 0.055165
229 -0.204819
185 0.372709
263 0.116757
110 -0.161116
Length: 300, dtype: float64
Upvotes: 0
Reputation: 323226
We can do
pd.concat([relative_diff(y) for x , y in df.sort_values('number').groupby(['group', 'group2'])['x']])
Upvotes: 1