Ankit Goel

Reputation: 360

Pandas pivot table with conditional aggfunc

My pandas dataframe is as follows:

import pandas as pd

df = pd.DataFrame({"PAR NAME":['abc','def','def','def','abc'], "value":[1,2,3,4,5],"DESTCD":['E','N','E','E','S']})

I need to pivot df on PAR NAME and find out what percentage of its value comes from rows where DESTCD is 'E'. Something like this (which obviously didn't work!):

df.pivot_table(index="PAR NAME",values=["value"],aggfunc={'value':lambda x: (x.sum() if x["DESTCD"]=="E")*100.0/x.sum()})

I am currently doing this by adding a conditional column, summing it along with 'value' in the pivot, and then dividing, but my dataset is huge (1 GB+) and there has to be an easier way.
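For reference, the conditional-column workaround looks roughly like this (a sketch of what I described above, continuing from the df defined earlier; the helper column name e_value is just for illustration):

# Helper column that keeps 'value' only where DESTCD is 'E' and is 0 otherwise.
df['e_value'] = df['value'] * (df['DESTCD'] == 'E')

# Sum both columns per PAR NAME, then divide to get the percentage.
piv = df.pivot_table(index="PAR NAME", values=["value", "e_value"], aggfunc="sum")
piv['pct_E'] = piv['e_value'] * 100.0 / piv['value']
print(piv['pct_E'])
# PAR NAME
# abc    16.666667
# def    77.777778
# Name: pct_E, dtype: float64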

Edit: Expected output:

abc 16.67 (the 'E' portion of abc is 1 out of a total of 6)
def 77.78 (the 'E' portion of def is 7 out of a total of 9)

(Note: please don't recommend slicing into multiple dataframes; as mentioned, my data is huge and efficiency is critical :) )

Upvotes: 0

Views: 7304

Answers (3)

Ankit Goel

Reputation: 360

I also found a way to answer the question via pivot_table, and it is about as efficient as the accepted answer. Adding it here for the convenience of others:

df.pivot_table(index="PAR NAME",values=["value"],aggfunc={'value':lambda x: x[df.iloc[x.index]['DESTCD']=='E'].sum()*100.0/x.sum()})

The logic is that aggfunc only sees the series being aggregated, so it cannot reference any other column unless you reach back into the main df via the series' index.
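As a side note, df.iloc[x.index] relies on df having its default RangeIndex, where positions and labels coincide; a label-based sketch of the same trick, which also works when the index is not 0..n-1:

import pandas as pd

df = pd.DataFrame({"PAR NAME":['abc','def','def','def','abc'], "value":[1,2,3,4,5],"DESTCD":['E','N','E','E','S']})

# df.loc[x.index, 'DESTCD'] looks the group's rows up by label rather than by position.
out = df.pivot_table(
    index="PAR NAME",
    values=["value"],
    aggfunc={'value': lambda x: x[df.loc[x.index, 'DESTCD'] == 'E'].sum() * 100.0 / x.sum()},
)
print(out)
#               value
# PAR NAME
# abc       16.666667
# def       77.777778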

Upvotes: 2

Bharath M Shetty

Reputation: 30605

Instead of a pivot table you can use two groupby aggregations on PAR NAME and divide one by the other, i.e.

new = df[df['DESTCD']=='E'].groupby('PAR NAME')['value'].sum()*100/df.groupby('PAR NAME')['value'].sum()

Output:

PAR NAME
abc    16.666667
def    77.777778
Name: value, dtype: float64

If you want timings:

%%timeit
df[df['DESTCD']=='E'].groupby('PAR NAME')['value'].sum()*100/df.groupby('PAR NAME')['value'].sum()

100 loops, best of 3: 4.03 ms per loop

And on a larger frame (concatenating the sample 10,000 times):

%%timeit
df = pd.concat([df]*10000)
df[df['DESTCD']=='E'].groupby('PAR NAME')['value'].sum()*100/df.groupby('PAR NAME')['value'].sum()

100 loops, best of 3: 15.6 ms per loop
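One thing to watch with the two-groupby division: if a PAR NAME never has DESTCD == 'E', it is missing from the numerator and shows up as NaN after the division. A minimal sketch that reindexes the numerator, assuming 0% is the desired result for such groups:

num = df[df['DESTCD']=='E'].groupby('PAR NAME')['value'].sum()
den = df.groupby('PAR NAME')['value'].sum()
# Align the numerator to every PAR NAME in the denominator so groups without
# any 'E' rows become 0 instead of dropping out as NaN.
pct = num.reindex(den.index, fill_value=0)*100/den
print(pct)
# PAR NAME
# abc    16.666667
# def    77.777778
# Name: value, dtype: float64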

Upvotes: 2

Simon

Reputation: 333

I tried to solve the problem without specifically referencing 'E', so it generalizes to any DESTCD value. The output is a dataframe that you can then index on 'E' to get your answer. I simply did the aggregation separately and then used an efficient join.

df = pd.DataFrame({"PAR NAME":['abc','def','def','def','abc'], "value":[1,2,3,4,5],"DESTCD":['E','N','E','E','S']})

# First groupby 'DESTCD' and 'PAR NAME'
gb = df.groupby(['DESTCD', 'PAR NAME'], as_index=False).sum()
print(gb)
  DESTCD PAR NAME  value
0      E      abc      1
1      E      def      7
2      N      def      2
3      S      abc      5

gb_parname = gb.groupby(['PAR NAME']).sum()
out = gb.join(gb_parname, on='PAR NAME', rsuffix='Total')
print(out)
  DESTCD PAR NAME  value  valueTotal
0      E      abc      1           6
1      E      def      7           9
2      N      def      2           9
3      S      abc      5           6

out.loc[:, 'derived']= out.apply(lambda df: df.value/df.valueTotal, axis=1)

print(out)
  DESTCD PAR NAME  value  valueTotal   derived
0      E      abc      1           6  0.166667
1      E      def      7           9  0.777778
2      N      def      2           9  0.222222
3      S      abc      5           6  0.833333

It's also a relatively efficient operation:

%%timeit
gb = df.groupby(['DESTCD', 'PAR NAME'], as_index=False).sum()
gb_parname = gb.groupby(['PAR NAME']).sum()
out = gb.join(gb_parname, on='PAR NAME', rsuffix='Total')
out.loc[:, 'derived']= out.apply(lambda df: df.value/df.valueTotal, axis=1)
100 loops, best of 3: 6.31 ms per loop
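A small follow-up on the last step: the axis=1 apply runs row by row, so on large frames a plain column division gives the same 'derived' column and typically runs faster; a minimal sketch:

# Vectorized alternative to the row-wise apply: divide the aligned columns directly.
out.loc[:, 'derived'] = out['value'] / out['valueTotal']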

Upvotes: 2
