Jane Wilkie
Jane Wilkie

Reputation: 1743

Conditional Summing on python dataframe

I'm just getting into Pandas and trying to generate a spreadsheet for a car lot. I'm loving Pandas but it's slow going and I'm trying to generate some new columns that sum ...

import pandas as pd

data = pd.DataFrame({"Car":["Hyundai","Hyundai","Honda", "Honda"], "Type":["Accent", "Accent", "Civic", "Civic"], "Trans":["Auto", "Manual", "Auto", "Manual"], "TOTAL":[2,4,5,3]})

print data

print data.groupby(['Car', 'Type', 'Trans'])['TOTAL'].sum()

I'm getting the totally predictable ....

       Car  TOTAL   Trans    Type
0  Hyundai      2    Auto  Accent
1  Hyundai      4  Manual  Accent
2    Honda      5    Auto   Civic
3    Honda      3  Manual   Civic

Car      Type    Trans 
Honda    Civic   Auto      5
                 Manual    3
Hyundai  Accent  Auto      2
                 Manual    4

Ideally what I'd love to pull off is.....

Car       Type    Auto    Manual  Total
Honda     Civic     5        3      8
Hyundai   Accent    2        4      6

My knowledge isn't that great of Pandas (yet), but I'm guessing it's an "apply" or an agg() function but so far, syntactically, I'm banging my head from the syntax errors, but I appreciate any pointers in the right direction. .. JW

Upvotes: 1

Views: 247

Answers (2)

David Maust
David Maust

Reputation: 8270

You can prepare two new series ahead of time in the dataframe with auto and manual counts.

data['total_manual'] = data['TOTAL'] * (data['Trans'] == 'Manual').astype(int)
data['total_auto'] = data['TOTAL'] * (data['Trans'] == 'Auto').astype(int)
print data.groupby(['Car', 'Type'])['total_auto', 'total_manual', 'TOTAL'].sum()

Also a similar approach is to use a pivot table with margins.

pvt = pd.pivot_table(data, index=['Car', 'Type'], columns='Trans', values='TOTAL', margins='columns', aggfunc=np.sum)
pvt = pvt.drop(('All',''), axis=0)

Upvotes: 1

Stefan
Stefan

Reputation: 42875

To use the built-in pandas methods, you could: set your 'Car', 'Type', 'Trans' columns as index and unstack() to get the Total for each subgroup, then just sum over the columns:

data = pd.DataFrame({"Car":["Hyundai","Hyundai","Honda", "Honda"], "Type":["Accent", "Accent", "Civic", "Civic"], "Trans":["Auto", "Manual", "Auto", "Manual"], "TOTAL":[2,4,5,3]}).set_index(['Car', 'Type', 'Trans'])

total_by_trans = data.unstack().loc[:, 'TOTAL']         # to get rid of the column MultiIndex created by unstack()
total_by_trans['Total'] = total_by_trans.sum(axis=1)    
total_by_trans.columns.name = None                      # just cleaning up

                Auto  Manual  Total
Car     Type                       
Honda   Civic      5       3      8
Hyundai Accent     2       4      6

Upvotes: 3

Related Questions