Reputation: 23
Newbie trying to break my addiction to excel. I have a data set of paid invoices with the vendor and country where it was paid along with the amount. I want know for each vendor, which country they have the greatest invoice amount and what percentage of their total business is in that country. Using this data set I want the result to be:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
'Pct' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
CoCntry = df.groupby(['Company', 'Country'])
CoCntry.aggregate(np.sum)
After looking at multiple examples including: Extract row with max value and Getting max value using groupby
2: Python : Getting the Row which has the max value in groups using groupby I've gotten as far as creating a DataFrameGroupBy summarizing the invoice data by country. I’m struggling with how to find the max row. After which I must figure out how to calculate the percent. Advice welcome.
Upvotes: 2
Views: 295
Reputation: 863361
You can use transform
for return Series
Pct
of summed values per groups by first level Company
. Then filter Dataframe
by max value per groups with idxmax
and last divide Amount
column with Series
Pct
:
g = CoCntry.groupby(level='Company')['Amount']
Pct = g.transform('sum')
print (Pct)
Company Country
bar one 25
three 25
two 25
foo one 28
three 28
two 28
Name: Amount, dtype: int64
CoCntry = CoCntry.loc[g.idxmax()]
print (CoCntry)
Amount Pct
Company Country
bar one 11 0
foo two 11 0
CoCntry.Pct = CoCntry.Amount.div(Pct)
print (CoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
Similar another solution:
CoCntry = df.groupby(['Company', 'Country']).Amount.sum()
print (CoCntry)
Company Country
bar one 11
three 4
two 10
foo one 9
three 8
two 11
Name: Amount, dtype: int64
g = CoCntry.groupby(level='Company')
Pct = g.sum()
print (Pct)
Company
bar 25
foo 28
Name: Amount, dtype: int64
maxCoCntry = CoCntry.loc[g.idxmax()].to_frame()
maxCoCntry['Pct'] = maxCoCntry.Amount.div(Pct, level=0)
print (maxCoCntry.reset_index())
Company Country Amount Pct
0 bar one 11 0.440000
1 foo two 11 0.392857
Upvotes: 2
Reputation: 294506
setup
df = pd.DataFrame({'Company' : ['bar','foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo', 'bar'],
'Country' : ['two','one', 'one', 'two', 'three', 'two', 'two', 'one', 'three', 'one'],
'Amount' : [4, 2, 2, 6, 4, 5, 6, 7, 8, 9],
})
solution
# sum total invoice per country per company
comp_by_country = df.groupby(['Company', 'Country']).Amount.sum()
# sum total invoice per company
comp_totals = df.groupby('Company').Amount.sum()
# percent of per company per country invoice relative to company
comp_by_country_pct = comp_by_country.div(comp_totals).rename('Pct')
answer to OP question
Which 'Country'
has greatest total invoice for 'Company'
and what percentage of that companies total business.
comp_by_country_pct.loc[
comp_by_country_pct.groupby(level=0).idxmax()
].reset_index()
Company Country Pct
0 bar one 0.440000
1 foo two 0.392857
Upvotes: 2