Reputation: 65
This is very similar to the question i asked yesterday. The aim is to be able to add a functionality which will allow for a column to be created depending on the value shown in another. For example, when it finds a country code in a specified file, i would like it to create a column with the name 'Country Code Total', and sum the amount of units for every row with that same country code
This is what my script outputs at the moment:
What i want to see:
My Script:
df['Sum of Revenue'] = df['Units Sold'] * df['Dealer Price']
df['AR Revenue'] = df[]
df = df.sort_values(['End Consumer Country', 'Currency Code'])
# Sets first value of index by position
df.loc[df.index[0], 'Unit Total'] = df['Units Sold'].sum()
# Sets first value of index by position
df.loc[df.index[0], 'Total Revenue'] = df['Sum of Revenue'].sum()
# Sums the amout of Units with the End Consumer Country AR
df['AR Total'] = df.loc[df['End Consumer Country'] == 'AR', 'Units Sold'].sum()
# Sums the amount of Units with the End Consumer Country AU
df['AU Total'] = df.loc[df['End Consumer Country'] == 'AU', 'Units Sold'].sum()
# Sums the amount of Units with the End Consumer Country NZ
df['NZ Total'] = df.loc[df['End Consumer Country'] == 'NZ', 'Units Sold'].sum()
However, as i know the countries that will come up in this file, i have added them accordingly to my script to find. How would i write my script so that if it finds another country code, for example GB, it would create a column called 'GB Total' and sum the units for every row with the country code set to GB.
Any help would be greatly appreciated!
Upvotes: 0
Views: 73
Reputation: 59579
If you truly need that format, then here is how I would proceed (starting data below):
# Get those first two columns
d = {'Sum of Revenue': 'Total Revenue', 'Units Sold': 'Total Sold'}
for col, newcol in d.items():
df.loc[df.index[0], newcol] = df[col].sum()
# Add the rest for every country:
s = df.groupby('End Consumer Country')['Units Sold'].sum().to_frame().T.add_suffix(' Total')
s.index = [df.index[0]]
df = pd.concat([df, s], 1, sort=False)
df
: End Consumer Country Sum of Revenue Units Sold Total Revenue Total Sold AR Total AU Total NZ Total US Total
a AR 13.486216 1 124.007334 28.0 3.0 7.0 11.0 7.0
b AR 25.984073 2 NaN NaN NaN NaN NaN NaN
c AU 21.697871 3 NaN NaN NaN NaN NaN NaN
d AU 10.962232 4 NaN NaN NaN NaN NaN NaN
e NZ 16.528398 5 NaN NaN NaN NaN NaN NaN
f NZ 29.908619 6 NaN NaN NaN NaN NaN NaN
g US 5.439925 7 NaN NaN NaN NaN NaN NaN
As you can see, pandas
added a bunch of NaN
values as we only assigned something to the first row, and a DataFrame
must be rectangular
It's far simpler to have a different DataFrame
that summarizes the totals and within each country. If this is fine, then everything simplifies to a single .pivot_table
df.pivot_table(index='End Consumer Country',
values=['Sum of Revenue', 'Units Sold'],
margins=True,
aggfunc='sum').T.add_suffix(' Total)
End Consumer Country AR Total AU Total NZ Total US Total All Total
Sum of Revenue 39.470289 32.660103 46.437018 5.439925 124.007334
Units Sold 3.000000 7.000000 11.000000 7.000000 28.000000
Same information, much simpler to code.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'End Consumer Country': ['AR', 'AR', 'AU', 'AU', 'NZ', 'NZ', 'US'],
'Sum of Revenue': np.random.normal(20,6,7),
'Units Sold': np.arange(1,8,1)},
index = list('abcdefg'))
End Consumer Country Sum of Revenue Units Sold
a AR 13.486216 1
b AR 25.984073 2
c AU 21.697871 3
d AU 10.962232 4
e NZ 16.528398 5
f NZ 29.908619 6
g US 5.439925 7
Upvotes: 1