Efficient way to create a new column that is the sum of unique values grouped by another column?

Question

I have a dataframe which has a name and size column. I need to create a new column (size_total) that is the total of all unique size values grouped by the name column.

df = pd.DataFrame([     
    {'sp': 'A', 'size': 25},     
    {'sp': 'B', 'size': 26},     
    {'sp': 'C', 'size': 27},     
    {'sp': 'A', 'size': 25},     
    {'sp': 'B', 'size': 42},     
    {'sp': 'C', 'size': 27},     
    {'sp': 'A', 'size': 25},     
    {'sp': 'B', 'size': 32},     
    {'sp': 'C', 'size': 33}])

df['size_total'] = df['sp'].map(df[['sp', 'size']].drop_duplicates(keep='first').groupby('sp').sum()['size'])

Is there a more efficient way to do this?

cs95 · Accepted Answer

If you set "sp" as the index first, you can shave off ~30% in timings from your original solution by making pandas do the mapping.

df.set_index(df['sp'].values, inplace=True)
df['size_total'] = (
    df.drop_duplicates(['sp', 'size']).groupby('sp')['size'].sum(level=0))

Efficient way to create a new column that is the sum of unique values grouped by another column?

Answers (1)

Related Questions