Adam

Reputation: 990

Dataframe Groupby to combine multiple rows, summing float-type columns

Terrible title, but here goes. I have a 13,000 x 91 dataframe; 26 of the columns are numeric. The rows are individual projects, with project performance split by year. Like this:

| Year | Control | Description   | USD_Cost | USD_Profit |
|------|---------|---------------|----------|------------|
| 1991 | A1      | A description | 1        | 2          |
| 1992 | A1      | A Description | 100      | 300        |
| 1991 | B1      | B Description | 3        | 50         |
| 1995 | C1      | C Description | 5        | 10         |
| 1990 | D1      | D Description | 2        | 1          |
| 1996 | D1      | D Description | 1        | 1          |

Rather than recording how each project performed in each particular year, I just want to record how long each project lasted, and the overall project performance:

| Years | Control | Description   | USD_Cost | USD_Profit |
|-------|---------|---------------|----------|-----------|
| 2     | A1      | A description | 101      | 302       |
| 1     | B1      | B Description | 3        | 50        |
| 1     | C1      | C Description | 5        | 10        |
| 2     | D1      | D Description | 3        | 2         |

Control and Description are unchanged, but the numeric columns beginning with USD are summed across rows. There are 26 USD columns for different performance aspects, and about 8,000 unique Control numbers across 13,000 total Year-Control combinations.

I know how to groupby for one element (e.g. `print(dft.groupby(['Control'])[['USD_Cost', 'USD_Profit']].sum())`), but when I do that, I think I am losing all the non-numeric columns. Also, I would like to avoid typing in the names of all 26 USD columns.

Can this be done with groupby?

Upvotes: 0

Views: 769

Answers (3)

igrinis

Reputation: 13666

This is a really common kind of operation, and pandas has an elegant way of doing it. To avoid the tedious task of spelling out 26 summation entries by hand, I use a dictionary comprehension.

First you define a dictionary of actions by column, then you pass it to the agg function:

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description",
        "A description",
        "B description",
        "C description",
        "D description",
        "D description"
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

actions = {'Year': pd.Series.nunique,
           'Description': lambda x: x.iloc[0]}
actions.update({x: sum for x in df.columns if x.startswith('USD_')})

df.groupby('Control').agg(actions).reset_index()

And this produces:

  Control  Year    Description  USD_Cost  USD_Profit
0      A1     2  A description       101         302
1      B1     1  B description         3          50
2      C1     1  C description         5          10
3      D1     2  D description         3           2
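For newer pandas versions, the same dictionary can use string aliases instead of function objects, which avoids the lambda entirely. A minimal sketch of that variant, using the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": ["A description", "A description", "B description",
                    "C description", "D description", "D description"],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

# String aliases instead of callables: 'nunique' counts distinct years,
# 'first' keeps the description, 'sum' totals every USD_ column.
actions = {"Year": "nunique", "Description": "first"}
actions.update({c: "sum" for c in df.columns if c.startswith("USD_")})

result = df.groupby("Control").agg(actions).reset_index()
```

The string aliases produce the same output as the callable version and read a little more clearly in the dictionary.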

Upvotes: 1

Luapulu

Reputation: 192

My solution would be to group by "Control" and then apply a function to each group that takes all non-numeric data from the first row (I'm assuming non-numeric data is identical within a project) but sums all numeric data. "Year" is treated separately, since you want the number of years rather than their sum.

My code:

import pandas as pd
import numpy as np


def sum_project(project):
    # Since non-numeric data is identical within a group,
    # we start from a copy of the first row
    project_summed = project.iloc[0, :].copy()

    # sum all numeric data but exclude "Year"
    cols_numeric = project.select_dtypes([np.number]).columns
    cols_numeric = cols_numeric.drop(["Year"])
    project_summed[cols_numeric] = project[cols_numeric].sum()

    # Get year number
    project_summed["Year"] = len(project)

    return project_summed


df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description",
        "A description",
        "B description",
        "C description",
        "D description",
        "D description"
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

final_df = df.groupby(["Control"]).apply(sum_project)

This gives final_df:

         Year Control    Description  USD_Cost  USD_Profit
Control                                                   
A1          2      A1  A description       101         302
B1          1      B1  B description         3          50
C1          1      C1  C description         5          10
D1          2      D1  D description         3           2
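Note that "Control" appears twice here, as the group index and as a column. Since the column already carries that information, the index can simply be dropped. A small self-contained sketch of the same approach with that cleanup:

```python
import pandas as pd
import numpy as np


def sum_project(project):
    # Non-numeric data is identical within a group, so start
    # from a copy of the first row
    project_summed = project.iloc[0, :].copy()

    # Sum all numeric columns except "Year"
    cols_numeric = project.select_dtypes([np.number]).columns.drop(["Year"])
    project_summed[cols_numeric] = project[cols_numeric].sum()

    # One row per year, so the row count is the project duration
    project_summed["Year"] = len(project)
    return project_summed


df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": ["A description", "A description", "B description",
                    "C description", "D description", "D description"],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

# reset_index(drop=True) removes the group-key index, since
# "Control" is already present as a regular column
final_df = df.groupby("Control").apply(sum_project).reset_index(drop=True)
```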

Upvotes: 1

I think this should work for you:

columns = list(filter(lambda x: 'USD' in x, df.columns))
df.groupby(['Control', 'Description'])[columns].sum()

That gives you all the USD columns summed, grouped by Control and Description. Since Description is constant within each Control, including it in the groupby keeps it in the result without having to type out the 26 USD column names.
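One thing this approach leaves out is the year count from the question. A hedged sketch of adding it back onto the same groupby via nunique, using the sample data from the other answers:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": ["A description", "A description", "B description",
                    "C description", "D description", "D description"],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

# Collect the USD columns programmatically, as in the answer above
usd_cols = [c for c in df.columns if 'USD' in c]
grouped = df.groupby(['Control', 'Description'])

# Sum the USD columns, then attach the number of distinct
# years per project as a leading "Years" column
summed = grouped[usd_cols].sum()
summed.insert(0, 'Years', grouped['Year'].nunique())
final = summed.reset_index()
```

The nunique Series aligns on the same group keys, so the insert lines up row for row with the summed totals.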

Upvotes: 1
