veg2020

Reputation: 1020

groupby pandas for mean and sum under certain condition

I am trying to generate summary statistics for a very large dataframe with hundreds of columns, and to summarize each column's relationship with an "outcome" column of interest. A simplified dataframe can be generated with the code below:

import pandas as pd

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})

What I want to do is determine, for each column, its relationship with the outcome feature in terms of proportion (mean) and sum.

Currently, I am doing this a few columns at a time as follows:

df1 = df1.groupby("outcome")[["time1", "time2", "time3"]].agg(["mean", "sum"]).reset_index()
      
df1[df1["outcome"] == 1].T

This results in a fairly confusing dataframe as follows:

                1
outcome       1.0
time1   mean  0.0
        sum   0.0
time2   mean  0.5
        sum   1.0
time3   mean  0.5
        sum   1.0

How can I improve this output to show for each column the mean and sum in individual columns? Something like the output shown below.

       mean  sum
time1     0    0
time2   0.5    1
time3   0.5    1

Ideally I would like to do this for hundreds of columns in my dataframe and evaluate their relationship with outcome.

So can somebody please guide me to a solution that works for hundreds of columns (without typing their names individually) and produces a clean dataframe like the example output above? Much appreciated!

Upvotes: 2

Views: 219

Answers (1)

Henry Ecker

Reputation: 35646

As mentioned by @sammywemmy, we can unstack after calculating the values. We can also use loc instead of reset_index to select outcome==1 from the index:

df1 = (
    df1.groupby("outcome")
        .agg(["mean", "sum"])  # Perform Aggregations
        .loc[1]  # Select outcome==1 from index
        .unstack()  # convert index to columns
)
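For reference, here is a self-contained version of this first approach using the sample data from the question (the result is stored in a separate variable, `result`, so the original frame is preserved):

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})

result = (
    df1.groupby("outcome")
        .agg(["mean", "sum"])  # MultiIndex columns: (time*, mean/sum)
        .loc[1]                # Series indexed by (time*, mean/sum)
        .unstack()             # rows: time*, columns: mean/sum
)
print(result)
```

Because the column list is never spelled out, this works unchanged no matter how many `time*`-style columns the frame has.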

We could also filter before the groupby/agg, then stack and droplevel:

df1 = (
    df1[df1["outcome"] == 1]  # Filter DataFrame
        .groupby("outcome")  # Groupby
        .agg(["mean", "sum"])  # Perform Aggregations
        .stack(0)  # Convert columns to rows
        .droplevel(0)  # Drop outcome==1
)

Or set_index + stack first, then groupby/agg on the index:

df1 = (
    df1.set_index('outcome').stack()  # Convert time columns to rows
        .groupby(level=[0, 1])  # Groupby
        .agg(['mean', 'sum'])  # Perform Aggregations
        .loc[1]  # Select outcome==1 from index
)

Or with pivot_table and multiple aggregation functions:

df1 = (
    df1.pivot_table(index='outcome', aggfunc=['mean', 'sum'])
        .loc[1]  # Select outcome==1 from index
        .unstack(0)  # convert inner index to columns
)
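As a side note (an extension beyond what the question asks for, not part of the answer above): dropping the `.loc[1]` selection keeps both outcome groups, which can be handy when you want to compare `outcome == 0` against `outcome == 1` side by side. A minimal sketch with the question's sample data:

```python
import pandas as pd

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})

both = (
    df1.pivot_table(index="outcome", aggfunc=["mean", "sum"])
        .stack()  # index: (outcome, time*), columns: mean/sum
)
print(both)
```

Each row is then keyed by an `(outcome, column)` pair, so both groups' statistics sit in one frame.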

All produce:

       mean  sum
time1   0.0  0.0
time2   0.5  1.0
time3   0.5  1.0

Upvotes: 2
