Reputation: 1020
I am trying to generate summary statistics for a very large dataframe of hundreds of columns and summarize their relationship with an "outcome" of interest. A simplified dataframe for this can be generated using the code below:
import pandas as pd

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})
What I want to do is determine for each column their relationship with the outcome feature in terms of proportion and sum.
Currently, I am doing this a few columns at a time as follows:
df1 = df1.groupby("outcome")[["time1", "time2", "time3"]].agg(["mean", "sum"]).reset_index()
df1[df1["outcome"] == 1].T
This results in a fairly confusing dataframe as follows:
                 1
outcome        1.0
time1   mean   0.0
        sum    0.0
time2   mean   0.5
        sum    1.0
time3   mean   0.5
        sum    1.0
How can I improve this output to show for each column the mean and sum in individual columns? Something like the output shown below.
       mean  sum
time1     0    0
time2   0.5    1
time3   0.5    1
Ideally I would like to do this for hundreds of columns in my dataframe and evaluate each one's relationship with outcome.
So can somebody please guide me to a solution that works for hundreds of columns (without typing their names individually) and produces a clean dataframe like the example output shown above? Much appreciated!
Upvotes: 2
Views: 219
Reputation: 35646
As mentioned by @sammywemmy, we can unstack after calculating the values. We can also use loc instead of reset_index to select outcome==1 from the index. Since no column subset is selected before the groupby, the aggregation runs over every remaining column automatically, so this scales to hundreds of columns without typing their names:
df1 = (
    df1.groupby("outcome")
    .agg(["mean", "sum"])  # perform aggregations
    .loc[1]                # select outcome==1 from the index
    .unstack()             # convert the inner index level to columns
)
We could also filter before the groupby agg, then stack and droplevel:
df1 = (
    df1[df1["outcome"] == 1]  # filter DataFrame for outcome==1
    .groupby("outcome")       # groupby
    .agg(["mean", "sum"])     # perform aggregations
    .stack(0)                 # convert column names to rows
    .droplevel(0)             # drop the outcome level from the index
)
Or set_index + stack first, then groupby agg on the index:
df1 = (
    df1.set_index("outcome").stack()  # convert time columns to rows
    .groupby(level=[0, 1])            # groupby outcome and column name
    .agg(["mean", "sum"])             # perform aggregations
    .loc[1]                           # select outcome==1 from the index
)
Or with pivot_table and multiple aggregation functions:
df1 = (
    df1.pivot_table(index="outcome", aggfunc=["mean", "sum"])
    .loc[1]      # select outcome==1 from the index
    .unstack(0)  # convert the inner index level to columns
)
All produce:
       mean  sum
time1   0.0  0.0
time2   0.5  1.0
time3   0.5  1.0
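For completeness, here is the first variant run end-to-end on the sample frame from the question (a minimal sketch; `summary` is just a name chosen here to avoid overwriting the input):

```python
import pandas as pd

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})

summary = (
    df1.groupby("outcome")  # one row per outcome value
    .agg(["mean", "sum"])   # mean and sum of every other column
    .loc[1]                 # keep only the outcome==1 row
    .unstack()              # move mean/sum out to columns
)
print(summary)
```

Because `groupby` aggregates every non-grouping column by default, the same four lines work unchanged on a frame with hundreds of `time*` columns.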
Upvotes: 2