Reputation: 23
I'm trying to see if there is a better and more efficient way of doing this:
Sample Data:
df = pd.DataFrame({'ID': ['A','A','A','A','B','B','B','B'],
                   'Month': [-4,-3,1,2,-3,-2,1,2],
                   'Cost': [20,30,10,15,1,2,5,10]})
Then, I group by ID:
df = df.groupby('ID')
Then, I aggregate the cost with a condition of < 0 and > 0 in a for loop and save the output to a new DataFrame:
output = pd.DataFrame([])
for group, data in df:
    totalPre = 0
    totalPost = 0
    for row_index, row in data.iterrows():
        if row['Month'] < 0:
            totalPre = totalPre + row['Cost']
        elif row['Month'] > 0:
            totalPost = totalPost + row['Cost']
    output = output.append(pd.DataFrame({'ID': group, 'Total pre': totalPre, 'Total post': totalPost}, index=[0]), ignore_index=True)
And here is the output:
  ID  Total pre  Total post
0  A         50          25
1  B          3          15
Thanks.
Upvotes: 2
Views: 217
Reputation: 547
For loops are rarely the only, or the best, solution in pandas. I would probably create a new column for the pre/post condition and then group by ID and the new column. groupby creates a group for every unique combination of values in the columns that you specify and then aggregates the values with a function.
import pandas as pd

# sample DataFrame
df = pd.DataFrame({'ID': ['A','A','A','A','B','B','B','B'],
                   'Month': [-4,-3,1,2,-3,-2,1,2],
                   'Cost': [20,30,10,15,1,2,5,10]})

# Create a new column `Timepoint` to group by
df['Timepoint'] = (df['Month'] <= 0).replace({True: 'pre', False: 'post'})

# Group each unique combination of `ID` and `Timepoint`, aggregate `Cost`
# with `sum`, and unstack so that `pre`/`post` become the columns
output = df.groupby(['ID', 'Timepoint'])['Cost'].sum().unstack()
My output:
Timepoint  post  pre
ID
A            25   50
B            15    3
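If you want the same column layout as in the question, the result above can be flattened and renamed (a minimal sketch building on output; the names 'Total pre' / 'Total post' are just chosen to mirror the question):
# Turn the ID index back into a column and rename the pre/post columns
table = (output
         .rename(columns={'pre': 'Total pre', 'post': 'Total post'})
         .rename_axis(None, axis=1)   # drop the 'Timepoint' columns name
         .reset_index())
print(table)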
Upvotes: 0
Reputation: 13437
You can do it in several ways.
One way is to filter before the groupby:
df1 = df[df["Month"] < 0].groupby("ID")["Cost"].sum()\
                         .reset_index(name="Total_pre")
df2 = df[df["Month"] > 0].groupby("ID")["Cost"].sum()\
                         .reset_index(name="Total_post")
out = pd.merge(df1, df2, on="ID", how="outer")
Another is to group by ID and the condition and then use pd.pivot_table:
out = df.groupby(["ID", df["Month"] < 0])["Cost"].sum()\
        .reset_index()
out = pd.pivot_table(out,
                     index="ID",
                     columns="Month",
                     values="Cost")\
        .reset_index()\
        .rename(columns={False: "Total_post",
                         True: "Total_pre"})
out.columns.name = None
EDIT
In the last case, if you need to exclude rows where Month is 0, you can apply the filter df["Month"]!=0 before the groupby.
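For example (a minimal sketch; it only adds that filter in front of the grouping shown above):
# Drop rows where Month == 0, then group and pivot exactly as before
filtered = df[df["Month"] != 0]
out = filtered.groupby(["ID", filtered["Month"] < 0])["Cost"].sum()\
              .reset_index()
# ... then build the pivot_table from `out` as in the previous snippet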
Upvotes: 0
Reputation: 18367
I believe this is a good and simple alternative!
df_1 = pd.DataFrame([])
df_1 = df_1.assign(totalPre=df[df['Month'] < 0].groupby('ID')['Cost'].sum(),
                   totalPost=df[df['Month'] > 0].groupby('ID')['Cost'].sum())
print(df_1)
Output:
    totalPre  totalPost
ID
A         50         25
B          3         15
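One caveat worth noting (an observation, not part of the original answer): the two grouped sums are aligned on the ID index, so with other data an ID that has costs only before or only after month 0 could end up with a NaN or be dropped. A sketch that keeps every ID and fills the gaps with 0:
pre = df[df['Month'] < 0].groupby('ID')['Cost'].sum()
post = df[df['Month'] > 0].groupby('ID')['Cost'].sum()
# Building the frame from both Series aligns them on the union of IDs;
# missing sums become NaN, which fillna(0) turns into 0
df_1 = pd.DataFrame({'totalPre': pre, 'totalPost': post}).fillna(0)
print(df_1)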
Upvotes: 2
Reputation: 42896
Using mask with GroupBy.sum:
# Sum Cost per ID over the rows where Month < 0 (rows with Month > 0 are masked to NaN)
grp = df.mask(df['Month'] > 0).groupby('ID', as_index=False)['Cost'].sum().rename(columns={'Cost': 'Total pre'})
# Same idea for Month > 0; .to_numpy() drops the index so the values are assigned positionally
grp['Total post'] = df.mask(df['Month'] < 0).groupby('ID')['Cost'].sum().to_numpy()
Output
  ID  Total pre  Total post
0  A       50.0        25.0
1  B        3.0        15.0
Details
mask sets the rows which match the condition (Month > 0) to NaN, so this way groupby.sum only picks up the remaining rows:
df.mask(df['Month']>0)
    ID  Month  Cost
0    A   -4.0  20.0
1    A   -3.0  30.0
2  NaN    NaN   NaN
3  NaN    NaN   NaN
4    B   -3.0   1.0
5    B   -2.0   2.0
6  NaN    NaN   NaN
7  NaN    NaN   NaN
Upvotes: 1
Reputation: 43504
One way is to filter out the Month==0 rows¹ and then group by ID and the condition that the month is less than 0.
output = df[df["Month"]!=0].groupby(["ID", df["Month"]<0])["Cost"].sum()\
                           .unstack().reset_index().rename_axis(None, axis=1)\
                           .rename(columns={True: "Total pre", False: "Total post"})
print(output)
# ID Total post Total pre
#0 A 25 50
#1 B 15 3
¹ Since you seem to indicate that your condition should be strictly greater than / less than.
Upvotes: 0