Reputation: 996
I have the following table
+----+-----+--------+------------+
| ID | Age | Amount | Date |
+----+-----+--------+------------+
| 0 | 50 | 57 | 01/01/2001 |
| 0 | 63 | 206 | 02/01/2001 |
| 2 | 32 | 27 | 03/01/2001 |
| 0 | 61 | 613 | 04/01/2001 |
| 1 | 30 | 929 | 05/01/2001 |
| 0 | 62 | 59 | 06/01/2001 |
| 1 | 32 | 430 | 07/01/2001 |
| 1 | 24 | 561 | 08/01/2001 |
+----+-----+--------+------------+
How can I perform a groupby operation that runs only on the rows matching a certain condition, leaving the other rows unaltered?
For example, my condition is ID == 1, and I want to perform the following operations:
Age - get the mean value
Amount - get the sum
Date - get the last date
Expected result:
+----+------+--------+------------+
| ID | Age | Amount | Date |
+----+------+--------+------------+
| 0 | 50 | 57 | 01/01/2001 |
| 0 | 63 | 206 | 02/01/2001 |
| 2 | 32 | 27 | 03/01/2001 |
| 0 | 61 | 613 | 04/01/2001 |
| 1 | 28.7 | 1920 | 08/01/2001 |
| 0 | 62 | 59 | 06/01/2001 |
+----+------+--------+------------+
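For context, the general pattern the answers below build on is: aggregate only the matching rows, then stitch them back onto the untouched rows. A minimal sketch (assuming pandas is imported as pd and the sample frame is rebuilt from the table above; note the combined row lands at the end rather than in its original position):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    "ID":     [0, 0, 2, 0, 1, 0, 1, 1],
    "Age":    [50, 63, 32, 61, 30, 62, 32, 24],
    "Amount": [57, 206, 27, 613, 929, 59, 430, 561],
    "Date":   ["01/01/2001", "02/01/2001", "03/01/2001", "04/01/2001",
               "05/01/2001", "06/01/2001", "07/01/2001", "08/01/2001"],
})

mask = df["ID"] == 1

# Aggregate only the matching rows: mean Age, summed Amount, last Date
aggregated = (df[mask]
              .groupby("ID", as_index=False)
              .agg({"Age": "mean", "Amount": "sum", "Date": "last"}))

# Concatenate the untouched rows with the single aggregated row
out = pd.concat([df[~mask], aggregated], ignore_index=True)
```

The aggregated ID == 1 row ends up with Age ≈ 28.67, Amount 1920, and Date 08/01/2001, appended after the unmodified rows.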
Upvotes: 0
Views: 82
Reputation: 13377
Try:
df = pd.concat(
    [
        # keep the rows that do not match the condition
        df.drop(df.loc[df["ID"] == 1].index),
        # aggregate the matching rows: mean Age, summed Amount, last Date
        df.loc[df["ID"] == 1]
          .groupby("ID", as_index=False)
          .agg({"Age": "mean", "Amount": "sum", "Date": "last"}),
    ],
    ignore_index=True,  # DataFrame.append was removed in pandas 2.0, so use pd.concat
)
Outputs:
ID Age Amount Date
0 0 50.000000 57 01/01/2001
1 0 63.000000 206 02/01/2001
2 2 32.000000 27 03/01/2001
3 0 61.000000 613 04/01/2001
4 0 62.000000 59 06/01/2001
5 1 28.666667 1920 08/01/2001
Upvotes: 1
Reputation: 28659
This pulls out the conditional rows (ID == 1), runs a groupby on them, and stores the output in a conditional_groupby variable.
It then finds the indices of those rows in the original dataframe, keeps the first, discards the rest, and writes the groupby result into the saved index, so the aggregated row keeps its original position.
conditional_groupby = (df.copy()
.query('ID==1')
                       .groupby('ID').agg(Age=('Age', 'mean'),
                                          Amount=('Amount', 'sum'),
                                          Date=('Date', 'last'))
.reset_index()
)
drop_them = df.loc[df.ID==1].index[1:]
insert_point = df.loc[df.ID==1].index[0]
df = df.drop(drop_them)
df.loc[insert_point] = conditional_groupby.iloc[0]
df['Age'] = df.Age.round(2)
Upvotes: 1
Reputation: 15872
You can try this:
>>> df.loc[df['ID'] == 2, 'Value'] = sum(df.loc[df['ID'] == 2, 'Value'])
>>> df.drop_duplicates()
ID Value
0 2 20
1 0 9
2 1 0
4 0 10
EDIT:
>>> df
ID Value col3
0 2 7 8
1 0 9 10
2 1 0 1
3 2 4 5
4 0 10 11
5 2 9 10
>>> df['col3'] = df['Value'] + 1
>>> df.loc[df['ID'] == 2, ['Value', 'col3']] = (df.loc[df['ID'] == 2, ['Value', 'col3']]
.sum().values)
# if you have many cols, then instead of specifying the cols as list, use this:
# df.loc[df['ID'] == 2, df.columns[1:]] = (df.loc[df['ID'] == 2, df.columns[1:]]
# .sum().values)
>>> df.drop_duplicates()
ID Value col3
0 2 20 23
1 0 9 10
2 1 0 1
4 0 10 11
To aggregate only Value but keep the other columns unchanged:
>>> df
ID Value col3
0 2 7 8
1 0 9 10
2 1 0 1
3 2 4 5
4 0 10 11
5 2 9 10
>>> df.loc[df['ID'] == 2, 'Value'] = sum(df.loc[df['ID'] == 2, 'Value'])
>>> df.drop_duplicates(subset = ['ID', 'Value'])
ID Value col3
0 2 20 8
1 0 9 10
2 1 0 1
4 0 10 11
EDIT 2:
Based on your latest edit, pd.DataFrame.agg() might be best suited for you:
>>> df
+----+-----+--------+------------+
| ID | Age | Amount | Date |
+----+-----+--------+------------+
| 0 | 50 | 57 | 01/01/2001 |
| 0 | 63 | 206 | 02/01/2001 |
| 2 | 32 | 27 | 03/01/2001 |
| 0 | 61 | 613 | 04/01/2001 |
| 1 | 30 | 929 | 05/01/2001 |
| 0 | 62 | 59 | 06/01/2001 |
| 1 | 32 | 430 | 07/01/2001 |
| 1 | 24 | 561 | 08/01/2001 |
+----+-----+--------+------------+
>>> df2 = df[df['ID'] == 1]
>>> df.loc[df['ID'] == 1, df.columns[1:]] = (df2.agg({'Age':['mean'],
'Amount':['sum'],
'Date':[lambda x:x.iloc[-1]]
})
.bfill()
.ffill()
.reset_index(drop=True)
.values)
>>> df.drop_duplicates()
ID Age Amount Date
0 0 50.000000 57.0 01/01/2001
1 0 63.000000 206.0 02/01/2001
2 2 32.000000 27.0 03/01/2001
3 0 61.000000 613.0 04/01/2001
4 1 28.666667 1920.0 08/01/2001
5 0 62.000000 59.0 06/01/2001
Upvotes: 2