Ricardo Vilaça

Reputation: 996

Groupby based on selected condition

I have the following table:

+----+-----+--------+------------+
| ID | Age | Amount |    Date    |
+----+-----+--------+------------+
|  0 |  50 |     57 | 01/01/2001 |
|  0 |  63 |    206 | 02/01/2001 |
|  2 |  32 |     27 | 03/01/2001 |
|  0 |  61 |    613 | 04/01/2001 |
|  1 |  30 |    929 | 05/01/2001 |
|  0 |  62 |     59 | 06/01/2001 |
|  1 |  32 |    430 | 07/01/2001 |
|  1 |  24 |    561 | 08/01/2001 |
+----+-----+--------+------------+

How can I perform a groupby operation that only runs on the rows matching a certain condition, leaving the other rows unaltered?

For example, my condition is ID == 1, and I want to perform the following operations:

Age - get the mean value

Amount - get the sum

Date - get the last date

Expected result:

+----+------+--------+------------+
| ID | Age  | Amount |    Date    |
+----+------+--------+------------+
|  0 | 50   |     57 | 01/01/2001 |
|  0 | 63   |    206 | 02/01/2001 |
|  2 | 32   |     27 | 03/01/2001 |
|  0 | 61   |    613 | 04/01/2001 |
|  1 | 28.7 |   1920 | 08/01/2001 |
|  0 | 62   |     59 | 06/01/2001 |
+----+------+--------+------------+

Upvotes: 0

Views: 82

Answers (3)

Georgina Skibinski

Reputation: 13377

Try:

df=(
    df.drop(df.loc[df["ID"]==1].index, axis=0)
    .append(
       df.loc[df["ID"]==1].groupby("ID", as_index=False)
       .agg({"Age": "mean", "Amount": "sum", "Date": "last"})
    , ignore_index=True)
    )

Outputs:

   ID        Age  Amount        Date
0   0  50.000000      57  01/01/2001
1   0  63.000000     206  02/01/2001
2   2  32.000000      27  03/01/2001
3   0  61.000000     613  04/01/2001
4   0  62.000000      59  06/01/2001
5   1  28.666667    1920  08/01/2001
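Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same drop-and-recombine step has to go through `pd.concat`. A self-contained sketch of that variant (sample frame rebuilt from the question, aggregating Age with "mean" to match the question's expected output):

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "ID":     [0, 0, 2, 0, 1, 0, 1, 1],
    "Age":    [50, 63, 32, 61, 30, 62, 32, 24],
    "Amount": [57, 206, 27, 613, 929, 59, 430, 561],
    "Date":   ["01/01/2001", "02/01/2001", "03/01/2001", "04/01/2001",
               "05/01/2001", "06/01/2001", "07/01/2001", "08/01/2001"],
})

mask = df["ID"] == 1
# Aggregate only the matching rows
aggregated = (df.loc[mask]
              .groupby("ID", as_index=False)
              .agg({"Age": "mean", "Amount": "sum", "Date": "last"}))
# Recombine: untouched rows first, aggregated row appended at the end
df = pd.concat([df.loc[~mask], aggregated], ignore_index=True)
print(df)
```

As with the `.append` version, the aggregated ID == 1 row ends up at the bottom of the frame rather than at its original position.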

Upvotes: 1

sammywemmy

Reputation: 28659

This pulls out the conditional rows (ID == 1), runs a groupby on them, and stores the output in a conditional_groupby variable.
It then finds the indices of those rows in the original dataframe, keeps the first, drops the rest, and writes the groupby result back at the saved index.

conditional_groupby = (df.copy()
                    .query('ID==1')
                    .groupby('ID').agg(Age=('Age','mean'),
                                       Amount=('Amount','sum'),
                                       Date = ('Date','last'))
                    .reset_index()
                    )

drop_them = df.loc[df.ID==1].index[1:]
insert_point = df.loc[df.ID==1].index[0]
df = df.drop(drop_them)

df.loc[insert_point] = conditional_groupby.iloc[0]
df['Age'] = df.Age.round(2)
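For reference, a self-contained run of this approach (a sketch, with the sample frame rebuilt from the question) shows its distinguishing property: the aggregated row stays at its original position, index 4, instead of moving to the end:

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "ID":     [0, 0, 2, 0, 1, 0, 1, 1],
    "Age":    [50, 63, 32, 61, 30, 62, 32, 24],
    "Amount": [57, 206, 27, 613, 929, 59, 430, 561],
    "Date":   ["01/01/2001", "02/01/2001", "03/01/2001", "04/01/2001",
               "05/01/2001", "06/01/2001", "07/01/2001", "08/01/2001"],
})
# Cast Age to float up front so the mean fits the column dtype
df["Age"] = df["Age"].astype(float)

conditional_groupby = (df.query('ID==1')
                       .groupby('ID')
                       .agg(Age=('Age', 'mean'),
                            Amount=('Amount', 'sum'),
                            Date=('Date', 'last'))
                       .reset_index())

drop_them = df.loc[df.ID == 1].index[1:]    # indices 6 and 7
insert_point = df.loc[df.ID == 1].index[0]  # index 4
df = df.drop(drop_them)

# Write the aggregated row back at the first matching index
df.loc[insert_point] = conditional_groupby.iloc[0]
df['Age'] = df.Age.round(2)
print(df)
```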

Upvotes: 1

Sayandip Dutta

Reputation: 15872

You can try this:

>>> df.loc[df['ID'] == 2, 'Value'] = sum(df.loc[df['ID'] == 2, 'Value'])
>>> df.drop_duplicates()
   ID  Value
0   2     20
1   0      9
2   1      0
4   0     10

EDIT:

  1. If you want to sum multiple columns:
>>> df
   ID  Value  col3
0   2      7     8
1   0      9    10
2   1      0     1
3   2      4     5
4   0     10    11
5   2      9    10
>>> df['col3'] = df['Value'] + 1

>>> df.loc[df['ID'] == 2, ['Value', 'col3']] = (df.loc[df['ID'] == 2, ['Value', 'col3']]
                                                .sum().values)
# if you have many cols, then instead of specifying the cols as list, use this:
# df.loc[df['ID'] == 2, df.columns[1:]] = (df.loc[df['ID'] == 2, df.columns[1:]]
#                                                .sum().values)
>>> df.drop_duplicates()
   ID  Value  col3
0   2     20    23
1   0      9    10
2   1      0     1
4   0     10    11
  2. If you want to sum only Value but keep the other columns:
>>> df
   ID  Value  col3
0   2      7     8
1   0      9    10
2   1      0     1
3   2      4     5
4   0     10    11
5   2      9    10

>>> df.loc[df['ID'] == 2, 'Value'] = sum(df.loc[df['ID'] == 2, 'Value'])
>>> df.drop_duplicates(subset = ['ID', 'Value'])
   ID  Value  col3
0   2     20     8
1   0      9    10
2   1      0     1
4   0     10    11

EDIT 2:

Based on your latest edit, pd.DataFrame.agg() might be best suited for you:

>>> df
+----+-----+--------+------------+
| ID | Age | Amount |    Date    |
+----+-----+--------+------------+
|  0 |  50 |     57 | 01/01/2001 |
|  0 |  63 |    206 | 02/01/2001 |
|  2 |  32 |     27 | 03/01/2001 |
|  0 |  61 |    613 | 04/01/2001 |
|  1 |  30 |    929 | 05/01/2001 |
|  0 |  62 |     59 | 06/01/2001 |
|  1 |  32 |    430 | 07/01/2001 |
|  1 |  24 |    561 | 08/01/2001 |
+----+-----+--------+------------+
>>> df2 = df[df['ID'] == 1]
>>> df.loc[df['ID'] == 1, df.columns[1:]] = (df2.agg({'Age':['mean'], 
                                                      'Amount':['sum'], 
                                                      'Date':[lambda x:x.iloc[-1]]
                                                      })
                                                .bfill()
                                                .ffill()
                                                .reset_index(drop=True)
                                                .values)
>>> df.drop_duplicates()

   ID        Age  Amount        Date
0   0  50.000000    57.0  01/01/2001
1   0  63.000000   206.0  02/01/2001
2   2  32.000000    27.0  03/01/2001
3   0  61.000000   613.0  04/01/2001
4   1  28.666667  1920.0  08/01/2001
5   0  62.000000    59.0  06/01/2001
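The bfill/ffill pair is needed here because `agg` with a dict of lists returns one row per function with NaNs elsewhere. A more direct variant of the same masked-assignment idea (a sketch, rebuilding the sample frame from the question) computes the three scalars explicitly and assigns them column by column, exactly as in the first snippet above:

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "ID":     [0, 0, 2, 0, 1, 0, 1, 1],
    "Age":    [50, 63, 32, 61, 30, 62, 32, 24],
    "Amount": [57, 206, 27, 613, 929, 59, 430, 561],
    "Date":   ["01/01/2001", "02/01/2001", "03/01/2001", "04/01/2001",
               "05/01/2001", "06/01/2001", "07/01/2001", "08/01/2001"],
})
# Cast Age to float up front so the mean fits the column dtype
df["Age"] = df["Age"].astype(float)

mask = df["ID"] == 1
sub = df.loc[mask]
# Overwrite every matching row with the aggregated scalars,
# then collapse the now-identical rows
df.loc[mask, "Age"] = sub["Age"].mean()
df.loc[mask, "Amount"] = sub["Amount"].sum()
df.loc[mask, "Date"] = sub["Date"].iloc[-1]
result = df.drop_duplicates()
print(result)
```

Like the answer above, `drop_duplicates` keeps the first of the identical ID == 1 rows, so the aggregated row stays at its original position (index 4).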

Upvotes: 2
