user3176500

Reputation: 389

Pandas grouping and summing just a certain column

Below is a minimal example of the problem I am facing. Let the initial state be the following (I only use a list of dictionaries for the purpose of demonstration):

import pandas as pd
A = [{'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 2}, {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 4}, {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7}]
df = pd.DataFrame(A)
>>> df
   A    B  C          D
0  1  0.0  2  16.5.2013
1  1  0.0  4  16.5.2013
2  1  0.5  7  16.5.2013

How do I get from df to df_new which is:

A_new = [{'D': '16.5.2013', 'A':1, 'B': 0.0, 'C': 6}, {'D': '16.5.2013', 'A':1, 'B': 0.5, 'C': 7}]
df_new = pd.DataFrame(A_new)

>>> df_new
   A    B  C          D
0  1  0.0  6  16.5.2013
1  1  0.5  7  16.5.2013

The first and second rows of column 'C' are summed because 'B' is the same for these two rows. Everything else is left as it is: for instance, column 'A' is not summed and column 'D' is unchanged. How do I do that, assuming I only have df and want to get df_new? I would really like to find an elegant solution if possible.

Thanks in advance.

Upvotes: 1

Views: 3160

Answers (2)

Woody Pride

Reputation: 13955

If A and D are always equal within each group of B, then you can just group by A, B, and D, and sum C:

df.groupby(['A', 'B', 'D'], as_index=False).agg('sum')

Output:

   A    B          D  C
0  1  0.0  16.5.2013  6
1  1  0.5  16.5.2013  7

Alternatively:

You essentially want to aggregate the data grouped by column 'B'. To aggregate column 'C', just use a sum. For the other columns, you only want to pick a single value, since you believe they are always the same within each group. To do that, write a very simple function that aggregates those columns by taking the first value.

# takes the first value of each group
sole_value = lambda x: x.iloc[0]

# dictionary that maps columns to aggregation functions
agg_funcs = {'A': sole_value, 'C': 'sum', 'D': sole_value}

# group and aggregate
df.groupby('B', as_index=False).agg(agg_funcs)

Output:

     B  A  C          D
0  0.0  1  6  16.5.2013
1  0.5  1  7  16.5.2013

Of course, you really need to be sure that the values in columns A and D are identical within each group; otherwise you might be preserving the wrong data.
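One way to check that assumption before aggregating is to count the distinct values of A and D inside each B group. This is only a sketch on the question's sample data; the names distinct_counts, is_constant and offending_groups are made up for illustration. Note that nunique() ignores NaN by default, so missing values would not be flagged here.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 2},
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 4},
    {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7},
])

# Number of distinct values of A and D within each B group.
distinct_counts = df.groupby('B')[['A', 'D']].nunique()

# True only if every group has exactly one distinct value in each column.
is_constant = bool((distinct_counts == 1).all().all())

# Groups where A or D varies; taking the first value there would lose data.
offending_groups = distinct_counts[(distinct_counts > 1).any(axis=1)]
```

If is_constant is False, offending_groups shows which values of B need a closer look before taking the first value.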

Upvotes: 0

joris

Reputation: 139142

This assumes the other columns are always the same within each group and do not need any special treatment.

First, create df_new grouped by B, taking the first row of each group for every column:

In [17]: df_new = df.groupby('B', as_index=False).first()

Then calculate the C column separately as the sum for each group:

In [18]: df_new['C'] = df.groupby('B', as_index=False)['C'].sum()['C']

In [19]: df_new
Out[19]: 
     B  A  C          D
0  0.0  1  6  16.5.2013
1  0.5  1  7  16.5.2013

If you have a limited number of columns, you can also do this in one step by specifying the desired function for each column (the approach above is handier, because it is less manual, if you have more columns):

In [20]: df_new = df.groupby('B', as_index=False).agg({'A':'first', 'C':'sum', 'D':'first'})
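If you have many columns, the dictionary for the one-step version does not have to be typed by hand. Here is a rough sketch that builds it from df.columns; group_col and sum_cols are illustrative names of my own choosing, and it assumes every column not listed in sum_cols is constant within each group.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 2},
    {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 4},
    {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7},
])

group_col = 'B'    # column to group by (illustrative name)
sum_cols = ['C']   # columns to sum (illustrative name)

# Default every non-grouping column to 'first' ...
agg_funcs = {col: 'first' for col in df.columns if col != group_col}
# ... then override the columns that should be summed.
agg_funcs.update({col: 'sum' for col in sum_cols})

df_new = df.groupby(group_col, as_index=False).agg(agg_funcs)
```

Keep in mind that 'first' picks the first non-null value in each group, so a column with missing values may not literally give you the first row.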

Upvotes: 2
