Naveen Rishishwar
Naveen Rishishwar

Reputation: 36

pandas groupby with exclusion condition inside a group

input DataFrame -- >>

enter image description here

output-->>

enter image description here

this is how I want to calculate the sum. Values in Columns A and B can vary.

enter image description here

I want to aggregate this DataFrame in such a way that for the common value of column A, I have to ignore the only current value of but have to consider all value of B and have to take the sum of column c with respect to the previous condition.

for example-- 1-- for A=a and B= 1, I have to ignore the value of column C for B=1 but I have to add the value of Column C where the value of columns A ==a and b!=1, and sum should be 5 (2+3)

2-- for group A=a and B=2, I have to ignore B=2 but have to consider where A=a and B!=2 so for A=a and B!=2 sum will be 4 (1 +3 ).

3--- for group A=a, B=3, I have to ignore B=3 but have to consider where A=a and B!=3 so for A=a and B!=3 sum will be 3 (1+2).

I have to this thing for the value of a million A, One A can have any number of B value.

Everything should be dynamic.

Thanks :)

Upvotes: 0

Views: 1098

Answers (3)

madamak
madamak

Reputation: 18

I just had a similar issue. Maybe you have solved it now, but here is what I did. I would use a function that computes this special sum that you describe.

def exclusion_sum(row, df):
    exclusion_mask = (df['A'] == row['A']) & (df['B'] != row['B'])
    return df[exclusion_mask]['C'].sum() + row['B']

df['sum'] = df.apply(lambda x: exclusion_sum(x, df), axis=1)

Upvotes: 0

Joe Ferndz
Joe Ferndz

Reputation: 8508

Here's something you can try.

First I am grouping the data based on column A into a new dataframe dfsum

Then I am transforming column C to be the sum. Then subtracting the sum from original column C to get the desired value.

import pandas as pd
df = pd.DataFrame({'A': list('aaabbb'), 'B': [1,2,3,1,2,3], 'C': [1,2,3,4,5,6]})

dfsum = df.groupby(['A'])

n = dfsum['C'].transform('sum')

df['sum'] = (n - df['C'])

print (df)

Output is as follows:

   A  B  C  sum
0  a  1  1    5
1  a  2  2    4
2  a  3  3    3
3  b  1  4   11
4  b  2  5   10
5  b  3  6    9

Upvotes: 0

YOLO
YOLO

Reputation: 21739

You can do using a simple iteration on the dataframe rows:

# get rows except the current show
ss['sum'] = [ss.iloc[ss.index.difference([x]),1].sum() for x in range(ss.shape[0])]

print(ss)

   A  B  sum
0  a  1    5
1  a  2    4
2  a  3    3

Sample Data

ss = pd.DataFrame({'A': list('aaa'), 'B': [1,2,3]})

Upvotes: 1

Related Questions