Reputation: 485
Lets assume the dataframe:
df=
col_1 col_2 col3
id1 val1 1
id2 val1 4
id3 val1 5
id4 val1 7
id1 val2 3
id2 val2 2
id3 val2 8
id4 val2 4
for each row, i need to add a column with the sum of col3
from all other rows where
col2 == row[col2] AND col1 != row[col1]
so i should get:
df=
col_1 col_2 col3 col4
id1 val1 1 16
id2 val1 4 13
id3 val1 5 12
id4 val1 7 10
id1 val2 3 14
id2 val2 2 15
id3 val2 8 9
id4 val2 4 13
I did it using apply and something like
def getVal(row, df):
return df[(df[col1] != row[col1]) & (df[col2] == row[col2])][col3].sum()
df[col4] = df.apply(lambda x: getVal(x, df), axis = 1)
but since my df
has >1000000
rows, it takes forever!!!
Is there a better/faster way?
Thank you in advance
Upvotes: 1
Views: 97
Reputation: 153460
Use groupby
with transform
for this one-liner no merges:
df['col4'] = df.groupby('col_2')['col3'].transform('sum') - df['col3']
Output:
col_1 col_2 col3 col4
0 id1 val1 1 16
1 id2 val1 4 13
2 id3 val1 5 12
3 id4 val1 7 10
4 id1 val2 3 14
5 id2 val2 2 15
6 id3 val2 8 9
7 id4 val2 4 13
Upvotes: 2