Reputation: 101
I have a DataFrame like the one below:
Time col1 col2 col3
2 a x 10
3 b y 11
1 a x 10
6 c z 12
20 c x 13
23 a y 24
14 c x 13
16 b y 11
...
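I want to add a column whose value for each row is computed from other rows of the DataFrame: the number of rows with the same col1, col2 and col3 whose Time is within 10 of that row's Time. This is the desired output: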
Time col1 col2 col3 cumVal
2 a x 10 2
3 b y 11 1
1 a x 10 2
6 c z 12 1
20 c x 13 2
23 a y 24 1
14 c x 13 2
16 b y 11 1
...
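Here is my attempt:
df['cumVal'] = 0
for index, row in df.iterrows():
    min1 = row['Time'] - 10
    max1 = row['Time'] + 10
    # rows that share col1, col2 and col3 with the current row
    ndf = df[(df.col1 == row.col1) & (df.col2 == row.col2) & (df.col3 == row.col3)]
    # df.loc writes into df itself; df.iloc[index]['cumVal'] = ... would assign to a temporary copy
    df.loc[index, 'cumVal'] = len(ndf.query('@min1 <= Time <= @max1'))
But it is very slow. Could anybody change my code to make it faster?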
Upvotes: 0
Views: 57
Reputation: 29635
You can use groupby on 'col1', 'col2' and 'col3', and in the transform per group, use np.subtract.outer (the outer method of the np.subtract ufunc) to calculate all pairwise differences between the values in the 'Time' column of that group. Then, by checking where np.abs of those differences is less than or equal to 10 and applying np.sum on axis=0, you can count how many values are within +/- 10 of each value.
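To make the outer-difference step concrete, here is a minimal sketch on a single group. The three Time values are made up for illustration and are not taken from the question's data.

```python
import numpy as np

# Hypothetical 'Time' values of one group (illustrative only)
times = np.array([3, 16, 9])

# diffs[i, j] = times[i] - times[j], i.e. every pairwise difference
diffs = np.subtract.outer(times, times)

# True where two values are at most 10 apart (each value matches itself)
within = np.abs(diffs) <= 10

# Summing along axis 0 counts, for each value, how many values are in range
counts = within.sum(axis=0)
print(counts)  # [2 2 3]
```

Here 3 and 16 are 13 apart, so each of them only counts itself and 9, while 9 is within 10 of all three values.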
import numpy as np
df['cumVal'] = (df.groupby(['col1', 'col2', 'col3'])['Time']
                  .transform(lambda x: (np.abs(np.subtract.outer(x, x)) <= 10).sum(0)))
print(df)
Time col1 col2 col3 cumVal
0 2.0 a x 10.0 2.0
1 3.0 b y 11.0 1.0
2 1.0 a x 10.0 2.0
3 6.0 c z 12.0 1.0
4 20.0 c x 13.0 2.0
5 23.0 a y 24.0 1.0
6 14.0 c x 13.0 2.0
7 16.0 b y 11.0 1.0
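One caveat with the outer difference is that it builds an n x n matrix per group, which can get memory-hungry for large groups. A possible alternative, sketched here as a different technique rather than as part of the answer above, is to sort each group's times and use np.searchsorted to find the window bounds. The helper name count_within and its window parameter are hypothetical; the data is the sample from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Time": [2, 3, 1, 6, 20, 23, 14, 16],
    "col1": list("abaccacb"),
    "col2": list("xyxzxyxy"),
    "col3": [10, 11, 10, 12, 13, 24, 13, 11],
})

def count_within(times, window=10):
    """Hypothetical helper: count values within +/- window (inclusive) of each time."""
    values = times.to_numpy()
    ordered = np.sort(values)
    # Index of the first sorted value >= t - window
    lo = np.searchsorted(ordered, values - window, side="left")
    # Index just past the last sorted value <= t + window
    hi = np.searchsorted(ordered, values + window, side="right")
    return pd.Series(hi - lo, index=times.index)

df["cumVal"] = (df.groupby(["col1", "col2", "col3"])["Time"]
                  .transform(count_within))
print(df)
```

This needs O(n log n) time and O(n) memory per group instead of O(n^2), and on the sample data it gives the same cumVal column as above.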
Upvotes: 1
Reputation: 2032
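This should give better performance, since all the filtering is done in a single boolean mask instead of a separate query call:
df['cumVal'] = 0
for index, row in df.iterrows():
    min1 = row['Time'] - 10
    max1 = row['Time'] + 10
    # inclusive bounds, matching the '@min1 <= Time <= @max1' condition in the question
    ndf = df[(df.Time >= min1) & (df.Time <= max1) & (df.col1 == row.col1)
             & (df.col2 == row.col2) & (df.col3 == row.col3)]
    # df.loc writes into df itself rather than into a temporary copy
    df.loc[index, 'cumVal'] = len(ndf)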
Upvotes: 0