Reputation: 3905
I have a dataframe with 2 columns: class (0/1) and time (integer). I need to append a third column which will be the remaining time to get a class 1 row.
df = pd.DataFrame([
[1,101], [1,104],
[0,107], [0,110], [0,123],
[1,156],
[0,167]],
columns=['class', 'time'])
diff
should be 0.diff
should be the difference between its time
and time
of the first upcoming row that has class 0.I can calculate it in a Lambda function:
df['diff'] = df.apply(lambda x: df[ (df['time'] >= x[1]) & (df['class']==0)]['time'].iloc[0] - x[1], axis=1)
The expression df[ (df['time'] >= x[1]) & (df['class']==0)]
is run for every row to get the next row with class 0. I believe it is not efficient for big dataframes.
What would be a more efficient way to calculate this?
Upvotes: 2
Views: 1227
Reputation: 59549
Not too much different from @Datanovice.
Use where
to NaN
the time for df['class'] == 1
, then bfill
to get the first df['class'] == 0
value. This Series gets the correct 'time' to subtract regardless of class so we can do a normal subtraction.
df['Diff'] = df['time'].where(df['class'].eq(0)).bfill() - df['time']
class time Diff
0 1 101 6.0
1 1 104 3.0
2 0 107 0.0
3 0 110 0.0
4 0 123 0.0
5 1 156 11.0
6 0 167 0.0
The Series created in the first step is:
df['time'].where(df['class'].eq(0)).bfill()
#0 107.0
#1 107.0
#2 107.0
#3 110.0
#4 123.0
#5 167.0
#6 167.0
#Name: time, dtype: float64
Upvotes: 4
Reputation: 23099
IIUC, you can chain boolean expressions to vectorise this.
First we find the first 0 after each group of 1
s
t = df[df['class'].ne(df['class'].shift()) & df['class'].eq(0)]['time']
print(t)
2 107
6 167
Name: time, dtype: int64
#then we assign a column and back fill it.
df = df.assign(Diff = t).bfill()
and finally the conditional to do our sum and calculate the 0 Diff
values.
df['Diff'] = np.where(df['class'].eq(1),df['Diff'] - df['time'],0)
print(df)
class time Diff
0 1 101 6.0
1 1 104 3.0
2 0 107 0.0
3 0 110 0.0
4 0 123 0.0
5 1 156 11.0
6 0 167 0.0
Upvotes: 2