I have a dataframe with 2 columns: class (0/1) and time (integer). I need to append a third column which will be the remaining time to get a class 1 row. df = pd.DataFrame([ [1,101], [1,104], [0,107], [0,110], [0,123], [1,156], [0,167]], columns=['class', 'time']) If a row has class 0; diff should be 0. If a row has class 1; diff should be the difference between its time and time of the first upcoming row that has class 0. I can calculate it in a Lambda function: df['diff'] = df.apply(lambda x: df[ (df['time'] >= x[1]) & (df['class']==0)]['time'].iloc[0] - x[1], axis=1) The expression df[ (df['time'] >= x[1]) & (df['class']==0)] is run for every row to get the next row with class 0. I believe it is not efficient for big dataframes. What would be a more efficient way to calculate this?

Reputation: 3905

Efficient way of row-based calculation in Pandas

I have a dataframe with 2 columns: class (0/1) and time (integer). I need to append a third column which will be the remaining time to get a class 1 row.

df = pd.DataFrame([
    [1,101], [1,104],
    [0,107], [0,110], [0,123],
    [1,156],
    [0,167]],
  columns=['class', 'time'])

If a row has class 0; diff should be 0.
If a row has class 1; diff should be the difference between its time and time of the first upcoming row that has class 0.

I can calculate it in a Lambda function:

df['diff'] = df.apply(lambda x: df[ (df['time'] >= x[1]) & (df['class']==0)]['time'].iloc[0] - x[1], axis=1)

The expression df[ (df['time'] >= x[1]) & (df['class']==0)] is run for every row to get the next row with class 0. I believe it is not efficient for big dataframes.

What would be a more efficient way to calculate this?

Upvotes: 2

Answers (2)

ALollz

Reputation: 59549

Not too much different from @Datanovice.

Use where to NaN the time for df['class'] == 1, then bfill to get the first df['class'] == 0 value. This Series gets the correct 'time' to subtract regardless of class so we can do a normal subtraction.

df['Diff'] = df['time'].where(df['class'].eq(0)).bfill() - df['time']

   class  time  Diff
0      1   101   6.0
1      1   104   3.0
2      0   107   0.0
3      0   110   0.0
4      0   123   0.0
5      1   156  11.0
6      0   167   0.0

The Series created in the first step is:

df['time'].where(df['class'].eq(0)).bfill()
#0    107.0
#1    107.0
#2    107.0
#3    110.0
#4    123.0
#5    167.0
#6    167.0
#Name: time, dtype: float64

Upvotes: 4

Umar.H

Reputation: 23099

IIUC, you can chain boolean expressions to vectorise this.

First we find the first 0 after each group of 1s

t = df[df['class'].ne(df['class'].shift()) & df['class'].eq(0)]['time']

print(t)
2    107
6    167
Name: time, dtype: int64

#then we assign a column and back fill it.

df = df.assign(Diff = t).bfill()

and finally the conditional to do our sum and calculate the 0 Diff values.

df['Diff'] = np.where(df['class'].eq(1),df['Diff'] - df['time'],0)


print(df)

   class  time  Diff
0      1   101   6.0
1      1   104   3.0
2      0   107   0.0
3      0   110   0.0
4      0   123   0.0
5      1   156  11.0
6      0   167   0.0

Upvotes: 2

Efficient way of row-based calculation in Pandas

Answers (2)

Related Questions