Reputation: 15752
Let's assume we have a table like the one below:
A B
1 1.0
2 2.0
3 2.0
4 3.0
5 2.0
6 1.0
7 1.0
Now, for each row, I want the value of column A from the next row for which B <= 2.0. The result is stored in C, which gives:
A B C
1 1.0 2
2 2.0 3
3 2.0 5 # Here we skip a row because next.B > 2.0
4 3.0 5
5 2.0 6
6 1.0 7
7 1.0 NaN
Is there a way to implement this efficiently in Pandas (or Numpy)? The data frame may contain several million rows, and I hope the operation takes at most a few seconds.
If there is no fast Pandas/Numpy solution, I will just code it in Numba. However, for some reason, my Numba solutions in the past to similar problems (nopython & nested for & break) were rather slow, which is why I am asking for a better approach.
Context: Here I asked how I can get, for each row in a time series data frame, a value from the next row before a delay expires. This question is related, but it does not use time or a sorted column, so searchsorted cannot be used.
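For reference, the loop-based route does not need nested loops or a break: a single backward pass that remembers the most recent qualifying A is enough. Below is a minimal sketch in plain Python/NumPy. The function name next_valid_a and its threshold parameter are made up for illustration, and compiling it with numba.njit is an assumption that has not been benchmarked here.

```python
import numpy as np

def next_valid_a(a, b, threshold=2.0):
    """For each position i, return a[j] for the smallest j > i with b[j] <= threshold."""
    n = len(a)
    out = np.full(n, np.nan)  # float output so that "no match" can be NaN
    nxt = np.nan              # A of the nearest qualifying row seen so far
    for i in range(n - 1, -1, -1):
        out[i] = nxt          # only rows strictly after i have been seen
        if b[i] <= threshold:
            nxt = a[i]        # row i qualifies for all earlier rows
    return out

a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([1.0, 2.0, 2.0, 3.0, 2.0, 1.0, 1.0])
c = next_valid_a(a, b)
print(c)
```

Each row is visited exactly once, so the work grows linearly with the number of rows.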
Upvotes: 1
Views: 127
Reputation: 25259
You just need to slice df on B less than or equal to 2, then reindex, bfill, and shift:
df['C'] = df.loc[df.B.le(2), 'A'].reindex(df.index).bfill().shift(-1)
Out[599]:
A B C
0 1 1.0 2.0
1 2 2.0 3.0
2 3 2.0 5.0
3 4 3.0 5.0
4 5 2.0 6.0
5 6 1.0 7.0
6 7 1.0 NaN
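To see what each call in that one-liner contributes, here is the same chain split into steps on the sample data from the question. The intermediate names kept, aligned and filled are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7],
    "B": [1.0, 2.0, 2.0, 3.0, 2.0, 1.0, 1.0],
})

kept = df.loc[df["B"].le(2), "A"]  # A only where B <= 2 (the row with B == 3.0 is dropped)
aligned = kept.reindex(df.index)   # back on the full index; the dropped row becomes NaN
filled = aligned.bfill()           # each NaN takes the next qualifying A
df["C"] = filled.shift(-1)         # look strictly ahead: row i gets the value from row i + 1
print(df)
```

The reindex step introduces NaN, which is why C ends up as a float column here.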
Upvotes: 0
Reputation: 4521
You can do that in just a few steps as follows:
import pandas as pd
# initialize column 'C' with the value of column 'A'
# for all rows where 'B' is less than or equal to 2.0
# and leave 'C' missing where 'B' > 2.0
# because normal int columns do not support null values
# we use the nullable type Int64 instead
# (introduced in pandas 0.24)
df['C'] = df['A'].astype('Int64').where(df['B'] <= 2.0)
# now just fill the gaps using the value of the next row
# in which the field is filled and shift the column
df['C'] = df['C'].bfill().shift(-1)
This results in:
>>> df
A B C
0 1 1.0 2
1 2 2.0 3
2 3 2.0 5
3 4 3.0 5
4 5 2.0 6
5 6 1.0 7
6 7 1.0 NaN
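Since the question also allows NumPy, the same backfill idea can be sketched without pandas by taking a reverse running minimum over the indices of the qualifying rows. This is only a sketch under the assumption that A and B are available as plain arrays; the names idx, nearest and nxt are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([1.0, 2.0, 2.0, 3.0, 2.0, 1.0, 1.0])
n = len(a)

mask = b <= 2.0
# Own index for qualifying rows, n (an out-of-range sentinel) for the others
idx = np.where(mask, np.arange(n), n)
# Nearest qualifying index at or after each position (running minimum from the right)
nearest = np.minimum.accumulate(idx[::-1])[::-1]
# Strictly after position i: reuse the answer computed for i + 1
nxt = np.append(nearest[1:], n)

c = np.full(n, np.nan)   # NaN where no later row qualifies
found = nxt < n
c[found] = a[nxt[found]]
print(c)
```

Like the bfill version, this avoids a Python-level loop, at the cost of a float result so that the missing last value can be NaN.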
Upvotes: 2