Reputation: 5696
I have the following dataframe. (It doesn't have to be a dataframe; a solution that works on the numpy array df.values would also be sufficient.)
np.random.seed(42)
df = pd.DataFrame(np.random.random((10,2)),columns=['a', 'b'])
df
a b
0 0.374540 0.950714
1 0.731994 0.598658
2 0.156019 0.155995
3 0.058084 0.866176
4 0.601115 0.708073
5 0.020584 0.969910
6 0.832443 0.212339
7 0.181825 0.183405
8 0.304242 0.524756
9 0.431945 0.291229
I want to add a new column whose value follows this logic:
True : if any of the b values after a particular a value (i.e. in later rows) is greater than that particular a value
False : otherwise
The expected output would be (explanations for some of the rows are shown inline):
a b c
0 0.374540 0.950714 True
1 0.731994 0.598658 True
2 0.156019 0.155995 True
3 0.058084 0.866176 True <- np.any(0.058084 < np.array([0.708073, 0.969910, 0.212339, 0.183405, 0.524756, 0.291229]))
4 0.601115 0.708073 True <- np.any(0.601115 < np.array([0.969910, 0.212339, 0.183405, 0.524756, 0.291229]))
5 0.020584 0.969910 True <- np.any(0.020584 < np.array([0.212339, 0.183405, 0.524756, 0.291229]))
6 0.832443 0.212339 False <- np.any(0.832443 < np.array([0.183405, 0.524756, 0.291229]))
7 0.181825 0.183405 True <- np.any(0.181825 < np.array([0.524756, 0.291229]))
8 0.304242 0.524756 False <- np.any(0.304242 < np.array([0.291229]))
9 0.431945 0.291229 UNDEFINED <- Ignore this
The above should be possible with a for loop, but what is the pandas/numpy way to do it?
I was trying an approach where I apply a lambda function to a, but I couldn't find a way to get the index of the respective a value to do the np.any comparison shown above. (I later discovered that apply is just syntactic sugar for a for loop, though.)
df['c'] = df['a'].apply(lambda x: np.any(x < df['b'].values[<i>:])) # Where <i> is the respective index value of x; which I didn't know how to find
Upvotes: 1
Views: 1745
Reputation: 5696
To complement the answer by @Divakar, the pandas approach using cummax() would be:
df['c'] = df['a'] < df['b'][::-1].cummax()[::-1].reset_index(drop=True).shift(-1)
print(df)
a b c
0 0.374540 0.950714 True
1 0.731994 0.598658 True
2 0.156019 0.155995 True
3 0.058084 0.866176 True
4 0.601115 0.708073 True
5 0.020584 0.969910 True
6 0.832443 0.212339 False
7 0.181825 0.183405 True
8 0.304242 0.524756 False
9 0.431945 0.291229 False
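Note that the last row comes out as False above only because comparing against NaN yields False, whereas the question treats that row as undefined. A minimal sketch of one way to keep it as a missing value, assuming the same seeded data as in the question (the names later_max and c are just illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.random((10, 2)), columns=['a', 'b'])

# Reverse cumulative max of b, shifted up by one row so that row i
# only sees the b values strictly after it. The last row becomes NaN.
later_max = df['b'][::-1].cummax()[::-1].shift(-1)

# Compare, then replace the result with NaN wherever there were no later
# b values (this upcasts the column from bool to object dtype).
c = (df['a'] < later_max).where(later_max.notna())
print(c.tolist())
```

Whether an object-dtype column with a NaN in it is preferable to a plain bool column with False in the last row depends on what is done with c afterwards.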
Upvotes: 1
Reputation: 221624
The trick is to go bottom-up on b, take the accumulated maximum values, and compare those against the corresponding values in a.
Hence, the implementation would be -
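a = df.a.values
b = df.b.values
# Reverse running max of b; [1:] drops the first entry so that row i is
# compared against max(b[i+1:]), and a[:-1] skips the last row (nothing after it)
out = a[:-1] < np.maximum.accumulate(b[::-1])[::-1][1:]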
Porting this over to pandas, the counterpart of np.maximum.accumulate would be df.cummax.
Sample run -
In [45]: df
Out[45]:
a b
0 0.374540 0.950714
1 0.731994 0.598658
2 0.156019 0.155995
3 0.058084 0.866176
4 0.601115 0.708073
5 0.020584 0.969910
6 0.832443 0.212339
7 0.181825 0.183405
8 0.304242 0.524756
9 0.431945 0.291229
In [46]: out
Out[46]: array([ True, True, True, True, True, True, False, True, False], dtype=bool)
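As a sanity check, the vectorized result can be compared against a straightforward loop. This is only a sketch, assuming the same seeded data as in the question; suffix_max and expected are names introduced here for illustration:

```python
import numpy as np

np.random.seed(42)
arr = np.random.random((10, 2))
a, b = arr[:, 0], arr[:, 1]

# Reverse cumulative max: suffix_max[i] == b[i:].max()
suffix_max = np.maximum.accumulate(b[::-1])[::-1]

# Row i must be compared with max(b[i+1:]), so drop the first suffix entry
# and the last a value (it has no b values after it).
out = a[:-1] < suffix_max[1:]

# Brute-force O(n^2) reference, used for comparison only.
expected = np.array([np.any(a[i] < b[i + 1:]) for i in range(len(a) - 1)])

assert np.array_equal(out, expected)
print(out)
```

The loop does O(n^2) comparisons in total, while the accumulate version makes a single pass over b, which is where the speedup on larger arrays comes from.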
Upvotes: 2