akilat90

Reputation: 5696

Comparing a column's value with an array (or a series) of decreasing size

I have the following dataframe. (It doesn't have to stay a dataframe; a solution that works on the numpy array df.values would also be sufficient.)

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.random((10, 2)), columns=['a', 'b'])
df

        a           b
0   0.374540    0.950714
1   0.731994    0.598658
2   0.156019    0.155995
3   0.058084    0.866176
4   0.601115    0.708073
5   0.020584    0.969910
6   0.832443    0.212339
7   0.181825    0.183405
8   0.304242    0.524756
9   0.431945    0.291229

I want to add a new column c whose value is set according to the logic below:

True: if any of the b values after that particular a value is greater than that particular a value
False: otherwise

The expected output would be: (See the explanation for some of the rows below)

       a           b      c
0   0.374540    0.950714  True
1   0.731994    0.598658  True
2   0.156019    0.155995  True
3   0.058084    0.866176  True   <- np.any(0.058084 < np.array([0.708073, 0.969910, 0.212339, 0.183405, 0.524756, 0.291229]))
4   0.601115    0.708073  True   <- np.any(0.601115 < np.array([0.969910, 0.212339, 0.183405, 0.524756, 0.291229]))
5   0.020584    0.969910  True   <- np.any(0.020584 < np.array([0.212339, 0.183405, 0.524756, 0.291229]))
6   0.832443    0.212339  False  <- np.any(0.832443 < np.array([0.183405, 0.524756, 0.291229]))
7   0.181825    0.183405  True   <- np.any(0.181825 < np.array([0.524756, 0.291229]))
8   0.304242    0.524756  False  <- np.any(0.304242  < np.array([0.291229]))
9   0.431945    0.291229  UNDEFINED <- Ignore this

The above should be possible with a for loop, but what is the pandas/numpy way to do it?

I was trying an approach where I apply a lambda function to a, but I couldn't find a way to get the index of the respective a value in order to do the np.any comparison shown above. (I have since discovered that apply is just syntactic sugar for a for loop anyway.)

df['c'] = df['a'].apply(lambda x: np.any(x < df['b'].values[<i>:])) # Where <i> is the respective index value of x; which I didn't know how to find
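
(For the record, one way to obtain that index is to apply across rows with axis=1 and read the row's label from row.name. This is only a sketch, it assumes the default RangeIndex, and it is still a Python-level loop under the hood:)

# Sketch only: row.name is the row's index label, which equals its position
# here because df uses the default RangeIndex; still a row-by-row loop.
df['c'] = df.apply(lambda row: (row['a'] < df['b'].iloc[int(row.name) + 1:]).any(), axis=1)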

Upvotes: 1

Views: 1745

Answers (2)

akilat90

Reputation: 5696

To complement the answer by @Divakar, the pandas approach using cummax() would be:

df['c'] = df['a'] < df['b'][::-1].cummax()[::-1].reset_index(drop=True).shift(-1)

print(df)  

        a         b      c
0  0.374540  0.950714   True
1  0.731994  0.598658   True
2  0.156019  0.155995   True
3  0.058084  0.866176   True
4  0.601115  0.708073   True
5  0.020584  0.969910   True
6  0.832443  0.212339  False
7  0.181825  0.183405   True
8  0.304242  0.524756  False
9  0.431945  0.291229  False

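For readability, the one-liner can be unpacked into intermediate steps (a sketch on the same df with its default RangeIndex):

# running max of b computed from the bottom, put back in original order
rev_cummax = df['b'][::-1].cummax()[::-1]
# max of b strictly after each row; the last row becomes NaN
after_max = rev_cummax.reset_index(drop=True).shift(-1)
# comparing against NaN yields False for the last row
df['c'] = df['a'] < after_max
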
Upvotes: 1

Divakar

Reputation: 221624

The trick is to go bottom-up on b, accumulate the maximum values, and compare those against the corresponding values in a.

Hence, the implementation would be -

a = df.a.values
b = df.b.values
out = a[:-1] < np.maximum.accumulate(b[::-1])[::-1][1:]

Porting over to pandas, the counterpart would be df.cummax for np.maximum.accumulate.

Sample run -

In [45]: df
Out[45]: 
          a         b
0  0.374540  0.950714
1  0.731994  0.598658
2  0.156019  0.155995
3  0.058084  0.866176
4  0.601115  0.708073
5  0.020584  0.969910
6  0.832443  0.212339
7  0.181825  0.183405
8  0.304242  0.524756
9  0.431945  0.291229

In [46]: out
Out[46]: array([ True,  True,  True,  True,  True,  True, False,  True, False], dtype=bool)
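
If one wants to write this back as a column, a possible way (not part of the original answer) is to pad the last, undefined slot before assigning:

# out covers rows 0 through 8; the question leaves the last row undefined,
# so pad it with a placeholder (False chosen here arbitrarily)
df['c'] = np.append(out, False)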

Upvotes: 2
