Pandas np.where with matching range of values on a row

Question

Test data:

In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
       {'AAA' : [4,5,6,7,9,10], 
        'BBB' : [10,20,30,40,11,10],
        'CCC' : [100,50,25,10,10,11]});
In [2]:df
Out[2]:
   AAA  BBB  CCC
0    4   10  100
1    5   20   50
2    6   30   25
3    7   40   10
4    9   11   10
5   10   10   11

In [3]: thresh = 2
        df['aligned'] = np.where(df.AAA == df.BBB,max(df.AAA)|(df.BBB),np.nan)

The np.where statement above provides max(df.AAA or df.BBB) only when df.AAA and df.BBB are exactly equal. I would like to get the max when the columns are within thresh of each other, and to consider all columns. It does not have to be via np.where. What ways are there to approach this?

So for row 5 it should be 11.0 in df.aligned as this is the max value and within thresh of df.AAA and df.BBB.

Ultimately I am looking for ways to find levels across multiple columns where the values are closely aligned.

Current Output with my code:

df
   AAA  BBB  CCC  aligned
0    4   10  100      NaN
1    5   20   50      NaN
2    6   30   25      NaN
3    7   40   10      NaN
4    9   11   10      NaN
5   10   10   11     10.0

Desired Output:

df
   AAA  BBB  CCC  aligned
0    4   10  100      NaN
1    5   20   50      NaN
2    6   30   25      NaN
3    7   40   10      NaN
4    9   11   10     11.0
5   10   10   11     11.0

The desired output shows rows 4 and 5 with values in df.aligned, because these rows contain values within thresh of each other (10 and 11 are within the range specified by the thresh variable).

Julien Marrec · Accepted Answer

"Within thresh distance" to me means that the difference between the max and the min of a row should be less than thresh. We can use DataFrame.apply with parameter axis=1 so that we apply the lambda function on each row.

In [1]: filt_thresh = df.apply(lambda x: (x.max() - x.min()) < thresh, axis=1)



Alternatively there's a faster solution as pointed out below by @root:

filt_thresh = np.ptp(df.values, axis=1) < thresh
10000 loops, best of 3: 48.9 µs per loop


Or, staying with pandas:

filt_thresh = df.max(axis=1) - df.min(axis=1) < thresh
1000 loops, best of 3: 943 µs per loop
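
As a quick sanity check, the three variants can be run side by side on the question's test data. This is only a sketch: the names mask_apply, mask_ptp and mask_pandas are made up for illustration, and it assumes a fresh df without an 'aligned' column.

```python
import numpy as np
import pandas as pd

# Test data from the question.
df = pd.DataFrame({'AAA': [4, 5, 6, 7, 9, 10],
                   'BBB': [10, 20, 30, 40, 11, 10],
                   'CCC': [100, 50, 25, 10, 10, 11]})
thresh = 2

# Row-wise apply: (max - min) of each row compared to thresh.
mask_apply = df.apply(lambda x: (x.max() - x.min()) < thresh, axis=1)

# NumPy peak-to-peak (max - min) along each row.
mask_ptp = np.ptp(df.to_numpy(), axis=1) < thresh

# Pure pandas: row max minus row min.
mask_pandas = df.max(axis=1) - df.min(axis=1) < thresh

# All three masks should agree element by element.
same = mask_apply.tolist() == mask_ptp.tolist() == mask_pandas.tolist()
print(same)
```

Only row 5 (range 1) passes the filter. Row 4 has a range of exactly 2, which is not strictly less than thresh.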


We can now use boolean indexing and calculate the max of each row that matches (hence the axis=1 parameter in max() again):

In [2]:  df.loc[filt_thresh, 'aligned'] = df[filt_thresh].max(axis=1)

In [3]: df
Out[3]: 
   AAA  BBB  CCC  aligned
0    4   10  100      NaN
1    5   20   50      NaN
2    6   30   25      NaN
3    7   40   10      NaN
4    9   11   10      NaN
5   10   10   11     11.0
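
Since the question started from np.where, the same result can probably also be written that way. The following is a minimal sketch, not part of the original answer; row_max and row_range are illustrative names, and the row statistics are computed before 'aligned' is added so that the new column does not influence them.

```python
import numpy as np
import pandas as pd

# Test data from the question.
df = pd.DataFrame({'AAA': [4, 5, 6, 7, 9, 10],
                   'BBB': [10, 20, 30, 40, 11, 10],
                   'CCC': [100, 50, 25, 10, 10, 11]})
thresh = 2

# Compute row statistics first, while df holds only the data columns.
row_max = df.max(axis=1)
row_range = row_max - df.min(axis=1)

# Keep the row max where the whole row spans less than thresh, NaN elsewhere.
df['aligned'] = np.where(row_range < thresh, row_max, np.nan)
print(df)
```

This does the filtering and the assignment in one step, at the cost of computing the max for every row rather than only for the matching ones.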




Update:

If you wanted to calculate the minimum distance between elements for each row, that would be equivalent to sorting the array of values (np.sort()), calculating the difference between consecutive numbers (np.diff), and taking the min of the resulting array. Finally, compare that to thresh.

Here's the apply version, which has the advantage of being a bit easier to understand.

filt_thresh = df.apply(lambda row: np.min(np.diff(np.sort(row))) < thresh, axis=1)

1000 loops, best of 3: 713 µs per loop


And here's the vectorized equivalent:

filt_thresh = np.diff(np.sort(df)).min(axis=1) < thresh

The slowest run took 4.31 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000 loops, best of 3: 67.3 µs per loop
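
To see what the minimum-distance filter produces on the question's test data, here is a small end-to-end sketch. It assumes a fresh df with only the three data columns. The names values and gaps are illustrative, and using np.where for the final assignment is a choice made here rather than something taken from the answer above.

```python
import numpy as np
import pandas as pd

# Test data from the question.
df = pd.DataFrame({'AAA': [4, 5, 6, 7, 9, 10],
                   'BBB': [10, 20, 30, 40, 11, 10],
                   'CCC': [100, 50, 25, 10, 10, 11]})
thresh = 2

values = df[['AAA', 'BBB', 'CCC']].to_numpy()

# Sort each row, then take differences between consecutive sorted values.
# Row 4: [9, 10, 11] -> gaps [1, 1]; row 5: [10, 10, 11] -> gaps [0, 1].
gaps = np.diff(np.sort(values, axis=1), axis=1)

# A row matches when its two closest values are less than thresh apart.
filt_thresh = gaps.min(axis=1) < thresh

# Assign the row max to matching rows, NaN to the rest.
df['aligned'] = np.where(filt_thresh, values.max(axis=1), np.nan)
print(df)
```

With this filter, rows 4 and 5 both get 11.0, which matches the desired output in the question. Note that it only requires the two closest values in a row to be within thresh, so the row max it reports is not necessarily one of those two values.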
