Reputation: 5498
Test data:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'AAA' : [4,5,6,7,9,10],
'BBB' : [10,20,30,40,11,10],
'CCC' : [100,50,25,10,10,11]});
In [2]:df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 25
3 7 40 10
4 9 11 10
5 10 10 11
In [3]: thresh = 2
df['aligned'] = np.where(df.AAA == df.BBB,max(df.AAA)|(df.BBB),np.nan)
The following np.where
statement provides max(df.AAA or df.BBB)
when df.AAA
and df.BBB
are exactly aligned. I would like to have the max
when the columns are within the value in thresh
and also consider all columns. It does not have to be via np.where
. What ways are there to approach this?
So for row 5 it should be 11.0
in df.aligned
as this is the max value and within thresh
of df.AAA
and df.BBB
.
Ultimately I am looking for ways to find levels across multiple columns where the values are closely aligned.
Current Output with my code:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 10.0
Desired Output:
df
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 11.0
5 10 10 11 11.0
The desired output shows rows 4 and 5 with values on df.aligned
. As these have values within thresh
of each other (values 10 and 11 are within the range specified in thresh
variable).
Upvotes: 1
Views: 2364
Reputation: 11905
"Within thresh distance" to me means that the difference between the max
and the min
of a row should be less than thresh
. We can use DataFrame.apply
with parameter axis=1
so that we apply the lambda function on each row.
In [1]: filt_thresh = df.apply(lambda x: (x.max() - x.min())<thresh, axis=1)
100 loops, best of 3: 1.89 ms per loop
Alternatively there's a faster solution as pointed out below by @root:
filt_thresh = np.ptp(df.values, axis=1) < tresh
10000 loops, best of 3: 48.9 µs per loop
Or, staying with pandas:
filt_thresh = df.max(axis=1) - df.min(axis=1) < thresh
1000 loops, best of 3: 943 µs per loop
We can now use boolean indexing and calculate the max of each row that matches (hence the axis=1
parameter in max()
again):
In [2]: df.loc[filt_thresh, 'aligned'] = df[filt_thresh].max(axis=1)
In [3]: df
Out[3]:
AAA BBB CCC aligned
0 4 10 100 NaN
1 5 20 50 NaN
2 6 30 25 NaN
3 7 40 10 NaN
4 9 11 10 NaN
5 10 10 11 11.0
If you wanted to calculate the minimum distance between elements for each row, that'd be equivalent to sorting the array of values (np.sort()
), calculating the difference between consecutive numbers (np.diff), and taking the min
of the resulting array. Finally, compare that to tresh
.
Here's the apply
way that has the advantage of being a bit clearer to understand.
filt_thresh = df.apply(lambda row: np.min(np.diff(np.sort(row))) < thresh, axis=1)
1000 loops, best of 3: 713 µs per loop
And here's the vectorized equivalent:
filt_thresh = np.diff(np.sort(df)).min(axis=1) < thresh
The slowest run took 4.31 times longer than the fastest.
This could mean that an intermediate result is being cached.
10000 loops, best of 3: 67.3 µs per loop
Upvotes: 3