Reputation: 15
I am interested in comparing all values from 1 dataframe column with all values from a 2nd column and then generating a list or a subset df with values from a 3rd column that is adjacent to the 1st column hits. Hopefully this example will explain it better:
For a simplified example, say I generate the following pandas dataframe:
fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})
print fake_df
What I am interested in doing is finding all values in column 'm' that are within 0.1 of any value in column 'mod' and return the values in column 'n' that correspond to the column 'm' hits. So for the above code, the return would be: 10.2, 2.0, 1.1 (since 101,201 and 501 all have close hits in column 'mod').
I have found ways to compare across the same row, but not like above. Is there a way to do this in pandas without extensive loops? Thanks!
Upvotes: 0
Views: 1294
Reputation: 1728
I'll be making of numpy (imported as np) which pandas uses under the hood. np.isclose returns a boolean indexer: for each value of the iterable, there's a True
or False
value corresponding to the value m
being within atol
of each value of df["mod"]
.
>>> for i, m in df["m"].iteritems():
... indices = np.isclose(m, df["mod"], atol=0.1)
... if any(indices):
... print df["n"][i]
Using the DataFrame you gave produces the output:
10.2
2.0
1.1
Upvotes: 1
Reputation: 1273
I don't know such method in pandas, but when you extend your scope to include numpy, two options come to mind.
If you can live with N**2 memory overhead, you can do numpy broadcasting to find out all "adjacent" elements in one step:
In [25]: fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})
In [26]: mvals = fake_df['m'].values
In [27]: modvals = fake_df['mod'].values
In [28]: is_close = np.abs(mvals - modvals[:, np.newaxis]) <= 0.1; is_close.astype(int)
Out[28]:
array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]])
As we care only about 'mod' values that have adjacent 'm's, aggregate over axis=0:
In [29]: is_close.any(axis=0).astype(int)
Out[29]: array([0, 0, 1, 0, 1, 1, 0, 0, 0, 0])
Or otherwise
In [30]: fake_df.ix[is_close.any(axis=0), 'n']
Out[30]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
To find adjacent elements in less than O(N**2) without any hashing/rounding tricks, you have to do some sorting:
In [103]: modvals_sorted = np.sort(modvals)
In [104]: next_indices = np.searchsorted(modvals_sorted, mvals)
You have indices of next elements, but they may point beyond the original
array, so you need a one extra NaN
at the end to avoid IndexError
. Same
logic applies to previous elements which are next_indices - 1
: to avoid
indexing before the first element we must prepend one NaN
, too. Note the + 1
that arises because one of NaN
has been added to the beginning.
In [105]: modvals_sorted_plus = np.r_[np.nan, modvals_sorted, np.nan]
In [106]: nexts = modvals_sorted_plus[next_indices + 1]
In [107]: prevs = modvals_sorted_plus[(next_indices - 1) + 1]
Now it's trivial. Note that we already have prevs <= mvals <= nexts
, so we
don't need to use np.abs
. Also, all missing elements are NaN
and comparing with them results in False
that doesn't alter the result of any
operation.
In [108]: adjacent = np.c_[prevs, mvals, nexts]; adjacent
Out[108]:
array([[ nan, 100. , 101.001],
[ 102.001, 120. , 121.001],
[ nan, 101. , 101.001],
[ 121.001, 200. , 201.001],
[ 121.001, 201. , 201.001],
[ 421.001, 501. , 501.001],
[ 202.001, 350. , 351.001],
[ 351.001, 420. , 421.001],
[ 502.001, 525. , 526.001],
[ 421.001, 500. , 501.001]])
In [109]: (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
Out[109]: array([False, False, True, False, True, True, False, False, False, False], dtype=bool)
In [110]: mask = (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
In [112]: fake_df.ix[mask, 'n']
Out[112]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
Upvotes: 2
Reputation: 1588
Try the following:
# I assume all arrays involved to be or to be converted to numpy arrays
import numpy as np
m = np.array([100,120,101,200,201,501,350,420,525,500])
n = np.array([10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0])
mod = np.array([101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001])
res = []
# for each entry in mod, look in m for "close" values
for i in range(len(mod)):
# for each hit, store entry from n in result list
res.extend(n[np.fabs(mod[i]-m)<=0.1])
# cast result to numpy array
res = np.array(res)
print res
The output is
[ 10.2 2. 1.1]
Upvotes: 1