Pete

Reputation: 15

Compare all values in one column with all values in another column and return indexes

I am interested in comparing all values from one dataframe column with all values from a second column, and then generating a list (or a subset dataframe) of values from a third column that are adjacent to the hits in the first column. Hopefully this example will explain it better:

For a simplified example, say I generate the following pandas dataframe:

fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
                  'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
                  'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})
print(fake_df)

What I am interested in doing is finding all values in column 'm' that are within 0.1 of any value in column 'mod', and returning the values in column 'n' that correspond to those column 'm' hits. So for the above code, the result would be: 10.2, 2.0, 1.1 (since 101, 201 and 501 all have close hits in column 'mod').

I have found ways to compare across the same row, but not like above. Is there a way to do this in pandas without extensive loops? Thanks!

Upvotes: 0

Views: 1294

Answers (3)

Kracit

Reputation: 1728

I'll be making use of numpy (imported as np), which pandas uses under the hood; below, df is the fake_df from the question. np.isclose returns a boolean indexer: for each value of df["mod"], it holds True or False depending on whether m is within atol of that value.

>>> for i, m in df["m"].items():
...     indices = np.isclose(m, df["mod"], atol=0.1)
...     if any(indices):
...         print(df["n"][i])

Running this on the DataFrame you gave produces the output:

10.2
2.0
1.1
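
The loop can also be dropped entirely: np.isclose broadcasts, so comparing a column vector of the "m" values against "mod" builds the whole boolean matrix in one call. A minimal sketch of that variant, assuming the same df:

>>> # close[i, j]: is m[i] within atol of mod[j]?
>>> close = np.isclose(df["m"].values[:, np.newaxis], df["mod"].values, atol=0.1)
>>> df.loc[close.any(axis=1), "n"]
2    10.2
4     2.0
5     1.1
Name: n, dtype: float64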

Upvotes: 1

immerrr

Reputation: 1273

I don't know of such a method in pandas, but when you extend your scope to include numpy, two options come to mind.

Easy/Expensive Method

If you can live with O(N**2) memory overhead, you can use numpy broadcasting to find all "adjacent" elements in one step:

In [25]: fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
                  'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
                  'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})

In [26]: mvals = fake_df['m'].values

In [27]: modvals = fake_df['mod'].values

In [28]: is_close = np.abs(mvals - modvals[:, np.newaxis]) <= 0.1; is_close.astype(int)
Out[28]: 
array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]])

As we care only about 'm' values that have an adjacent value in 'mod', aggregate over axis=0:

In [29]: is_close.any(axis=0).astype(int)
Out[29]: array([0, 0, 1, 0, 1, 1, 0, 0, 0, 0])

Or, to pull out the corresponding 'n' values directly:

In [30]: fake_df.loc[is_close.any(axis=0), 'n']
Out[30]: 
2    10.2
4     2.0
5     1.1
Name: n, dtype: float64

Efficient/Complex Method

To find adjacent elements in less than O(N**2) time, without any hashing/rounding tricks, you have to do some sorting:

In [103]: modvals_sorted = np.sort(modvals)

In [104]: next_indices = np.searchsorted(modvals_sorted, mvals)

You now have the indices of the next elements, but they may point one past the end of the sorted array, so you need one extra NaN at the end to avoid an IndexError. The same logic applies to the previous elements, which sit at next_indices - 1: to avoid indexing before the first element, we prepend one NaN as well. Note the + 1 in the indexing below, which compensates for the NaN prepended at the beginning.

In [105]: modvals_sorted_plus = np.r_[np.nan, modvals_sorted, np.nan]

In [106]: nexts = modvals_sorted_plus[next_indices + 1]

In [107]: prevs = modvals_sorted_plus[(next_indices - 1) + 1]

Now the check is trivial. Note that we already have prevs <= mvals <= nexts, so we don't need np.abs. Also, all missing neighbours are NaN, and comparing with NaN yields False, so they never alter the result of any().

In [108]: adjacent = np.c_[prevs, mvals, nexts]; adjacent
Out[108]: 
array([[     nan,  100.   ,  101.001],
       [ 102.001,  120.   ,  121.001],
       [     nan,  101.   ,  101.001],
       [ 121.001,  200.   ,  201.001],
       [ 121.001,  201.   ,  201.001],
       [ 421.001,  501.   ,  501.001],
       [ 202.001,  350.   ,  351.001],
       [ 351.001,  420.   ,  421.001],
       [ 502.001,  525.   ,  526.001],
       [ 421.001,  500.   ,  501.001]])

In [109]: (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
Out[109]: array([False, False,  True, False,  True,  True, False, False, False, False], dtype=bool)

In [110]: mask = (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)

In [112]: fake_df.loc[mask, 'n']
Out[112]: 
2    10.2
4     2.0
5     1.1
Name: n, dtype: float64
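
For reference, the steps above fit into a single helper; a minimal sketch (the name n_values_near is made up for illustration):

def n_values_near(df, tol=0.1):
    # Sort 'mod' once, then binary-search each 'm' value into it.
    mvals = df['m'].values
    modvals_sorted = np.sort(df['mod'].values)
    next_indices = np.searchsorted(modvals_sorted, mvals)
    # NaN padding on both ends: out-of-range neighbours become NaN,
    # and comparisons with NaN are simply False.
    padded = np.r_[np.nan, modvals_sorted, np.nan]
    nexts = padded[next_indices + 1]
    prevs = padded[next_indices]  # same as (next_indices - 1) + 1
    adjacent = np.c_[prevs, mvals, nexts]
    mask = (np.diff(adjacent, axis=1) <= tol).any(axis=1)
    return df.loc[mask, 'n']

Calling n_values_near(fake_df) returns the same Series as Out[112] above.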

Upvotes: 2

jkalden

Reputation: 1588

Try the following:

# I assume all arrays involved are (or have been converted to) numpy arrays
import numpy as np
m = np.array([100,120,101,200,201,501,350,420,525,500])
n = np.array([10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0])
mod = np.array([101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001])

res = []
# for each entry in mod, look in m for "close" values
for i in range(len(mod)):
    # for each hit, store entry from n in result list
    res.extend(n[np.fabs(mod[i]-m)<=0.1])
# cast result to numpy array
res = np.array(res)
print(res)

The output is

[ 10.2   2.    1.1]
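
The explicit loop can also be replaced by a single broadcast; a sketch of the equivalent vectorized form (note it would list each n value only once, even if several mod entries matched the same m):

# hits[i, j]: is mod[i] within 0.1 of m[j]?
hits = np.fabs(mod[:, np.newaxis] - m) <= 0.1
res = n[hits.any(axis=0)]
print(res)  # [ 10.2   2.    1.1]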

Upvotes: 1
