dleal
dleal

Reputation: 2314

compare whether value in one column is between two values in another column python pandas

I have two data frames as follow:

A = pd.DataFrame({"value":[3, 7, 5 ,18,23,27,21,29]})

B = pd.DataFrame({"low":[1, 6, 11 ,16,21,26], "high":[5,10,15,20,25,30], "name":["one","two","three","four","five", "six"]})

I want to find whether "value" in A is between 'high' and low' in B, and if so, I want to copy the column name from B to A.

The output should look like this:

A = pd.DataFrame({"value":[3, 7, 5 ,18,23,27,21,29], "name":["one","two","one","four","five", "six", "five", "six"]})

My function uses iterrows as follows:

def func1(row):
    x = row['value']
    for index,value in B.iterrows():
        if ((value['low'] <= x) &(x<=value['high'])):
            return value['name']

But it doesn't yet achieve what i want to do,

thank you,

Upvotes: 3

Views: 2745

Answers (2)

Alexander
Alexander

Reputation: 109546

You can use a list comprehension to iterate through the values in A, and then use loc to get the relevant mapped values. le is less than or equal to, and ge is greater than or equal to.

For example, v = 3 in the first row. Using simple boolean indexing:

>>> B[(B['low'].le(v)) & (B['high'].ge(v))]
   high  low name
0     5    1  one

Assuming that DataFrame B does not have any overlapping ranges, then you will get back one row as above. One then uses loc to get the name column, as below. Because each returned name is a a series, you need get the first and only scalar value (using iat, for example).

A['name'] = [B.loc[(B['low'].le(v)) & (B['high'].ge(v)), 'name'].iat[0] 
             for v in A['value']]

>>> A
   value  name
0      3   one
1      7   two
2      5   one
3     18  four
4     23  five
5     27   six
6     21  five
7     29   six

Upvotes: 4

juanpa.arrivillaga
juanpa.arrivillaga

Reputation: 95948

I believe you are looking for something like this:

In [1]: import pandas as pd

In [2]: A = pd.DataFrame({"value":[3, 7, 5 ,18,23,27,21,29]})

In [3]: 

In [3]: B = pd.DataFrame({"low":[1, 6, 11 ,16,21,26], "high":[5,10,15,20,25,30], "name":["one","two","three","four","five", "six"]})

In [4]: A
Out[4]: 
   value
0      3
1      7
2      5
3     18
4     23
5     27
6     21
7     29

In [5]: B
Out[5]: 
   high  low   name
0     5    1    one
1    10    6    two
2    15   11  three
3    20   16   four
4    25   21   five
5    30   26    six

In [6]: def func1(x):
   ...:     for row in B.itertuples():
   ...:         if row.low <= x <= row.high:
   ...:             return row.name
   ...:         

In [7]: A.value.map(func1)
Out[7]: 
0     one
1     two
2     one
3    four
4    five
5     six
6    five
7     six
Name: value, dtype: object

In [8]: A['name'] = A['value'].map(func1)

In [9]: A
Out[9]: 
   value  name
0      3   one
1      7   two
2      5   one
3     18  four
4     23  five
5     27   six
6     21  five
7     29   six

I use itertuples because it should be a little bit faster but in general this will not be very efficient. This is a solution but there might be better ones.

Edited to Add:

In [8]: timeit A['value'].map(func1)
100 loops, best of 3: 10.5 ms per loop

In [9]: timeit [B.loc[(B['low'].le(v)) & (B['high'].ge(v)), 'name'].tolist()[0] for v in A['value']]
100 loops, best of 3: 9.06 ms per loop

Quick and dirty test shows that Alexander's approach is faster. I wonder how it scales.

Upvotes: 1

Related Questions