MaxPY
MaxPY

Reputation: 1366

Efficiently check if value is present in any of given ranges

I have two pandas DataFrame objects:

The goal is to efficiently create a boolean mask indicating if date is in [start, finish] interval

Naive iterating taking too much time, I guess there is a method to do that faster

UPDATE: A and B have different number of rows

UPDATE2: Sample:

A
    | start     | finish    |
    |-------    |--------   |
    | 1         | 3         |
    | 50        | 83        |
    | 30        | 42        |

B
    | date      | 
    |-------    |
    | 31        | 
    | 20        | 
    | 2.5       |
    | 84        |
    | 1000      |

Output:
            | in_interval | 
            |-------    |
            | True      | 
            | False     | 
            | True      |
            | False     |
            | False     |

P.S. I have my data in datetime format but I guess that the solution will not differ from one for numbers

Upvotes: 4

Views: 1614

Answers (2)

ysearka
ysearka

Reputation: 3855

IIUC you want the output to be True if there is at least one interval in which the date is?

Is an apply(lambda) efficient enough for you? (It might be a little long for a big dataframe as it iterates over the rows of B). If it is, you can try this:

def in_range(date,start,finish):
    return (True in ((start < date) & (date < finish)).unique())

B.date.apply(lambda x: in_range(x,A.start,A.finish))

Output:

0     True
1    False
2     True
3    False
4    False

EDIT: MaxU's answer works better in fact. Here are the timers for 10 000 rows dataframes (A and B):

%timeit B2.date.apply(lambda x: in_range(x,A2.start,A2.finish))
1 loop, best of 3: 9.82 s per loop

%timeit B2.date.apply(lambda x: ((x >= A2.start) & (x <= A2.finish)).any())
1 loop, best of 3: 7.31 s per loop

Upvotes: 1

Guillaume Thomas
Guillaume Thomas

Reputation: 2310

You can do it with a O(n) complexity. The idea is to transform the representation. In A, you store one row per interval. I would suggest a dataframe which stores one row per transition (ie entering an interval, leaving an interval).

A = pd.DataFrame(
    data={
        'start': [1, 50, 30],
        'finish': [3, 83, 42]    
    }
)

starts = pd.DataFrame(data={'start': 1}, index=A.start.tolist())
finishs = pd.DataFrame(data={'finish': -1}, index=A.finish.tolist())
transitions = pd.merge(starts, finishs, how='outer', left_index=True, right_index=True).fillna(0)
transitions

    start  finish
1       1       0
3       0      -1
30      1       0
42      0      -1
50      1       0
83      0      -1

this dataframe stores per date the type of transitions. Now, we need to know at each date if we are in an interval or not. It looks like counting the opening & closing parenthesis. You can do:

transitions['transition'] = (transitions.pop('finish') + transitions.pop('start')).cumsum()
transitions

    transition
1            1
3            0
30           1
42           0
50           1
83           0

Here it says:

  • At 1, i'm in an interval
  • At 3, i'm not
  • In general, if the value is strictly greater than 0, it's in an interval.
  • Note that this handles overlapping interval

And now you merge with your B dataframe:

B = pd.DataFrame(
    index=[31, 20, 2.5, 84, 1000]
)

pd.merge(transitions, B, how='outer', left_index=True, right_index=True).fillna(method='ffill').loc[B.index].astype(bool)

       transition
31.0         True
20.0        False
2.5          True
84.0        False
1000.0      False

Upvotes: 3

Related Questions