Efficiently check if value is present in any of given ranges

Question

I have two pandas DataFrame objects:

A contains 'start' and 'finish' columns
B has column 'date'

The goal is to efficiently create a boolean mask indicating if date is in [start, finish] interval

Naive iterating taking too much time, I guess there is a method to do that faster

UPDATE: A and B have different number of rows

UPDATE2: Sample:

A
    | start     | finish    |
    |-------    |--------   |
    | 1         | 3         |
    | 50        | 83        |
    | 30        | 42        |

B
    | date      | 
    |-------    |
    | 31        | 
    | 20        | 
    | 2.5       |
    | 84        |
    | 1000      |

Output:
            | in_interval | 
            |-------    |
            | True      | 
            | False     | 
            | True      |
            | False     |
            | False     |

P.S. I have my data in datetime format but I guess that the solution will not differ from one for numbers

Guillaume Thomas · Accepted Answer

You can do it with a O(n) complexity. The idea is to transform the representation. In A, you store one row per interval. I would suggest a dataframe which stores one row per transition (ie entering an interval, leaving an interval).

A = pd.DataFrame(
    data={
        'start': [1, 50, 30],
        'finish': [3, 83, 42]    
    }
)

starts = pd.DataFrame(data={'start': 1}, index=A.start.tolist())
finishs = pd.DataFrame(data={'finish': -1}, index=A.finish.tolist())
transitions = pd.merge(starts, finishs, how='outer', left_index=True, right_index=True).fillna(0)
transitions

    start  finish
1       1       0
3       0      -1
30      1       0
42      0      -1
50      1       0
83      0      -1

this dataframe stores per date the type of transitions. Now, we need to know at each date if we are in an interval or not. It looks like counting the opening & closing parenthesis. You can do:

transitions['transition'] = (transitions.pop('finish') + transitions.pop('start')).cumsum()
transitions

    transition
1            1
3            0
30           1
42           0
50           1
83           0

Here it says:

At 1, i'm in an interval
At 3, i'm not
In general, if the value is strictly greater than 0, it's in an interval.
Note that this handles overlapping interval

And now you merge with your B dataframe:

B = pd.DataFrame(
    index=[31, 20, 2.5, 84, 1000]
)

pd.merge(transitions, B, how='outer', left_index=True, right_index=True).fillna(method='ffill').loc[B.index].astype(bool)

       transition
31.0         True
20.0        False
2.5          True
84.0        False
1000.0      False

Efficiently check if value is present in any of given ranges

Answers (2)

Related Questions