Reputation: 1366
I have two pandas DataFrame objects:
A
contains 'start'
and 'finish'
columns
B
has column 'date'
The goal is to efficiently create a boolean mask indicating if date
is in [start, finish]
interval
Naive iterating taking too much time, I guess there is a method to do that faster
UPDATE:
A
and B
have different number of rows
UPDATE2: Sample:
A
| start | finish |
|------- |-------- |
| 1 | 3 |
| 50 | 83 |
| 30 | 42 |
B
| date |
|------- |
| 31 |
| 20 |
| 2.5 |
| 84 |
| 1000 |
Output:
| in_interval |
|------- |
| True |
| False |
| True |
| False |
| False |
P.S. I have my data in datetime format but I guess that the solution will not differ from one for numbers
Upvotes: 4
Views: 1614
Reputation: 3855
IIUC you want the output to be True
if there is at least one interval in which the date is?
Is an apply(lambda)
efficient enough for you? (It might be a little long for a big dataframe as it iterates over the rows of B
). If it is, you can try this:
def in_range(date,start,finish):
return (True in ((start < date) & (date < finish)).unique())
B.date.apply(lambda x: in_range(x,A.start,A.finish))
Output:
0 True
1 False
2 True
3 False
4 False
EDIT: MaxU's answer works better in fact. Here are the timers for 10 000 rows dataframes (A and B):
%timeit B2.date.apply(lambda x: in_range(x,A2.start,A2.finish))
1 loop, best of 3: 9.82 s per loop
%timeit B2.date.apply(lambda x: ((x >= A2.start) & (x <= A2.finish)).any())
1 loop, best of 3: 7.31 s per loop
Upvotes: 1
Reputation: 2310
You can do it with a O(n) complexity. The idea is to transform the representation. In A, you store one row per interval. I would suggest a dataframe which stores one row per transition (ie entering an interval, leaving an interval).
A = pd.DataFrame(
data={
'start': [1, 50, 30],
'finish': [3, 83, 42]
}
)
starts = pd.DataFrame(data={'start': 1}, index=A.start.tolist())
finishs = pd.DataFrame(data={'finish': -1}, index=A.finish.tolist())
transitions = pd.merge(starts, finishs, how='outer', left_index=True, right_index=True).fillna(0)
transitions
start finish
1 1 0
3 0 -1
30 1 0
42 0 -1
50 1 0
83 0 -1
this dataframe stores per date the type of transitions. Now, we need to know at each date if we are in an interval or not. It looks like counting the opening & closing parenthesis. You can do:
transitions['transition'] = (transitions.pop('finish') + transitions.pop('start')).cumsum()
transitions
transition
1 1
3 0
30 1
42 0
50 1
83 0
Here it says:
And now you merge with your B dataframe:
B = pd.DataFrame(
index=[31, 20, 2.5, 84, 1000]
)
pd.merge(transitions, B, how='outer', left_index=True, right_index=True).fillna(method='ffill').loc[B.index].astype(bool)
transition
31.0 True
20.0 False
2.5 True
84.0 False
1000.0 False
Upvotes: 3