lostsoul29
lostsoul29

Reputation: 756

Find overlapping timestamps using pandas

I have a dataframe containing start times, end times and transaction_ids like so:

    tid       starttime        endtime
0   0.0     1537204247.00   1537204309.00
1   1.0     1537204248.00   1537204309.00
2   21.0    1537207170.00   1537207196.00

I need to find overlapping transactions. So far, the most optimized code I've been able to produce is the following:

p['overlap'] = False # This is my original dataframe

def compute_overlaps(df):
    for i, row_curr in df.iterrows():
        if( p.loc[row_curr['ix']]['overlap'] != True ):
            overlap_indexes = df[(row_curr['ix'] != df['ix']) & (row_curr['starttime'] < df['endtime']) & (df['starttime'] < row_curr['endtime'])].index
            p['overlap'].loc[row_curr['ix']] = True
            p['overlap'].loc[overlap_indexes] = True

<p_grouped_by_something>.apply(compute_overlaps)

Output:

    tid       starttime        endtime     overlap
0   0.0     1537204247.00   1537204309.00   True
1   1.0     1537204248.00   1537204309.00   True
2   21.0    1537207170.00   1537207196.00   False

Note that for each transaction, I merely need to determine if it overlaps with at most one other transaction. If one is found, I don't need to check all other transactions; I can stop there and mark it as overlapping.

Initially, I had a nested for loop using iterrows that was abominably slow. I was then able to vectorize the inner loop, but the outer loop remains. Is there any way to vectorize the overall computation to make it run faster?

Upvotes: 1

Views: 930

Answers (1)

BENY
BENY

Reputation: 323226

You can using numpy boradcast

s1=df.starttime.values
s2=df.endtime.values
sum(np.minimum(s2[:,None],s2)-np.maximum(s1[:,None],s1)>0)>1
Out[36]: array([ True,  True, False])

Explanation :

1st : over lap for range

(x1,y1) with (x2,y2) min(y2,y1)-max(x1,x2)>0 then two ranges have overlap

2nd : why it need great than 2 , since I using numpy braod cast , so the diagonal always represented itself comparison . Then we need greater two .

Update :

Assuming you have df and split df1 ....dfn (look at np.split)

s1=df.starttime.values
s2=df.endtime.values
l=[df1,df2,df3,df4,df5...]
n=[]
for  x in l: 
       n.append(sum(np.minimum(s2[:,None],x.values)-np.maximum(s1[:,None],x.values)>0)>1)

Upvotes: 1

Related Questions