MattiH

Reputation: 654

How to find the count of rows between two timestamps?

I have a table called df_1:

   start_time  end_time
0           0        30
1          77       107
2         220       250
3         268       298
4         391       421

I would like to know how many rows of another dataframe, df_2, fall between each start_time and end_time of df_1. df_2 looks like this:

   time
0   100
1   392
2   399

So the result would look like this:

   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2

Any idea how this could be done?
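For reference, the two frames above can be constructed like this (a minimal sketch matching the tables):

import pandas as pd

df_1 = pd.DataFrame({'start_time': [0, 77, 220, 268, 391],
                     'end_time': [30, 107, 250, 298, 421]})
df_2 = pd.DataFrame({'time': [100, 392, 399]})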

Upvotes: 1

Views: 482

Answers (4)

MattiH

Reputation: 654

I found a better way that avoids the implicit row-wise loop in apply.

It's more verbose, but faster (about 20% faster than the lambda solutions in the other answers).

# Turn df_2 into a wide frame with one row per interval in df_1
times = df_2[['time']].T                                 # 1 row, one column per time
df_3 = times.loc[times.index.repeat(len(df_1))].copy()   # one copy of that row per interval
df_3.index = df_1.index                                  # align indices with df_1

# Compare every time against every interval (note: strict inequalities)
events_after_start = df_3.T > df_1['start_time']
events_before_end = df_3.T < df_1['end_time']
df_1['Count'] = (events_after_start & events_before_end).T.sum(axis=1)
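For reference, the same broadcasted comparison can be written with plain NumPy, which avoids the transpose bookkeeping (a sketch, not part of the original answer; it assumes df_1 and df_2 still hold the original data):

import numpy as np

times = df_2['time'].to_numpy()                  # shape (m,)
starts = df_1['start_time'].to_numpy()[:, None]  # shape (n, 1)
ends = df_1['end_time'].to_numpy()[:, None]      # shape (n, 1)
# Broadcasting yields an (n, m) boolean grid; each row sum counts the hits
df_1['Count'] = ((times > starts) & (times < ends)).sum(axis=1)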

Upvotes: 1

sammywemmy

Reputation: 28644

There are a number of options for this; the first involves IntervalIndex:

import pandas as pd
from collections import Counter

# Build one interval per row of df1; closed='both' makes the ends inclusive
intervals = pd.IntervalIndex.from_arrays(df1.start_time, df1.end_time, closed='both')

df1.index = intervals

# Map each time in df2 to the interval containing it
# (get_indexer returns -1 for times outside every interval; all times match here)
filtered = intervals[intervals.get_indexer(df2.time)]

counts = Counter(filtered)

df1.loc[counts.keys(), 'count'] = [*counts.values()]

df1.fillna(0, downcast='infer').reset_index(drop=True)


   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
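If only the counts are needed, the Counter/loc step can be replaced with numpy.bincount over the indexer positions (a sketch building on the intervals variable above, not part of the original answer):

import numpy as np

pos = intervals.get_indexer(df2.time)  # interval position per time, -1 if none
pos = pos[pos >= 0]                    # drop times outside every interval
df1['count'] = np.bincount(pos, minlength=len(intervals))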

The other option involves conditional_join from pyjanitor, which can help to abstract inequality joins:

# pip install pyjanitor
import pandas as pd
import janitor

(df1.conditional_join(
        df2,
        ('start_time', 'time', '<='),
        ('end_time', 'time', '>='),
        how='left')
    # 'count' ignores the NaN times left by unmatched intervals, so they become 0
    .groupby(['start_time', 'end_time'], as_index=False)
    .agg(count=('time', 'count'))
)

   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
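If pyjanitor isn't available, a plain-pandas cross join gives the same result on small frames, at the cost of materialising every (interval, time) pair (a sketch, not part of the original answer):

(df1.merge(df2, how='cross')  # every (interval, time) pair
    .assign(hit=lambda d: d.time.between(d.start_time, d.end_time))
    .groupby(['start_time', 'end_time'], as_index=False)
    .agg(count=('hit', 'sum')))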

The IntervalIndex option should be faster, though, since the end goal is an aggregation: the IntervalIndex route skips creating an intermediate dataframe.

Upvotes: 1

Tbaki

Reputation: 1003

Off the top of my head, I think

df_1["count"] = df_1.apply(lambda x: df_2.time.between(x[0],x[1]).sum(),axis=1)

Should work, though note it rescans all of df_2 for each row of df_1, so the cost grows as len(df_1) × len(df_2).

Output

   start_time  end_time  count
0           0        30      0
1          77       107      1
2         220       250      0
3         268       298      0
4         391       421      2
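To put a rough number on that, the quadratic cost is easy to measure on larger synthetic data (a sketch; the sizes and seed are arbitrary choices, not part of the original answer):

import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)
starts = np.sort(rng.integers(0, 1_000_000, 2_000))
big_1 = pd.DataFrame({'start_time': starts, 'end_time': starts + 30})
big_2 = pd.DataFrame({'time': rng.integers(0, 1_000_000, 2_000)})

t = timeit(lambda: big_1.apply(
        lambda x: big_2.time.between(x['start_time'], x['end_time']).sum(), axis=1),
    number=3)
print(f'apply + between: {t / 3:.3f} s per run')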

Upvotes: 1

Meti

Reputation: 2056

There are plenty of useful built-in functions for doing this. Here I used between (documented in the pandas API reference). The implementation would be like this:

import pandas as pd

df1 = pd.DataFrame({'Start': [0, 77, 391],
                    'End': [30, 107, 421]})
df2 = pd.DataFrame({'time': [100, 392, 399]})

# between is inclusive on both ends; summing the boolean mask counts the matches
df1['Count'] = df1.apply(lambda row: df2.time.between(row['Start'], row['End']).sum(), axis=1)
df1

output:

   Start  End  Count
0      0   30      0
1     77  107      1
2    391  421      2

Upvotes: 2
