Reputation: 23

Pandas Cumulative Sum using Current Row as Condition

I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:

Start time is less than or equal to "this row"'s start time
AND end time is greater than "this row"'s start time

So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.

I've been trying to teach myself pandas to do this, but I am not even sure where to start looking. I can find lots of examples of summing rows that meet a fixed condition like "> 2", but I can't seem to grasp how to iterate over rows and conditionally sum a column based on values in the current row.

Upvotes: 2

Answers (3)

supercooler8

Reputation: 503

import pandas as pd

def counter(s: pd.Series):
    # Count rows that start at or before this row's start and end after it.
    return ((df["start"] <= s["start"]) & (df["end"] > s["start"])).sum()

df["count"] = df.apply(counter, axis=1)

This feels like a simpler approach, using the apply method. Bear in mind that apply with axis=1 calls the function once per row, so it is much slower than vectorized built-ins such as cumsum() and performs roughly like an explicit for loop. Each call also scans the whole frame, so the total work grows quadratically with the number of rows.
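
For small frames, the same condition can also be evaluated for every pair of rows at once with NumPy broadcasting. The following is only a minimal sketch: it assumes columns named start and end as above, the sample values are purely illustrative, and it builds an n x n boolean matrix, so memory grows quadratically and it is not suitable for the full 2 million rows.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (an assumption, not the asker's real table).
df = pd.DataFrame({"start": [2, 5, 3, 6], "end": [10, 8, 8, 9]})

starts = df["start"].to_numpy()
ends = df["end"].to_numpy()

# active[i, j] is True when row j starts at or before row i's start
# and ends strictly after row i's start.
active = (starts[None, :] <= starts[:, None]) & (ends[None, :] > starts[:, None])

# Summing across each row of the matrix gives the per-record count,
# which includes the record itself.
df["count"] = active.sum(axis=1)
print(df)
```

The comparison uses a strict > on the end time, matching the condition stated in the question.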

Upvotes: 0

exp1orer

Reputation: 12049

Here goes. This is going to be SLOW.

Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)

import pandas as pd
df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})
df = df[['start_time', 'end_time']]  # just changing the order of the columns for aesthetics

def overlaps_with_row(row,frame):
    starts_before_mask = frame.start_time <= row.start_time
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row,frame=df,axis=1)

Yields:

In [8]: df
Out[8]: 
   start_time  end_time  number_which_overlap
0           4         7                     3
1           3         5                     2
2           1         3                     1
3           2         8                     2

[4 rows x 3 columns]
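
If the row-by-row version is too slow for the full table, one possible alternative is to sort the start and end times once and use np.searchsorted. This is a sketch under an assumption: every row has end_time >= start_time, so any event that has already ended by a given time must also have started by then. Under that assumption the count is (number of starts <= s) minus (number of ends <= s). The helper name count_active is hypothetical.

```python
import numpy as np
import pandas as pd

def count_active(frame, start_col="start_time", end_col="end_time"):
    """Count, per row, the events active at that row's start time.

    Assumes end >= start for every row (an assumption, not checked here).
    """
    starts = frame[start_col].to_numpy()
    ends = frame[end_col].to_numpy()
    sorted_starts = np.sort(starts)
    sorted_ends = np.sort(ends)
    # Number of events whose start is <= this row's start.
    started = np.searchsorted(sorted_starts, starts, side="right")
    # Number of events whose end is <= this row's start (already finished).
    ended = np.searchsorted(sorted_ends, starts, side="right")
    return started - ended

df = pd.DataFrame({"start_time": [4, 3, 1, 2], "end_time": [7, 5, 3, 8]})
df["number_which_overlap"] = count_active(df)
print(df)
```

Sorting dominates the cost, so this runs in roughly O(n log n) time instead of comparing every row against every other row.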

Upvotes: 1

kimal

Reputation: 692

You can try the code below to get the final result.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[2,10],[5,8],[3,8],[6,9]]),columns=["start","end"])

active_events= {}
for i in df.index:
    active_events[i] = len(df[(df["start"]<=df.loc[i,"start"]) & (df["end"]> df.loc[i,"start"])])
last_columns = pd.DataFrame({'No. active events' : pd.Series(active_events)})

df.join(last_columns)

Upvotes: 1
