user3838505
user3838505

Reputation: 23

Pandas Cumulative Sum using Current Row as Condition

I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:

So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.

I've been trying to teach myself pandas to do this with but I am not even sure where to start looking. I can find lots of examples of summing rows that meet a given condition like "> 2", but can't seem to grasp how to iterate over rows to conditionally sum a column based on values in the current row.

Upvotes: 2

Views: 1837

Answers (3)

supercooler8
supercooler8

Reputation: 503

def counter (s: pd.Series):
return ((df["start"]<= s["start"]) & (df["end"] >= s["start"])).sum()

df["count"] = df.apply(counter , axis = 1)

This feels a lot simpler approach, using the apply method. This doesn't really compromise on speed as the apply function, although not as fast as python native functions like cumsum() or cum, it should be faster than using a for loop.

Upvotes: 0

exp1orer
exp1orer

Reputation: 12029

Here goes. This is going to be SLOW.

Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)

import pandas as pd
df = pd.DataFrame({'start_time': [4,3,1,2],'end_time': [7,5,3,8]})
df = df[['start_time','end_time']] #just changing the order of the columns for aesthetics

def overlaps_with_row(row,frame):
    starts_before_mask = frame.start_time <= row.start_time
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row,frame=df,axis=1)

Yields:

In [8]: df
Out[8]: 
   start_time  end_time  number_which_overlap
0           4         7                     3
1           3         5                     2
2           1         3                     1
3           2         8                     2

[4 rows x 3 columns]

Upvotes: 1

kimal
kimal

Reputation: 692

You can try below code to get the final result.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[2,10],[5,8],[3,8],[6,9]]),columns=["start","end"])

active_events= {}
for i in df.index:
    active_events[i] = len(df[(df["start"]<=df.loc[i,"start"]) & (df["end"]> df.loc[i,"start"])])
last_columns = pd.DataFrame({'No. active events' : pd.Series(active_events)})

df.join(last_columns)

Upvotes: 1

Related Questions