Reputation: 460
I'm trying to batch some data based on a start_date
and end_date
that is conditional of the cumulative sum of which is <= 500000.
Say I have a simple data frame with two columns:
index Date num_books
0 2021-01-01 200000
1 2021-01-02 240000
2 2021-01-03 55000
3 2021-01-04 400000
4 2021-01-05 80000
5 2021-01-06 100000
I need to do a cumulative sum of the values in num_books
until it has <= 500000 and record the start date, end date and the cumsum value. This is an example of what I'm trying to achieve
start_date end_date cumsum_books
2021-01-01 2021-01-03 495000
2021-01-04 2021-01-05 480000
2021-01-06 2021-01-06 100000
Is there an efficient way/function to achieve this? Thank you!
Upvotes: 2
Views: 437
Reputation: 16561
Here's one way:
from io import StringIO as sio
d = sio("""
index Date num_books
0 2021-01-01 200000
1 2021-01-02 240000
2 2021-01-03 55000
3 2021-01-04 400000
4 2021-01-05 80000
5 2021-01-06 100000
""")
import pandas as pd
df = pd.read_csv(d, sep='\s+')
batch_num = 5*10**5
df['batch_num'] = df['num_books'].cumsum()//batch_num
result = df.groupby('batch_num').agg(start_date=('Date', 'min'), end_date=('Date', 'max'), cumsum_books=('num_books','sum'))
print(result)
# start_date end_date cumsum_books
#batch_num
#0 2021-01-01 2021-01-03 495000
#1 2021-01-04 2021-01-05 480000
#2 2021-01-06 2021-01-06 100000
Note that the result
dataframe also contains the entry with more than 500_000
, but it's trivial to drop/filter it out.
Upvotes: 3