Reputation: 7997
I have a list of customers with a "start date" and and "end date". For any given time period, my goal is to find how many customers I have active. A customer is active if their start date is before x and their end date is after x. I've written a brute force version of this:
from datetime import datetime
import pandas as pd
#dates of interest
dates = ['2016-01-31','2016-02-29','2016-03-31','2016-04-30','2016-05-31']
dates = [datetime.strptime(x, '%Y-%m-%d') for x in dates]
#sample records
df = pd.DataFrame( [['A','2016-01-01','2016-04-23'],['B','2016-02-05','2016-04-30'],['C','2016-02-02','2016-05-25']],columns = ['customerId','startDate','endDate'])
df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])
output = []
#is there a better way to do this?
for currDate in dates:
record_count = len(df[(df['startDate']<= currDate) & (df['endDate']>= currDate)])
output.append([currDate,record_count])
output = pd.DataFrame(output, columns = ['date','active count'])
Is there a better way to find how many customers are active between each date of interest? Right now I'm just iterating through all the dates, but that doesn't feel very "pythonic" to me.
Any thoughts or assistance would be appreciated!
Upvotes: 0
Views: 144
Reputation: 16629
One way would be:
In [142]: tf = pd.DataFrame({'dates': dates})
In [143]: tf['active_count'] = tf['dates'].apply(lambda x: df[(df['startDate']<= x) & (df['endDate']>= x)].count())
In [144]: tf
Out[144]:
dates active_count
0 2016-01-31 1
1 2016-02-29 3
2 2016-03-31 3
3 2016-04-30 2
4 2016-05-31 0
Upvotes: 1