Reputation: 4487
I'd like to create a time series (with pandas), to count distinct value of an Id if start and end date are within the considered date.
For sake of legibility, this is a simplified version of the problem.
Let's define the Data this way:
df = pd.DataFrame({
'customerId': [
'1', '1', '1', '2', '2'
],
'id': [
'1', '2', '3', '1', '2'
],
'startDate': [
'2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
],
'endDate': [
'2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
For each customerId, there are several distinct id.
The final aim is to get, for each date
of the period-range, for each customerId
, the count of distinct id
whose start_date
and end_date
matches the function my_date_predicate
.
Simplified definition of my_date_predicate
:
unset_date = pd.to_datetime("1900-01")
def my_date_predicate(date, row):
return row.startDate <= date and \
(row.endDate.equals(unset_date) or row.endDate > date)
I'd like a time series result like this:
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 0
How could I use pandas to get such result?
Upvotes: 3
Views: 598
Reputation: 29635
You can do it with 2 pivot_table
to get the count
of id per customer in column per start date (and end date) in index. reindex
each one with the period_date you are interested in. Substract the pivot for end from the pivot for start. Use cumsum
to get the cumulative some of id per customer id. Finally use stack
and reset_index
to bring to the wanted shape.
#convert to period columns like period_date
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')
#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
aggfunc='count', fill_value=0)
.reindex(period_range, fill_value=0)
)
pve = (df.pivot_table(index='endDate', columns='customerId', values='id',
aggfunc='count', fill_value=0)
.reindex(period_range, fill_value=0)
)
print (pvs)
customerId 1 2
2000-01 2 0 #two id for customer 1 that start at this month
2000-02 0 0
2000-03 0 0
2000-04 1 0
2000-05 0 1 #one id for customer 2 that start at this month
2000-06 0 1
2000-07 0 0
Now you can substract one to the other and use cumsum
to get the wanted amount per date.
res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 1
Note really sure how to handle the unset_date
as I don't see what is used for
Upvotes: 1
Reputation: 12493
Here's a solution:
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", closed = "left"), axis=1)
df = df.explode("month")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})
The result is:
count
month customerId
2000-01-01 1 2
2 0
2000-02-01 1 1
2 0
2000-03-01 1 1
2 0
2000-04-01 1 2
2 0
2000-05-01 1 2
2 1
2000-06-01 1 2
2 2
2000-07-01 1 1
2 1
Note:
Upvotes: 2