Reputation: 163
I am trying to calculate in Pandas a rolling window over one date column and count the distinct values in another column. Let's say I have this df
dataframe:
date customer
2020-01-01 A
2020-01-02 A
2020-01-02 B
2020-01-03 A
2020-01-03 C
2020-01-03 D
2020-01-04 E
I would like to group by the date
column, create a rolling window of two days and count the distinct values in the column customer
. The expected output would be something like:
date distinct_customers
2020-01-01 NaN --> (first value)
2020-01-02 2.0 --> (distinct customers between 2020-01-01 and 2020-01-02: [A, B])
2020-01-03 4.0 --> (distinct customers between 2020-01-02 and 2020-01-03: [A, B, C, D])
2020-01-04 4.0 --> (distinct customers between 2020-01-03 and 2020-01-04: [A, C, D, E])
It seems easy but I don't seem to find any straight-forward way to achieve that, I've tried using groupby
or rolling
. I don't find other posts solving this issue. Does someone have any idea how to do this? Thanks a lot in advance!
Upvotes: 2
Views: 300
Reputation: 163
Based on the idea of @Musulmon, this one liner should do it:
pd.crosstab(df['date'], df['customer']).rolling(2).sum().clip(0,1).sum(axis=1)
Thanks!
Upvotes: 1