Reputation: 113
Having trouble calculating rolling 7 day unique users, by group in a group-user-date dataset. It's a classic metric and figured someone could help me do this in pandas.
Example data:
from StringIO import StringIO
import pandas as pd
data = StringIO("""grp1,user,date
a,1,2016-10-10
a,1,2016-10-09
a,1,2016-10-07
a,2,2016-10-09
a,2,2016-10-06
a,3,2016-10-10
a,3,2016-10-09
""")
df = pd.read_csv(data)
For this simple dataset, I want to return:
a, 2016-10-10, 3 <- 3 users were in group a in the 7 days ending 10/10
a, 2016-10-09, 3 <- 3 users were in group a in the 7 days ending 10/09
a, 2016-10-07, 2 <- 2 users were in group a in the 7 days ending 10/07
a, 2016-10-06, 1 <- 1 users were in group a in the 7 days ending 10/06
I don't mind if it's a transform of the original dataset or an aggregation.
Have tried 1) a lot of searching and 2) a lot of variations of
from datetime import datetime, timedelta
rolling_uniques = lambda x: x['user'].unique().size if x['date'] + timedelta(days=6) <= x['date'].max() else 0
df.apply(rolling_uniques, axis=1)
OR
df.groupby(['grp1', 'user', 'date']).transform(rolling_uniques)
but nothing is working out. In my data I have multiple group columns and of course more categories within grp1 than just 'a'.
Upvotes: 1
Views: 100
Reputation: 21898
I don't now if it is the exactly the expected result, but I think it can help you. Let me know.
# Test data
data = io.StringIO("""grp1,user,date
a,1,2016-10-10
a,1,2016-10-09
a,1,2016-10-07
a,2,2016-10-09
a,2,2016-10-06
a,3,2016-10-10
a,3,2016-10-09
b,1,2016-10-09
b,2,2016-10-10
""")
df = pd.read_csv(data)
df['date'] = pd.to_datetime(df['date'])
# Setting and sorting the index
df.set_index('date', inplace=True)
df.sort_index(inplace=True)
# Resampling data by preserving the group
df = df.groupby([df.index.to_period('D'), df['grp1']]).sum()
df = df.unstack('grp1')
df = df.resample('D').sum().fillna(0)
# Computing the rolling sum
df = df.rolling(7, min_periods=0).sum()
# Formatting
df = df.stack()
df = df.swaplevel(0,1)
print(df)
# user
# grp1 date
# a 2016-10-06 2.0
# b 2016-10-06 0.0
# a 2016-10-07 3.0
# b 2016-10-07 0.0
# a 2016-10-08 3.0
# b 2016-10-08 0.0
# a 2016-10-09 9.0
# b 2016-10-09 1.0
# a 2016-10-10 13.0
# b 2016-10-10 3.0
Upvotes: 1