Scott N
Scott N

Reputation: 113

rolling uniques across groups + time in pandas

Having trouble calculating rolling 7 day unique users, by group in a group-user-date dataset. It's a classic metric and figured someone could help me do this in pandas.

Example data:

from StringIO import StringIO
import pandas as pd

data = StringIO("""grp1,user,date
    a,1,2016-10-10
    a,1,2016-10-09
    a,1,2016-10-07
    a,2,2016-10-09
    a,2,2016-10-06
    a,3,2016-10-10
    a,3,2016-10-09
    """)

df = pd.read_csv(data)

For this simple dataset, I want to return:

    a, 2016-10-10, 3  <- 3 users were in group a in the 7 days ending 10/10
    a, 2016-10-09, 3  <- 3 users were in group a in the 7 days ending 10/09
    a, 2016-10-07, 2  <- 2 users were in group a in the 7 days ending 10/07
    a, 2016-10-06, 1  <- 1 users were in group a in the 7 days ending 10/06

I don't mind if it's a transform of the original dataset or an aggregation.

Have tried 1) a lot of searching and 2) a lot of variations of

from datetime import datetime, timedelta

rolling_uniques = lambda x: x['user'].unique().size if x['date'] + timedelta(days=6) <= x['date'].max() else 0

df.apply(rolling_uniques, axis=1)

OR

df.groupby(['grp1', 'user', 'date']).transform(rolling_uniques)

but nothing is working out. In my data I have multiple group columns and of course more categories within grp1 than just 'a'.

Upvotes: 1

Views: 100

Answers (1)

Romain
Romain

Reputation: 21898

I don't now if it is the exactly the expected result, but I think it can help you. Let me know.

# Test data
data = io.StringIO("""grp1,user,date
    a,1,2016-10-10
    a,1,2016-10-09
    a,1,2016-10-07
    a,2,2016-10-09
    a,2,2016-10-06
    a,3,2016-10-10
    a,3,2016-10-09
    b,1,2016-10-09
    b,2,2016-10-10
    """)


df = pd.read_csv(data)
df['date'] = pd.to_datetime(df['date'])
# Setting and sorting the index
df.set_index('date', inplace=True)
df.sort_index(inplace=True)

# Resampling data by preserving the group
df = df.groupby([df.index.to_period('D'), df['grp1']]).sum()
df = df.unstack('grp1')
df = df.resample('D').sum().fillna(0)
# Computing the rolling sum
df = df.rolling(7, min_periods=0).sum()

# Formatting
df = df.stack()
df = df.swaplevel(0,1)

print(df)
#                   user
# grp1  date            
#     a 2016-10-06   2.0
#     b 2016-10-06   0.0
#     a 2016-10-07   3.0
#     b 2016-10-07   0.0
#     a 2016-10-08   3.0
#     b 2016-10-08   0.0
#     a 2016-10-09   9.0
#     b 2016-10-09   1.0
#     a 2016-10-10  13.0
#     b 2016-10-10   3.0

Upvotes: 1

Related Questions