How to count unique occurrences grouping by changing time period?

Question

This follows from my previous question, which I have split up and made more precise as suggested by the user who answered it: Pandas: How to group and aggregate data starting from constant and ending on changing date?

I need to aggregate data between a constant date, such as the first day of the year, and every other date through the year. The aggregation is a count of unique values in the given time period, grouped by date. In other words, it counts unique users to date since the beginning of either the month or the year.

date             value
01-01-2012       a
02-01-2012       b
05-01-2012       c
05-01-2012       c

01-02-2012       a
02-02-2012       a
02-02-2012       b
05-02-2012       d

which should output:

date             Month to date unique  Year to date unique
01-01-2012       1                     1
02-01-2012       2                     2
05-01-2012       3                     3
01-02-2012       1                     3
02-02-2012       2                     3
05-02-2012       3                     4

The data is, of course, in a Pandas data frame. The obvious but very clumsy way is to write a for loop between the starting date and the moving one. The problem looks like a common one. Is there a reasonable built-in pandas way to do this type of computation? As for counting unique values, I also want to avoid stacking lists, as I have a large number of rows and unique values, but I'm not sure if it's possible.
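For reference, the loop-based approach mentioned above might look roughly like the sketch below. It is only an illustration of the baseline, not code from the original post: the frame is rebuilt from the sample data, and the names `rows` and `naive` are made up here. It rescans the frame for every date, which is why it scales poorly.

```python
import pandas as pd

# Sample data from the question (dates are day-month-year).
df = pd.DataFrame({
    "date": ["01-01-2012", "02-01-2012", "05-01-2012", "05-01-2012",
             "01-02-2012", "02-02-2012", "02-02-2012", "05-02-2012"],
    "value": ["a", "b", "c", "c", "a", "a", "b", "d"],
})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")

rows = []  # hypothetical accumulator, one dict per distinct date
for day in sorted(df["date"].unique()):
    day = pd.Timestamp(day)
    month_start = day.replace(day=1)
    year_start = day.replace(month=1, day=1)
    # Each iteration filters the whole frame again: roughly quadratic work.
    mtd = df.loc[df["date"].between(month_start, day), "value"].nunique()
    ytd = df.loc[df["date"].between(year_start, day), "value"].nunique()
    rows.append({"date": day,
                 "Month to date unique": mtd,
                 "Year to date unique": ytd})

naive = pd.DataFrame(rows)
print(naive)
```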

Date format does not matter at all here.

I was checking out Pandas window functions, but they don't look like a solution.

jezrael · Accepted Answer

I changed the previous solution to use the transform function, and then removed duplicates per date:

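import pandas as pd

df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
df = df.sort_values('date')

# running count of distinct values: add 1 at each first occurrence
f = lambda x: (~x.duplicated()).cumsum()

# group by year-month period, so the same month of different years is not mixed
df["Month to date sum"] = df.groupby(df["date"].dt.to_period("M"))["value"].transform(f)
df["Year to date sum"] = df.groupby(df["date"].dt.year)["value"].transform(f)
# keep the last row of each date, which holds the full count for that date
df = df.drop_duplicates('date', keep='last').drop('value', axis=1)
print(df)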
        date  Month to date sum  Year to date sum
0 2012-01-01                  1                 1
1 2012-01-02                  2                 2
3 2012-01-05                  3                 3
4 2012-02-01                  1                 3
6 2012-02-02                  2                 3
7 2012-02-05                  3                 4
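
To see why the `(~x.duplicated()).cumsum()` expression gives a running count of distinct values, it may help to look at the intermediate steps on a single group. The sketch below uses the January values from the question; the variable names `is_first` and `running_unique` are chosen here for illustration only.

```python
import pandas as pd

# January values from the question, in date order.
s = pd.Series(["a", "b", "c", "c"])

# True only at the first occurrence of each value.
is_first = ~s.duplicated()

# Each first occurrence increases the running count by one;
# repeated values leave it unchanged.
running_unique = is_first.cumsum()

print(pd.DataFrame({"value": s,
                    "is_first": is_first,
                    "running_unique": running_unique}))
```

Because the count only grows at first occurrences, taking the last row of each date yields the number of distinct values seen in the group up to and including that date.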
