Reputation: 258
This follows on from my previous question, split out and made more precise as suggested by an answering user: Pandas: How to group and aggregate data starting from constant and ending on changing date?
I need to aggregate data between a constant date, such as the first day of the year, and every other date through the year. The aggregation is a count of unique values in the given time period, grouped by date. In other words, it counts unique users to date since the beginning of either the month or the year.
date value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
02-02-2012 b
05-02-2012 d
which should output:
date        Month to date unique  Year to date unique
01-01-2012  1                     1
02-01-2012  2                     2
05-01-2012  3                     3
01-02-2012  1                     3
02-02-2012  2                     3
05-02-2012  3                     4
The data is, of course, in a Pandas data frame. The obvious but very clumsy way is to write a for loop running from each starting date to the moving end date. The problem looks like a common one. Is there a reasonable built-in pandas way to do this kind of computation? As for counting unique values, I would also like to avoid stacking lists, since I have a large number of rows and unique values, but I'm not sure if that's possible.
Date format does not matter at all here.
I looked at the Pandas window functions, but they don't look like a solution.
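For reference, the loop-based approach mentioned above could be sketched roughly as follows. This is only a baseline for comparison, not a recommended solution: it rescans the frame once per distinct date. The function name `naive_to_date_unique` is hypothetical, and the output column names are taken from the expected table.

```python
import pandas as pd

# Sample data from the question (day-month-year strings).
df = pd.DataFrame({
    "date": ["01-01-2012", "02-01-2012", "05-01-2012", "05-01-2012",
             "01-02-2012", "02-02-2012", "02-02-2012", "05-02-2012"],
    "value": ["a", "b", "c", "c", "a", "a", "b", "d"],
})
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")


def naive_to_date_unique(frame):
    """Loop baseline (hypothetical helper): one scan of the frame per distinct date."""
    rows = []
    for day in frame["date"].drop_duplicates().sort_values():
        # All rows up to and including the current date.
        upto = frame[frame["date"] <= day]
        # Restrict to the current year, then to the current month of that year.
        same_year = upto[upto["date"].dt.year == day.year]
        same_month = same_year[same_year["date"].dt.month == day.month]
        rows.append({
            "date": day,
            "Month to date unique": same_month["value"].nunique(),
            "Year to date unique": same_year["value"].nunique(),
        })
    return pd.DataFrame(rows)


baseline = naive_to_date_unique(df)
print(baseline)
```

The cost grows with the number of distinct dates times the number of rows, which is why a vectorized alternative is worth looking for.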
Upvotes: 2
Views: 333
Reputation: 862911
I changed the previous solution to use the transform function and then remove duplicates per date:
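import pandas as pd

# df is the question's frame with columns "date" and "value"
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")
df = df.sort_values('date')
# running count of first occurrences = cumulative number of unique values
f = lambda x: (~x.duplicated()).cumsum()
# group by year-month so the same month of different years is not merged
df["Month to date sum"] = df.groupby(df["date"].dt.to_period("M"))["value"].transform(f)
df["Year to date sum"] = df.groupby(df["date"].dt.year)["value"].transform(f)
# keep the last row of each date, which holds that day's final counts
df = df.drop_duplicates('date', keep='last').drop('value', axis=1)
print(df)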
        date  Month to date sum  Year to date sum
0 2012-01-01                  1                 1
1 2012-01-02                  2                 2
3 2012-01-05                  3                 3
4 2012-02-01                  1                 3
6 2012-02-02                  2                 3
7 2012-02-05                  3                 4
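A self-contained sketch of the same idea, wrapped in a function so it can be run end to end on the question's sample. The function name `to_date_unique`, its parameters, and the use of `dt.to_period("M")` as the month key are assumptions of this sketch; the year-month key is chosen so that the same calendar month in different years is counted separately.

```python
import pandas as pd


def to_date_unique(frame, date_col="date", value_col="value"):
    """Cumulative distinct counts per month and per year, one row per date.

    Hypothetical helper; expects `date_col` to already be datetime64.
    """
    out = frame.sort_values(date_col).copy()
    # True on the first occurrence of a value within a group;
    # the cumulative sum is the running number of distinct values.
    first_seen = lambda s: (~s.duplicated()).cumsum()
    month_key = out[date_col].dt.to_period("M")  # year-month period
    year_key = out[date_col].dt.year
    out["Month to date unique"] = out.groupby(month_key)[value_col].transform(first_seen)
    out["Year to date unique"] = out.groupby(year_key)[value_col].transform(first_seen)
    # The last row of each date carries that day's final counts.
    return (out.drop_duplicates(date_col, keep="last")
               .drop(columns=value_col)
               .reset_index(drop=True))


# Sample data from the question.
sample = pd.DataFrame({
    "date": ["01-01-2012", "02-01-2012", "05-01-2012", "05-01-2012",
             "01-02-2012", "02-02-2012", "02-02-2012", "05-02-2012"],
    "value": ["a", "b", "c", "c", "a", "a", "b", "d"],
})
sample["date"] = pd.to_datetime(sample["date"], format="%d-%m-%Y")

result = to_date_unique(sample)
print(result)
```

Because the running count comes from `duplicated()` and `cumsum()`, no per-date lists or sets of values are built, which should help with a large number of rows and unique values.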
Upvotes: 2