Reputation: 4842
I have a df like this:
Country product date_install date_purchase id
BR yearly 2020-11-01-01:11:36 2020-11-01-01:11:26 10660236
CA monthly 2020-11-01-01:11:49 2020-11-01-01:11:32 10649441
US yearly 2020-11-01-01:11:54 2020-11-01-01:11:33 10660272
IT monthly 2020-11-01-11:11:01 2020-11-01-01:11:34 10657634
AE monthly 2020-11-01-01:11:38 2020-11-01-01:11:39 10661442
US NaN 2021-01-12-03:01:31 NaN 12815946
CA NaN 2020-12-04-02:12:48 NaN 11647714
US NaN 2020-12-28-11:12:54 NaN 12323174
ID NaN 2021-02-02-01:02:58 NaN 13714980
US NaN 2020-11-15-10:11:05 NaN 11056138
I am trying to get this:
country product installs purchases
BR yearly 1 1
BR NaN 100 0 # people who installed but not purchased
CA monthly 1 1
US yearly 10 10
US monthly 15 15
US NaN 500 0 # people who installed but not purchased
Or, even better, would be:
country installs yearly monthly total
BR 1000 10 100 110
CA 2000 50 5 55
I tried:
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)
But then all the values are the same, matching the number of purchases, which would mean that everyone who installs also purchases:
date_install date_purchase id
country product
US monthly 3373 3373 3373
AU monthly 1478 1478 1478
US yearly 954 954 954
And if I use:
df = df.replace(np.nan, 'empty', regex=True)
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)
I get:
date_install date_purchase id
country product
US empty 480153 480153 480153
AU empty 334236 334236 334236
BR empty 144920 144920 144920
How can I achieve this result?
Upvotes: 1
Views: 306
Reputation: 1501
Indeed, if you follow @Paul Brennan's advice, the solution becomes much easier. As an example, consider the following data:
Country product date_install date_purchase id
0 BR yearly 2020-01-01-01:00:00 2020-01-01-01:00:00 10660236
3 BR monthly 2020-01-01-04:00:00 2020-01-01-04:00:00 10660239
6 BR NaN 2020-01-01-07:00:00 NaN 10660242
9 BR NaN 2020-01-01-10:00:00 NaN 10660245
1 CA yearly 2020-01-01-02:00:00 2020-01-01-02:00:00 10660237
4 CA yearly 2020-01-01-05:00:00 2020-01-01-05:00:00 10660240
7 CA NaN 2020-01-01-08:00:00 NaN 10660243
10 CA yearly 2020-01-01-11:00:00 2020-01-01-11:00:00 10660246
2 US monthly 2020-01-01-03:00:00 2020-01-01-03:00:00 10660238
5 US NaN 2020-01-01-06:00:00 NaN 10660241
8 US monthly 2020-01-01-09:00:00 2020-01-01-09:00:00 10660244
11 US monthly 2020-01-01-12:00:00 2020-01-01-12:00:00 10660247
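For reproducibility, a minimal sketch that rebuilds this example frame from the values above (df.sort_values('Country') reproduces the listing):
import pandas as pd
import numpy as np

# Reconstruction of the example data shown above
df = pd.DataFrame({
    'Country': ['BR', 'CA', 'US'] * 4,
    'product': ['yearly', 'yearly', 'monthly', 'monthly', 'yearly', np.nan,
                np.nan, np.nan, 'monthly', np.nan, 'yearly', 'monthly'],
    'date_install': [f'2020-01-01-{h:02d}:00:00' for h in range(1, 13)],
    'id': range(10660236, 10660248),
})
# date_purchase mirrors date_install where a product was bought, NaN otherwise
df['date_purchase'] = df['date_install'].where(df['product'].notna())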
Let's say the 'not purchased' version is a demo or something like that:
df['product'] = df['product'].fillna('demo')
You can then do the following:
ans = (df.groupby([df['Country'], df['product']])
         .apply(len)                     # number of rows per (Country, product) group
         .unstack()                      # spread the products into columns
         .fillna(0)                      # absent combinations become 0
         .astype(int)
         .rename_axis(columns='', index='')
         .assign(installed=lambda x: x[['demo', 'monthly', 'yearly']].sum(axis=1),
                 purchased=lambda x: x[['monthly', 'yearly']].sum(axis=1))
       )
And the resulting dataframe is as follows:
demo monthly yearly installed purchased
BR 2 1 1 4 2
CA 1 0 3 4 3
US 1 3 0 4 3
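Two equivalent sketches, in case they read better: pd.crosstab builds the same count table, and with pandas >= 1.1 groupby's dropna=False keeps the NaN products as their own group, so the fillna step isn't needed:
# Same count table as groupby + unstack + fillna (run after the fillna step)
counts = pd.crosstab(df['Country'], df['product'])

# pandas >= 1.1: NaN products kept as their own column, no fillna needed
counts_nan = (df.groupby(['Country', 'product'], dropna=False)
                .size()
                .unstack(fill_value=0))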
Answering the question from your comment: you can't keep the dates of each user, since groupby summarizes the rows and those individual details are lost.
What you can do is map the desired columns from the resulting dataframe back onto the first one (which duplicates the values across each country's rows), like:
df = df.assign(purchased=df['Country'].map(ans['purchased']),
               installed=df['Country'].map(ans['installed']))
This will make your first dataframe look like:
Country product date_install date_purchase id purchased installed
0 BR yearly 2020-01-01-01:00:00 2020-01-01-01:00:00 10660236 2 4
1 CA yearly 2020-01-01-02:00:00 2020-01-01-02:00:00 10660237 3 4
2 US monthly 2020-01-01-03:00:00 2020-01-01-03:00:00 10660238 3 4
...
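If all you need are those per-country totals on the original frame, a groupby/transform sketch gives the same result as the map above (assuming date_install is never null, so its count equals the install total):
# transform('count') counts non-null values per country, broadcast to every row
df['installed'] = df.groupby('Country')['date_install'].transform('count')
df['purchased'] = df.groupby('Country')['date_purchase'].transform('count')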
If this is not what you wanted, please let us know and we'll try to figure it out.
Upvotes: 1