Jonas Palačionis

Reputation: 4842

Counting NaN values in pandas group by

I have a df like this:

Country     product                 date_install                date_purchase           id
BR          yearly                  2020-11-01-01:11:36         2020-11-01-01:11:26     10660236
CA          monthly                 2020-11-01-01:11:49         2020-11-01-01:11:32     10649441
US          yearly                  2020-11-01-01:11:54         2020-11-01-01:11:33     10660272
IT          monthly                 2020-11-01-11:11:01         2020-11-01-01:11:34     10657634
AE          monthly                 2020-11-01-01:11:38         2020-11-01-01:11:39     10661442
US          NaN                     2021-01-12-03:01:31         NaN                     12815946
CA          NaN                     2020-12-04-02:12:48         NaN                     11647714
US          NaN                     2020-12-28-11:12:54         NaN                     12323174
ID          NaN                     2021-02-02-01:02:58         NaN                     13714980
US          NaN                     2020-11-15-10:11:05         NaN                     11056138

I am trying to get this:

country     product     installs        purchases
BR          yearly      1               1
BR          NaN         100             0 # people who installed but not purchased
CA          monthly     1               1
US          yearly      10              10
US          monthly     15              15
US          NaN         500             0 # people who installed but not purchased

Or, even better, would be:

country     installs    yearly  monthly  total
BR          1000        10      100      110
CA          2000        50      5        55

I tried:

df.groupby(['country','product']).count().sort_values('date_install',ascending=False)

But then all the counts are identical and match the number of purchases, which would imply that everyone who installs also purchases.

                    date_install    date_purchase   id
country product         
US      monthly     3373            3373            3373
AU      monthly     1478            1478            1478
US      yearly      954             954             954

And if I use:

df = df.replace(np.nan, 'empty', regex=True)
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)

I get:

                    date_install    date_purchase   id
country product         
US      empty       480153          480153          480153
AU      empty       334236          334236          334236
BR      empty       144920          144920          144920

How can I achieve this result?

Upvotes: 1

Views: 306

Answers (1)

Ralubrusto

Reputation: 1501

Indeed, if you follow @Paul Brennan's advice, the solution becomes quite a bit easier. As an example, consider the following data:

   Country  product         date_install        date_purchase        id
0       BR   yearly  2020-01-01-01:00:00  2020-01-01-01:00:00  10660236
3       BR  monthly  2020-01-01-04:00:00  2020-01-01-04:00:00  10660239
6       BR      NaN  2020-01-01-07:00:00                  NaN  10660242
9       BR      NaN  2020-01-01-10:00:00                  NaN  10660245
1       CA   yearly  2020-01-01-02:00:00  2020-01-01-02:00:00  10660237
4       CA   yearly  2020-01-01-05:00:00  2020-01-01-05:00:00  10660240
7       CA      NaN  2020-01-01-08:00:00                  NaN  10660243
10      CA   yearly  2020-01-01-11:00:00  2020-01-01-11:00:00  10660246
2       US  monthly  2020-01-01-03:00:00  2020-01-01-03:00:00  10660238
5       US      NaN  2020-01-01-06:00:00                  NaN  10660241
8       US  monthly  2020-01-01-09:00:00  2020-01-01-09:00:00  10660244
11      US  monthly  2020-01-01-12:00:00  2020-01-01-12:00:00  10660247

Let's say the 'not purchased' version is a demo or something like that:

df['product'] = df['product'].fillna('demo')

You can then do the following:

ans = (df.groupby(['Country', 'product'])
       .size()
       .unstack()
       .fillna(0)
       .astype(int)
       .rename_axis(columns='', index='')
       .assign(installed=lambda x: x[['demo', 'monthly', 'yearly']].sum(axis=1),
               purchased=lambda x: x[['monthly', 'yearly']].sum(axis=1))
       )

And the resulting dataframe is as follows:

    demo  monthly  yearly  installed  purchased
                                               
BR     2        1       1          4          2
CA     1        0       3          4          3
US     1        3       0          4          3
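As a side note, if you are on pandas 1.1 or later you can skip the `fillna('demo')` step entirely: `groupby` accepts a `dropna=False` argument that keeps the NaN product group instead of silently dropping it. A minimal sketch under that assumption (the sample data here is illustrative, not the asker's real dataset):

```python
import pandas as pd
import numpy as np

# Small sample mirroring the structure above; values are illustrative.
df = pd.DataFrame({
    'Country': ['BR', 'BR', 'CA', 'CA', 'US', 'US'],
    'product': ['yearly', np.nan, 'yearly', np.nan, 'monthly', 'monthly'],
    'date_purchase': ['2020-01-01', np.nan, '2020-01-02', np.nan,
                      '2020-01-03', '2020-01-04'],
})

# dropna=False (pandas >= 1.1) keeps the NaN product group,
# so installs without a purchase are not lost from the counts.
counts = (df.groupby(['Country', 'product'], dropna=False)
            .size()
            .unstack(fill_value=0))
```

Here the NaN group shows up as a column labelled NaN, which you can rename afterwards if you prefer a friendlier label.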

Answering your comment's question: you can't keep the dates for each user, since groupby summarizes all the information and loses those individual details.

What you can do is assign the desired columns from the resulting dataframe back to the original one (duplicating some values), like:

df = df.assign(purchased=df['Country'].map(ans['purchased']),
               installed=df['Country'].map(ans['installed']))

This will make your first dataframe look like:

   Country  product         date_install        date_purchase        id  purchased  installed
0       BR   yearly  2020-01-01-01:00:00  2020-01-01-01:00:00  10660236          2          4
1       CA   yearly  2020-01-01-02:00:00  2020-01-01-02:00:00  10660237          3          4
2       US  monthly  2020-01-01-03:00:00  2020-01-01-03:00:00  10660238          3          4
.
.
.
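The same per-country totals can also be attached in one step with `groupby(...).transform`, which computes a group-level value while keeping every row (and therefore all the dates). A hedged sketch, not part of the original answer, where a purchase is taken to mean a non-null `date_purchase` (sample data is illustrative):

```python
import pandas as pd
import numpy as np

# Illustrative sample with the same shape as the question's data.
df = pd.DataFrame({
    'Country': ['BR', 'BR', 'CA', 'US', 'US'],
    'date_purchase': ['2020-01-01', np.nan, np.nan, '2020-01-03', np.nan],
})

grp = df.groupby('Country')['date_purchase']
df['installed'] = grp.transform('size')   # all rows in the country group
df['purchased'] = grp.transform('count')  # only non-null purchase dates
```

This avoids building the intermediate `ans` frame and mapping it back, at the cost of recomputing the counts directly on the original dataframe.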

If this is not what you wanted, please let us know and we'll try to figure it out.

Upvotes: 1
