Rabbit K
Rabbit K

Reputation: 45

How to use group by and return rows with null values

I have a data set like below on emails and purchases.

Email          Purchaser    order_id    amount 
[email protected]    [email protected]    1         5
[email protected]         
[email protected]    [email protected]    2         10
[email protected]    [email protected]    3         5

I want to find the total number of people in the data set, the number of people who purchased and the total number of orders and total revenue amount. I know how to do it via SQL using left join and aggregate functions but I do not know how to replicate this using Python/pandas.

For Python, I attempted this using pandas and numpy:

table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})

table1.agg({'Email': 'count', 'Purchaser': 'count', 'amount': np.sum, 'order_id': 'count'})

The problem is - it is only returning the rows with an order (1st row and 3rd) but not the other ones (2nd row)

Email          Purchaser      order_id    amount 
[email protected]    [email protected]    1           5
[email protected]    [email protected]    2           15

The SQL query should look like this:

SELECT count(Email) as num_ind, count(Purchaser) as num_purchasers, sum(order) as orders , sum(amount) as revenue
    FROM
        (SELECT Email, Purchaser, count(order_id) as order, sum(amount) as amount
        FROM table 1 
        GROUP BY Email, Purchaser) x 

How can I replicate it in Python?

Upvotes: 4

Views: 4249

Answers (1)

jezrael
jezrael

Reputation: 863166

It is not implemented in pandas now - see.

So one awful solution is replace NaN to some string and after agg replace back to NaN:

table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy')
print table
         Email    Purchaser  order_id  amount
0  [email protected]  [email protected]         1       5
1  [email protected]          NaN       NaN     NaN
2  [email protected]  [email protected]         2      10
3  [email protected]  [email protected]         3       5

table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy')
print table
         Email    Purchaser  order_id  amount
0  [email protected]  [email protected]         1       5
1  [email protected]        dummy       NaN     NaN
2  [email protected]  [email protected]         2      10
3  [email protected]  [email protected]         3       5

table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})
print table1
                         order_id  amount
Email       Purchaser                    
[email protected] [email protected]         1       5
[email protected] dummy               0     NaN
[email protected] [email protected]         2      15

table1 = table1.reset_index()
table1['Purchaser'] = table1['Purchaser'].replace('dummy', np.nan)
print table1
         Email    Purchaser  order_id  amount
0  [email protected]  [email protected]         1       5
1  [email protected]          NaN         0     NaN
2  [email protected]  [email protected]         2      15

Upvotes: 4

Related Questions