How to use group by and return rows with null values

Question

I have a data set like below on emails and purchases.

Email          Purchaser    order_id    amount 
a@gmail.com    a@gmail.com    1         5
b@gmail.com         
c@gmail.com    c@gmail.com    2         10
c@gmail.com    c@gmail.com    3         5

I want to find the total number of people in the data set, the number of people who purchased, the total number of orders, and the total revenue. I know how to do this in SQL with a left join and aggregate functions, but I do not know how to replicate it in Python/pandas.

For Python, I attempted this using pandas and numpy:

table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})

table1.agg({'Email': 'count', 'Purchaser': 'count', 'amount': np.sum, 'order_id': 'count'})

The problem is that it returns only the people with an order (a@gmail.com and c@gmail.com) and drops the one without (b@gmail.com):

Email          Purchaser      order_id    amount 
a@gmail.com    a@gmail.com    1           5
c@gmail.com    c@gmail.com    2           15

The SQL query should look like this:

SELECT count(Email) as num_ind, count(Purchaser) as num_purchasers, sum(order) as orders , sum(amount) as revenue
    FROM
        (SELECT Email, Purchaser, count(order_id) as order, sum(amount) as amount
        FROM table 1 
        GROUP BY Email, Purchaser) x

How can I replicate it in Python?

jezrael · Accepted Answer

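Grouping on NaN keys is not implemented in pandas now: groupby drops every row whose group key is NaN, which is why the b@gmail.com row disappears.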
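A small sketch of that behaviour, assuming a hand-built copy of the question's data (the DataFrame construction and the name emails_kept are mine, not from the question):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the question's data; b@gmail.com has no purchase.
table = pd.DataFrame({
    'Email': ['a@gmail.com', 'b@gmail.com', 'c@gmail.com', 'c@gmail.com'],
    'Purchaser': ['a@gmail.com', np.nan, 'c@gmail.com', 'c@gmail.com'],
    'order_id': [1, np.nan, 2, 3],
    'amount': [5, np.nan, 10, 5],
})

# Default groupby: the row whose 'Purchaser' key is NaN forms no group at all.
grouped = table.groupby(['Email', 'Purchaser']).agg({'amount': 'sum', 'order_id': 'count'})

# Only the emails that have a non-null Purchaser survive.
emails_kept = grouped.index.get_level_values('Email').tolist()
print(emails_kept)
```

b@gmail.com is missing from the printed list because its Purchaser key is NaN.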

So one awful workaround is to replace NaN with a placeholder string before grouping and, after agg, replace the placeholder back with NaN:

# Original data: b@gmail.com has no Purchaser
print(table)
         Email    Purchaser  order_id  amount
0  a@gmail.com  a@gmail.com         1       5
1  b@gmail.com          NaN       NaN     NaN
2  c@gmail.com  c@gmail.com         2      10
3  c@gmail.com  c@gmail.com         3       5

# Swap NaN keys for a placeholder so groupby keeps the row
table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy')
print(table)
         Email    Purchaser  order_id  amount
0  a@gmail.com  a@gmail.com         1       5
1  b@gmail.com        dummy       NaN     NaN
2  c@gmail.com  c@gmail.com         2      10
3  c@gmail.com  c@gmail.com         3       5

# Aggregate per (Email, Purchaser); the 'dummy' group is now kept
table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})
print(table1)
                         order_id  amount
Email       Purchaser                    
a@gmail.com a@gmail.com         1       5
b@gmail.com dummy               0     NaN
c@gmail.com c@gmail.com         2      15

# Turn the index back into columns and restore NaN in place of the placeholder
table1 = table1.reset_index()
table1['Purchaser'] = table1['Purchaser'].replace('dummy', np.nan)
print(table1)
         Email    Purchaser  order_id  amount
0  a@gmail.com  a@gmail.com         1       5
1  b@gmail.com          NaN         0     NaN
2  c@gmail.com  c@gmail.com         2      15
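
For the overall totals the question asks for, here is a hedged sketch rather than a definitive answer. It swaps the placeholder trick for groupby's dropna=False argument, which assumes pandas 1.1 or later, and it rebuilds the question's data by hand. The names per_person and summary are mine; the keys mirror the aliases in the SQL query.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the question's data; b@gmail.com has no purchase.
table = pd.DataFrame({
    'Email': ['a@gmail.com', 'b@gmail.com', 'c@gmail.com', 'c@gmail.com'],
    'Purchaser': ['a@gmail.com', np.nan, 'c@gmail.com', 'c@gmail.com'],
    'order_id': [1, np.nan, 2, 3],
    'amount': [5, np.nan, 10, 5],
})

# Inner query: one row per (Email, Purchaser).
# Assumption: pandas >= 1.1, where dropna=False keeps groups whose key is NaN.
per_person = (
    table.groupby(['Email', 'Purchaser'], dropna=False)
         .agg(orders=('order_id', 'count'), amount=('amount', 'sum'))
         .reset_index()
)

# Outer query: count() skips NaN, just like SQL count(column).
summary = {
    'num_ind': int(per_person['Email'].count()),
    'num_purchasers': int(per_person['Purchaser'].count()),
    'orders': int(per_person['orders'].sum()),
    'revenue': float(per_person['amount'].sum()),
}
print(summary)
```

On the sample data this should report three people, two purchasers, three orders and a revenue of 20. Note that recent pandas versions sum an all-NaN group to 0 rather than NaN, so the amount for b@gmail.com may show as 0 in per_person.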
