Reputation: 45
I have a data set like below on emails and purchases.
Email Purchaser order_id amount
[email protected] [email protected] 1 5
[email protected]
[email protected] [email protected] 2 10
[email protected] [email protected] 3 5
I want to find the total number of people in the data set, the number of people who purchased and the total number of orders and total revenue amount. I know how to do it via SQL
using left join
and aggregate functions but I do not know how to replicate this using Python
/pandas
.
For Python
, I attempted this using pandas
and numpy
:
table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})
table1.agg({'Email': 'count', 'Purchaser': 'count', 'amount': np.sum, 'order_id': 'count'})
The problem is - it is only returning the rows with an order (1st row and 3rd) but not the other ones (2nd row)
Email Purchaser order_id amount
[email protected] [email protected] 1 5
[email protected] [email protected] 2 15
The SQL
query should look like this:
SELECT count(Email) as num_ind, count(Purchaser) as num_purchasers, sum(order) as orders , sum(amount) as revenue
FROM
(SELECT Email, Purchaser, count(order_id) as order, sum(amount) as amount
FROM table 1
GROUP BY Email, Purchaser) x
How can I replicate it in Python
?
Upvotes: 4
Views: 4249
Reputation: 863166
It is not implemented in pandas now - see.
So one awful solution is replace NaN
to some string and after agg
replace back to NaN
:
table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy')
print table
Email Purchaser order_id amount
0 [email protected] [email protected] 1 5
1 [email protected] NaN NaN NaN
2 [email protected] [email protected] 2 10
3 [email protected] [email protected] 3 5
table['Purchaser'] = table['Purchaser'].replace(np.nan, 'dummy')
print table
Email Purchaser order_id amount
0 [email protected] [email protected] 1 5
1 [email protected] dummy NaN NaN
2 [email protected] [email protected] 2 10
3 [email protected] [email protected] 3 5
table1 = table.groupby(['Email', 'Purchaser']).agg({'amount': np.sum, 'order_id': 'count'})
print table1
order_id amount
Email Purchaser
[email protected] [email protected] 1 5
[email protected] dummy 0 NaN
[email protected] [email protected] 2 15
table1 = table1.reset_index()
table1['Purchaser'] = table1['Purchaser'].replace('dummy', np.nan)
print table1
Email Purchaser order_id amount
0 [email protected] [email protected] 1 5
1 [email protected] NaN 0 NaN
2 [email protected] [email protected] 2 15
Upvotes: 4