user11846775

Merging DataFrames in Pandas and Numpy

I have two different data frames pertaining to sales analytics. I would like to merge them together to make a new data frame with the columns customer_id, name, and total_spend. The two data frames are as follows:

import pandas as pd
import numpy as np

customers = pd.DataFrame([[100, 'Prometheus Barwis', '[email protected]',
        '(533) 072-2779'],[101, 'Alain Hennesey', '[email protected]',
        '(942) 208-8460'],[102, 'Chao Peachy', '[email protected]',
        '(510) 121-0098'],[103, 'Somtochukwu Mouritsen',
        '[email protected]','(669) 504-8080'],[104,
        'Elisabeth Berry', '[email protected]','(802) 973-8267']],
        columns = ['customer_id', 'name', 'email', 'phone'])

orders = pd.DataFrame([[1000, 100, 144.82], [1001, 100, 140.93],
       [1002, 102, 104.26], [1003, 100, 194.6 ], [1004, 100, 307.72],
       [1005, 101,  36.69], [1006, 104,  39.59], [1007, 104, 430.94],
       [1008, 103,  31.4 ], [1009, 104, 180.69], [1010, 102, 383.35],
       [1011, 101, 256.2 ], [1012, 103, 930.56], [1013, 100, 423.77],
       [1014, 101, 309.53], [1015, 102, 299.19]],
       columns = ['order_id', 'customer_id', 'order_total'])

When I group by customer_id and order_id I get the following table:

customer_id  order_id  order_total

100           1000       144.82
              1001       140.93
              1003       194.60
              1004       307.72
              1013       423.77
101           1005       36.69
              1011       256.20
              1014       309.53
102           1002       104.26
              1010       383.35
              1015       299.19
103           1008       31.40
              1012       930.56
104           1006       39.59
              1007       430.94
              1009       180.69

This is where I get stuck. I do not know how to sum all of the orders for each customer_id to create the total_spend column. If anyone knows a way to do this, it would be much appreciated!

Upvotes: 2

Views: 283

Answers (3)

fpersyn

Reputation: 1096

df_merge = (customers
            .merge(orders, how='left', on='customer_id')
            .filter(['customer_id', 'name', 'order_total']))
df_merge = df_merge.groupby(['customer_id', 'name']).sum()
df_merge = df_merge.rename(columns={'order_total': 'total_spend'})
df_merge.sort_values(['total_spend'], ascending=False)

Results in:

                                    total_spend
customer_id name    
100         Prometheus Barwis       1211.84
103         Somtochukwu Mouritsen   961.96
102         Chao Peachy             786.80
104         Elisabeth Berry         651.22
101         Alain Hennesey          602.42

A step-by-step explanation:

  1. Start by merging your orders table onto your customers table using a left join. For this you will need pandas' .merge() method. Be sure to set the how argument to left because the default merge type is inner (which would ignore customers with no orders).

    This step requires some basic understanding of SQL-style merge methods. You can find a good visual overview of the various merge types in this thread.

  2. You can chain the .filter() method onto your merge to keep only your columns of interest (in your case: customer_id, name and order_total).
  3. Now that you have your merged table, you still need to sum up all the order_total values per customer. To achieve this, group by the columns that identify a customer (customer_id and name) using .groupby() and then apply an aggregation method to the remaining numeric column (.sum() in this case).

    The .groupby() documentation link above provides some more examples on this. It is also worth knowing that this is a pattern referred to as "split-apply-combine" in the pandas documentation.

  4. Next you will need to rename your numeric column from order_total to total_spend using the .rename() method and setting its columns argument.
  5. And last, but not least, sort your customers by your total_spend column using .sort_values().
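
To illustrate why step 1 uses a left join, here is a minimal sketch on a cut-down copy of the data. The customer with id 105 and the names customers_demo, orders_demo and total_spend() are made up for this illustration and are not part of the question's data.

```python
import pandas as pd

# Small hypothetical tables; customer 105 is invented and has no orders.
customers_demo = pd.DataFrame(
    [[100, 'Prometheus Barwis'], [105, 'Hypothetical Customer']],
    columns=['customer_id', 'name'])
orders_demo = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93]],
    columns=['order_id', 'customer_id', 'order_total'])

def total_spend(how):
    # Merge, keep the columns of interest, then sum per customer.
    merged = customers_demo.merge(orders_demo, how=how, on='customer_id')
    merged = merged.filter(['customer_id', 'name', 'order_total'])
    totals = merged.groupby(['customer_id', 'name'], as_index=False).sum()
    return totals.rename(columns={'order_total': 'total_spend'})

left_totals = total_spend('left')    # keeps customer 105; its NaN order sums to 0.0
inner_totals = total_spend('inner')  # customer 105 is dropped entirely
print(left_totals)
print(inner_totals)
```

With the left join, a customer without orders still appears with a total of 0.0; with the inner join that customer disappears from the result.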

I hope that helps.

Upvotes: 1

jpf5046

Reputation: 797

You can create an additional lookup table and then merge it back into your current output.

# group by customer id and order id to match your current output;
# reset_index() keeps both keys as ordinary columns so the merge below retains order_id
df = orders.groupby(['customer_id', 'order_id']).sum().reset_index()

# create a new lookup table holding the total by customer
totalbycust = orders.groupby('customer_id').sum()
totalbycust = totalbycust.reset_index()

# only keep the columns you want
totalbycust = totalbycust[['customer_id', 'order_total']]

# merge back to your current table
df = df.merge(totalbycust, on='customer_id')
df = df.rename(columns={"order_total_x": "order_total", "order_total_y": "order_amount_by_cust"})

# expected output
df
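
As a possible alternative to the lookup-table-and-merge approach, a grouped transform can attach the per-customer total to every order row in one step. This is only a sketch on a three-row sample; orders_demo is a name made up for this illustration, and order_amount_by_cust simply reuses the column name from the code above.

```python
import pandas as pd

# Three-row sample of the orders table (values taken from the question).
orders_demo = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1002, 102, 104.26]],
    columns=['order_id', 'customer_id', 'order_total'])

# transform('sum') returns one value per original row, aligned to the
# original index, so it can be assigned directly as a new column.
orders_demo['order_amount_by_cust'] = (
    orders_demo.groupby('customer_id')['order_total'].transform('sum'))
print(orders_demo)
```

Each order row keeps its own order_total and also carries the sum for its customer, without an intermediate table.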


Upvotes: 1

moys

Reputation: 8033

IIUC, you can do something like the following:

orders.groupby('customer_id')['order_total'].sum().reset_index(name='Customer_Total')

Output

   customer_id  Customer_Total
0          100         1211.84
1          101          602.42
2          102          786.80
3          103          961.96
4          104          651.22
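
To get from these totals to the customer_id, name, total_spend layout the question asks for, one option is to merge the totals onto the customers table. This is a sketch on cut-down sample data; customers_demo, orders_demo, totals and result are names made up for this illustration.

```python
import pandas as pd

# Cut-down copies of the two tables (values taken from the question).
customers_demo = pd.DataFrame(
    [[100, 'Prometheus Barwis'], [101, 'Alain Hennesey']],
    columns=['customer_id', 'name'])
orders_demo = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1005, 101, 36.69]],
    columns=['order_id', 'customer_id', 'order_total'])

# Sum the order totals per customer, naming the result column total_spend.
totals = (orders_demo.groupby('customer_id')['order_total']
          .sum()
          .reset_index(name='total_spend'))

# Attach the totals to the customer names; how='left' keeps every customer.
result = customers_demo[['customer_id', 'name']].merge(
    totals, on='customer_id', how='left')
print(result)
```

Note that with this ordering a customer without any orders would get NaN rather than 0.0 in total_spend; .fillna(0) could be chained on if zeros are preferred.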

Upvotes: 1
