Python Pandas Dataframe - Optimize Search for id in another Dataframe

Question

I have two dataframes, orders and customers.

For each order, I want to find the customer whose linkedCustomers column contains the order's custId. The linkedCustomers field is an array of customer ids.

The orders dataframe contains approximately 5,800,000 rows and the customers dataframe approximately 180,000 rows.

The following code runs, but it is very slow. How can I speed it up?


import pandas as pd

# demo data -- In the real scenario this data was read from csv-/json files.
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7], 'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})


def getMergeCustomerID(row):
    customerOrderId = row['custId']
    # scan every customer row for a linked set containing this custId
    isLinked = customers['linkedCustomers'].apply(lambda linked: customerOrderId in linked)
    searchMasterCustomer = customers.loc[isLinked, 'id']
    if len(searchMasterCustomer) > 0:
        return searchMasterCustomer.iloc[0]  # first matching master id
    else:
        return customerOrderId  # no match: keep the original id

orders['newId'] = orders.apply(getMergeCustomerID, axis=1)


# expected result
custId  orderId  newId
     1        2      5
     2        3      5
     3        4      7
     4        5      6

jimifiki · Accepted Answer

I think this approach can solve your problem in some circumstances. First, build a dictionary that maps every linked customer id to its master customer id:

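# map every linked customer id to its master customer id
myDict = {}
for masterId, linkedIds in zip(customers['id'], customers['linkedCustomers']):
    for linkedId in linkedIds:
        myDict[linkedId] = masterId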

then use the dictionary to create the new column, keeping the original id when no master customer is found:

orders['newId'] = [myDict.get(i, i) for i in orders['custId']]

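IMO this should speed up your program, because each order becomes a single dictionary lookup instead of a scan over the whole customers dataframe. It is not the most generic solution, though: if an id appears in more than one linkedCustomers entry, the last one processed wins. Better answers are welcome!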
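The same lookup idea can probably be expressed with pandas alone. The following is only a sketch under assumptions: linkedCustomers holds sets of integer ids as in the demo data, the names pairs and lookup are made up for illustration, and explode with ignore_index needs a reasonably recent pandas (1.1 or later). I have not timed it against the real data.

```python
import pandas as pd

# demo data from the question
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7],
                          'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})

# One row per (master id, linked id) pair. The sets are turned into
# lists first so that explode treats every cell as list-like.
pairs = (customers
         .assign(linkedCustomers=customers['linkedCustomers'].map(list))
         .explode('linkedCustomers', ignore_index=True))

# Series indexed by linked id, holding the master id (hypothetical name).
lookup = pd.Series(pairs['id'].to_numpy(),
                   index=pairs['linkedCustomers'].astype('int64'))

# Assumption: if a linked id occurs in several sets, keep the first one.
lookup = lookup[~lookup.index.duplicated(keep='first')]

# Look up every custId at once; ids without a master keep their own value.
orders['newId'] = (orders['custId'].map(lookup)
                   .fillna(orders['custId'])
                   .astype('int64'))
```

On the demo data this gives the same newId column as the expected result in the question. Whether it beats the plain dictionary on 5.8 million rows would have to be measured.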
