user8606929

Reputation: 225

Python Pandas Dataframe - Optimize Search for id in another Dataframe

The following scenario is given.

I have two DataFrames called orders and customers.

For each order, I want to find the customer whose linkedCustomers column contains the order's custId. The linkedCustomers field is an array of customer IDs. If no customer links to that custId, the custId itself should be kept.

The orders DataFrame contains approximately 5,800,000 rows. The customers DataFrame contains approximately 180,000 rows.

I am looking for a way to optimize the following code: it runs, but it is very slow. How can I speed it up?


import pandas as pd

# demo data -- in the real scenario this data was read from CSV/JSON files.
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2,3,4,5]})
customers = pd.DataFrame({'id':[5,6,7], 'linkedCustomers': [{1,2}, {4,5,6}, {3, 7, 8, 9}]})


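def getMergeCustomerID(row):
    customerOrderId = row['custId']
    # keep the customers whose linkedCustomers set contains this order's custId
    # (a membership test; .str.contains only works on strings and matches substrings)
    isLinked = customers['linkedCustomers'].apply(lambda linked: customerOrderId in linked)
    searchMasterCustomer = customers.loc[isLinked, 'id']
    if len(searchMasterCustomer) > 0:
        # return the scalar id of the first match, not the whole Series
        return searchMasterCustomer.iloc[0]
    else:
        return customerOrderId

# scans all customers once per order, which is why this is slow
orders['newId'] = orders.apply(getMergeCustomerID, axis=1)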


# expected result
  custId  orderId  newId
   1        2        5
   2        3        5
   3        4        7
   4        5        6

Upvotes: 3

Views: 203

Answers (1)

jimifiki

Reputation: 5534

I think this approach can solve your problem in some circumstances: first build a dictionary that maps every linked customer ID to the id of the customer row it belongs to,

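myDict = {}
for _, row in customers.iterrows():
    # access the columns by name; positional row[0] / row[1] is deprecated in recent pandas
    for linkedId in row['linkedCustomers']:
        myDict[linkedId] = row['id']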

then use the dictionary to create the new column:

# .get(i, i) keeps the original custId when it is not linked to any customer
orders['newId'] = [myDict.get(i, i) for i in orders['custId']]

IMO, even though this can solve your problem (by speeding up your program), it is not the most generic solution. Better answers are welcome!
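The same dictionary idea can probably be written without the explicit Python loops. The following is only a sketch, assuming linkedCustomers holds real Python sets or lists (as in the demo data) and that each linked ID belongs to at most one customer. The helper name add_new_id is made up for illustration and is not from the question.

```python
import pandas as pd

# demo data from the question
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7],
                          'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})


def add_new_id(orders, customers):
    """Hypothetical helper: return a copy of orders with a newId column."""
    # one row per (customer id, linked customer ID) pair
    flat = customers.explode('linkedCustomers')
    # linked ID -> master customer id; if an ID appears twice, the last pair wins
    lookup = dict(zip(flat['linkedCustomers'], flat['id']))
    # map() yields NaN for unlinked custIds, so fall back to the custId itself
    mapped = orders['custId'].map(lookup)
    return orders.assign(newId=mapped.fillna(orders['custId']).astype('int64'))


result = add_new_id(orders, customers)
print(result)
```

The lookup is built once from the 180,000 customer rows, and each of the 5,800,000 orders then costs a single hash lookup instead of a scan over all customers. I have not timed it on data of that size.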

Upvotes: 1
