ROma
ROma

Reputation: 101

Merge on multiple or conditions in python

I have two dataframes. one has product ID information and the other is a master data frame with a zone mapping along with a mapping ID.

# this is dummy dataframe for example
product_df= pd.DataFrame([['abc','+1235']['cshdgs','+1352648'],['gdsfsn','+1232455']],columns='['roll','prod_id'])

master_df=pd.DataFrame([['AZ','32'],['WW','123'],['RT','12'],['PO','13'],['SZ','1352']], columns=['Zone','match_id']

I want to get Zone information alongside the product_df records. and the logic for that is (inSQL):

select product_df.* left join master_df.Zone where '+'||match_id=substr(prod_id,1,2) or  '+'||match_id=substr(prod_id,1,3) or  '+'||match_id=substr(prod_id,1,4) or  '+'||match_id=substr(prod_id,1,5)

Basically this will be a merge on condition situation. I know, that due to this joining logic, multiple zones may be mapped to the same roll. but that is the ask.

I am trying to implement the same in Python. I am using the following code:

master_df_dict=master_df.sert_index('match_id')['Zone'].to_dict
keys_list=['+' + key for key in master_df_dict.keys()]

def zone(pr_id):
    if pr_id[0:2] in keys_list:
         C=keys_list[pr_id[1:2]] # basically getting the zone information using the matched key, and as 
                                 #first character is always plus, starting the index at 1
    elif pr_id[0:3] in keys_list:
         C=keys_list[pr_id[1:3]]
    elif pr_id[0:4] in keys_list:
         C=keys_list[pr_id[1:4]]
    elif pr_id[0:5] in keys_list:
         C=keys_list[pr_id[1:5]]
    else:
         C=''
    return C

product_df['Zone_info']=product_df['prod_id].apply(zone)

There are two problems with this approach:

  1. It will only give me the first matched code, even if later conditions will also match, it will come out of the loop as soon as it matches with a condition.

  2. Using this approach, it is taking me around 45 mins to parse 1700 records.

I need help in the following areas:

  1. why is it taking so long for the above code to work? How can i make it run faster

  2. Can I get exact python execution for the sql logic mentioned above that is joining based on multiple conditions? As far as i have searched, python does not have the fucntionality to merge on conditions as is in sql. Can we have a workaround for this?

  3. If there is no way to merge on all the or conditions, is there a way to merge atleast the first matching condition and make it run faster?

Please help!!

Upvotes: 2

Views: 505

Answers (1)

Rob Raymond
Rob Raymond

Reputation: 31226

Your matching criteria is not really a key. In these cases I form a Cartesian product, then establish a way to find unique rows required. This is pretty equivalent to a way that a relational database join works and especially in join is an inefficient expression.

  1. Cartesian product idiom
  2. derive 3 additional columns a) expr - regular expression to match match_id to prod_id b) just length to be used on sort_values() c) join true if match_id lines up with prod_id on the criteria you specified
  3. sort and take first interesting record
  4. reset values where no join was found
  5. drop the columns used to get this working

Better solution would to have a reliable join key...

import re
product_df= pd.DataFrame(
    [['abc','+1235'],['cshdgs','+1352648'],['gdsfsn','+1232455']],columns=['roll','prod_id'])

master_df=pd.DataFrame([['AZ','32'],['WW','123'],['RT','12'],['PO','13'],['SZ','1352']], columns=['Zone','match_id'])

cp = product_df.assign(foo=1).merge(master_df.assign(foo=1)).drop("foo",1)

cp["len"] = cp.match_id.str.len()
cp["expr"] = cp.apply(lambda r: "^[+]" + "".join([f"[{c}]{'' if i<2 else '?'}" for i, c in enumerate(r.match_id)]), axis=1)
cp["join"] = cp.apply(lambda r: re.search(r.expr, r.prod_id) is not None, axis=1)
cp = cp.sort_values(["Zone", "join", "len"], ascending=[True, False, False]).reset_index()\
    .groupby(["Zone"]).first().reset_index()
cp.loc[~cp["join"],("roll","prod_id")] = ""
cp.drop(["len","expr","join","index"], axis=1)

output

Zone    roll    prod_id match_id
0   AZ          32
1   PO  cshdgs  +1352648    13
2   RT  abc +1235   12
3   SZ  cshdgs  +1352648    1352
4   WW  abc +1235   123

Upvotes: 2

Related Questions