Reputation: 101
I have two dataframes: one has product ID information and the other is a master dataframe with a zone mapping along with a mapping ID.
# these are dummy dataframes for the example
product_df = pd.DataFrame([['abc','+1235'],['cshdgs','+1352648'],['gdsfsn','+1232455']], columns=['roll','prod_id'])
master_df = pd.DataFrame([['AZ','32'],['WW','123'],['RT','12'],['PO','13'],['SZ','1352']], columns=['Zone','match_id'])
I want to get the Zone information alongside the product_df records, and the logic for that is (in SQL):
select product_df.*, master_df.Zone
from product_df
left join master_df
  on '+'||match_id = substr(prod_id,1,2)
  or '+'||match_id = substr(prod_id,1,3)
  or '+'||match_id = substr(prod_id,1,4)
  or '+'||match_id = substr(prod_id,1,5)
Basically this is a merge-on-condition situation. I know that, due to this joining logic, multiple zones may be mapped to the same roll, but that is the ask.
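To make the intended semantics concrete, here is a small sketch (same dummy data) of the OR'd prefix conditions expressed as a cross join plus a filter; `how='cross'` needs pandas >= 1.2:

```python
import pandas as pd

product_df = pd.DataFrame(
    [['abc', '+1235'], ['cshdgs', '+1352648'], ['gdsfsn', '+1232455']],
    columns=['roll', 'prod_id'])
master_df = pd.DataFrame(
    [['AZ', '32'], ['WW', '123'], ['RT', '12'], ['PO', '13'], ['SZ', '1352']],
    columns=['Zone', 'match_id'])

# cross join, then keep the rows where '+' + match_id is a prefix of
# prod_id -- equivalent to the OR of the four substr() comparisons
cross = product_df.merge(master_df, how='cross')
mask = cross.apply(
    lambda r: 1 <= len(r.match_id) <= 4
    and r.prod_id.startswith('+' + r.match_id),
    axis=1)
result = product_df.merge(cross[mask], on=['roll', 'prod_id'], how='left')
print(result[['roll', 'prod_id', 'Zone', 'match_id']])
```

As the question expects, a roll can appear several times in `result`, once per matching zone.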
I am trying to implement the same in Python. I am using the following code:
master_df_dict = master_df.set_index('match_id')['Zone'].to_dict()
keys_list = ['+' + key for key in master_df_dict.keys()]
def zone(pr_id):
    if pr_id[0:2] in keys_list:
        # getting the zone information using the matched key; as the
        # first character is always the plus, the dict key starts at index 1
        C = master_df_dict[pr_id[1:2]]
    elif pr_id[0:3] in keys_list:
        C = master_df_dict[pr_id[1:3]]
    elif pr_id[0:4] in keys_list:
        C = master_df_dict[pr_id[1:4]]
    elif pr_id[0:5] in keys_list:
        C = master_df_dict[pr_id[1:5]]
    else:
        C = ''
    return C
product_df['Zone_info'] = product_df['prod_id'].apply(zone)
There are two problems with this approach:
It only gives me the first matched code; even if later conditions would also match, it comes out of the chain as soon as one condition matches.
Using this approach, it takes around 45 minutes to parse 1700 records.
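For reference, a minimal sketch of the same prefix idea, but with dict lookups instead of a list scan and collecting every match rather than the first (assumes the same dummy frames; `zones` is a hypothetical helper name):

```python
import pandas as pd

product_df = pd.DataFrame(
    [['abc', '+1235'], ['cshdgs', '+1352648'], ['gdsfsn', '+1232455']],
    columns=['roll', 'prod_id'])
master_df = pd.DataFrame(
    [['AZ', '32'], ['WW', '123'], ['RT', '12'], ['PO', '13'], ['SZ', '1352']],
    columns=['Zone', 'match_id'])

# one O(1) dict lookup per prefix length, instead of scanning a list
zone_by_key = {'+' + m: z
               for z, m in zip(master_df['Zone'], master_df['match_id'])}

def zones(pr_id):
    # every zone whose '+'-prefixed match_id is a 2-5 character prefix
    return [zone_by_key[pr_id[:n]]
            for n in range(2, 6) if pr_id[:n] in zone_by_key]

product_df['Zone_info'] = product_df['prod_id'].apply(zones)
```

The `key in keys_list` test in the code above is a linear scan of the list for every row; a dict membership test is constant time.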
I need help in the following areas:
Why is it taking so long for the above code to run, and how can I make it faster?
Can I get an exact Python equivalent of the SQL logic mentioned above, that is, joining based on multiple conditions? As far as I have searched, pandas does not have the functionality to merge on conditions as SQL does. Is there a workaround for this?
If there is no way to merge on all the OR conditions, is there a way to merge on at least the first matching condition and make it run faster?
Please help!!
Upvotes: 2
Views: 505
Reputation: 31226
Your matching criteria are not really a key. In these cases I form a Cartesian product, then establish a way to find the unique rows required. This is roughly equivalent to the way a relational database performs a join when the join condition is an inefficient (non-equi) expression. The steps:
a) form the Cartesian product of the two dataframes
b) sort_values() so the best candidate for each Zone comes first
c) join is True if match_id lines up with prod_id on the criteria you specified
A better solution would be to have a reliable join key...
import re
product_df= pd.DataFrame(
[['abc','+1235'],['cshdgs','+1352648'],['gdsfsn','+1232455']],columns=['roll','prod_id'])
master_df=pd.DataFrame([['AZ','32'],['WW','123'],['RT','12'],['PO','13'],['SZ','1352']], columns=['Zone','match_id'])
cp = product_df.assign(foo=1).merge(master_df.assign(foo=1)).drop(columns="foo")  # Cartesian product via a dummy key
cp["len"] = cp.match_id.str.len()
cp["expr"] = cp.apply(lambda r: "^[+]" + "".join([f"[{c}]{'' if i<2 else '?'}" for i, c in enumerate(r.match_id)]), axis=1)
cp["join"] = cp.apply(lambda r: re.search(r.expr, r.prod_id) is not None, axis=1)
cp = cp.sort_values(["Zone", "join", "len"], ascending=[True, False, False]).reset_index()\
.groupby(["Zone"]).first().reset_index()
cp.loc[~cp["join"],("roll","prod_id")] = ""
cp.drop(["len","expr","join","index"], axis=1)
output
Zone roll prod_id match_id
0 AZ 32
1 PO cshdgs +1352648 13
2 RT abc +1235 12
3 SZ cshdgs +1352648 1352
4 WW abc +1235 123
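To see what the generated expr patterns look like, a quick standalone check (`build_expr` is just the lambda above pulled out as a named function):

```python
import re

def build_expr(match_id):
    # same construction as the lambda above: the first two digits are
    # mandatory, every later digit becomes an optional character class
    return "^[+]" + "".join(
        f"[{c}]{'' if i < 2 else '?'}" for i, c in enumerate(match_id))

print(build_expr("1352"))  # ^[+][1][3][5]?[2]?
print(re.search(build_expr("1352"), "+1352648") is not None)  # True
print(re.search(build_expr("12"), "+1235") is not None)       # True
```

Because the trailing digits are optional, a pattern can also match shorter prefixes; the sort on match length descending followed by first() is what keeps the most specific match per Zone.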
Upvotes: 2