Reputation: 1
I need some tips/clues on how it would be possible to match two dataframes in Python on multiple criteria that some of them are looking for values within range. The example:
1.First dataframe is a contract where we have locations from country, from postal code low, from postal code high, to country, to postal code low, to postal code high, contract ID.
from country from postal code low from postal code high to country to postal code low to postal code high ID
SE 0 19999 DE 90000 99999 ID1
SE 20000 29999 DE 90000 99999 ID2
SE 30000 39999 DE 90000 99999 ID3
SE 40000 49999 DE 90000 99999 ID4
SE 41250 41250 DE 90000 99999 ID5
2.Second dataframe is statistics file with exact postal code from and to and I need to find all unique matches from the first database:
From country from postal code to country To postal code ID (that should be the result):
SE 21789 DE 91000 ID2
SE 41250 DE 91000 ID4, ID5
In excel it could be done by Index and Match and array function, since both dataframes involves tens or even hundred of thousands rows, doing it in excel is not the optimal option, so looking to Python solution. I am a beginner in Python so at least a clue on what should I look into would be helpful. I know pandas merge function, but it requires exact match of columns, is it pandassql that could help, but I am not sure how can I refer to another DB as criteria for Where postal code <= postal code low & postal code >= postal code high?
Any tips?
Visualization of df:
Upvotes: 0
Views: 644
Reputation: 22503
One way is to do a nested list comprehension:
df2["ID"] = [[idx for lo, hi, idx in zip(df["from postal code low"],
df["from postal code high"],
df["ID"]) if i in range(lo, hi+1)]
for i in df2["from postal code"]]
print (df2)
From country from postal code to country To postal code ID
0 SE 21789 DE 91000 [ID2]
1 SE 41250 DE 91000 [ID4, ID5]
Upvotes: 2
Reputation: 3739
One approach would be to directly iterate though all options, in this case added range cols for a shorter if syntax at the loop:
contract_df['from_range'] = [range(t[0], t[1]) for t in zip(contract_df['from_postal_code_low'], contract_df['from_postal_code_high']+1)]
contract_df['to_range'] = [range(t[0], t[1]) for t in zip(contract_df['to_postal_code_low'], contract_df['to_postal_code_high']+1)]
lol = []
for r in stat_df.iterrows():
l = []
for r1 in contract_df.iterrows():
if (r[1]['from_country'] == r1[1]['from_country']) and (r[1]['to_country'] == r1[1]['to_country']) and (r[1]['from_postal_code'] in r1[1]['from_range']) and (r[1]['to_postal_code'] in r1[1]['to_range']):
l.append(r1[1]['ID'])
lol.append(l)
stat_df['ID'] = lol
Upvotes: 0