Laura
Laura

Reputation: 1

Match two dataframes on multiple criteria within range of values in Python

I need some tips/clues on how it would be possible to match two dataframes in Python on multiple criteria that some of them are looking for values within range. The example:

1.First dataframe is a contract where we have locations from country, from postal code low, from postal code high, to country, to postal code low, to postal code high, contract ID.

from country from postal code low from postal code high to country to postal code low to postal code high ID
SE         0         19999            DE      90000         99999   ID1
SE     20000         29999            DE      90000         99999   ID2
SE     30000         39999            DE      90000         99999   ID3
SE     40000         49999            DE      90000         99999   ID4
SE     41250         41250            DE      90000         99999   ID5

2.Second dataframe is statistics file with exact postal code from and to and I need to find all unique matches from the first database:

From country  from postal code    to country  To postal code   ID (that should be the result):
SE              21789              DE          91000            ID2
SE              41250              DE          91000            ID4, ID5

In excel it could be done by Index and Match and array function, since both dataframes involves tens or even hundred of thousands rows, doing it in excel is not the optimal option, so looking to Python solution. I am a beginner in Python so at least a clue on what should I look into would be helpful. I know pandas merge function, but it requires exact match of columns, is it pandassql that could help, but I am not sure how can I refer to another DB as criteria for Where postal code <= postal code low & postal code >= postal code high?

Any tips?

Visualization of df:

enter image description here

Upvotes: 0

Views: 644

Answers (2)

Henry Yik
Henry Yik

Reputation: 22503

One way is to do a nested list comprehension:

df2["ID"] = [[idx for lo, hi, idx in zip(df["from postal code low"],
                                         df["from postal code high"],
                                         df["ID"]) if i in range(lo, hi+1)]
             for i in df2["from postal code"]]

print (df2)

  From country  from postal code to country  To postal code          ID
0           SE             21789         DE           91000       [ID2]
1           SE             41250         DE           91000  [ID4, ID5]

Upvotes: 2

Ezer K
Ezer K

Reputation: 3739

One approach would be to directly iterate though all options, in this case added range cols for a shorter if syntax at the loop:

contract_df['from_range'] = [range(t[0], t[1]) for t in zip(contract_df['from_postal_code_low'], contract_df['from_postal_code_high']+1)]
contract_df['to_range'] = [range(t[0], t[1]) for t in zip(contract_df['to_postal_code_low'], contract_df['to_postal_code_high']+1)]

lol = []
for r in stat_df.iterrows():
    l = []
    for r1 in contract_df.iterrows():
        if (r[1]['from_country'] == r1[1]['from_country']) and (r[1]['to_country'] == r1[1]['to_country']) and (r[1]['from_postal_code'] in r1[1]['from_range']) and (r[1]['to_postal_code'] in r1[1]['to_range']):
            l.append(r1[1]['ID'])
    lol.append(l)

stat_df['ID'] = lol

Upvotes: 0

Related Questions