workah0lic
workah0lic

Reputation: 91

Match two pandas dataframe depending on multiple conditions

I have two datasets (df_persons and df_database). Both of them have the same structure:

cu_id sex eye_colour favourite_sport cash_on_account
1 m blue soccer 15
2 f green tennis 25
3 m brown ski 33

(much more rows with various combinations of sex, eye_colour and favourite_sport)

For each individuals of the rows/cu_ids in persons, I'm looking for a similar match in database.

There are certain rules to follow:

Loop through every row/cu_id from persons

In other words, we are looking (in another table) for the most similar user for persons. Every user from the database can only be used once. The order of comparison is important (only if sex+eye_colour+favourite_sport are matching, it's a match - otherwise only if sex+eye_colour or even just sex. Matching sex+favourite_sport is NO valid match).

import pandas as pd
 
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
 
# Create the pandas DataFrame
database = pd.DataFrame(db, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(data, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])

I simply can't wrap my head around that problem, without using a for loop and extensive comparisons/filtering (especially because I'm just allowed to match every user from database just once).

Is there any guidance you could offer for problems like that?

Best regards, worky

Upvotes: 0

Views: 749

Answers (1)

Onur Guven
Onur Guven

Reputation: 640

Would merging these dataframes on your conditions in the given order and only checking non matched ones in the next merge work for you? Like this:

database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
 
# Create the pandas DataFrame
database = pd.DataFrame(database, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(persons, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])

m1 = persons.merge(database, how='inner', on=['sex', 'eye_colour', 'favourite_sport']) \
    .sort_values('cash_on_account_x', ascending=False) \
    .drop_duplicates(subset='cu_id_x', keep='first') \
    .reset_index(drop=True)

persons2 = persons[~persons['cu_id'].isin(m1.cu_id_x.unique())]

m2 = persons2.merge(database, how='inner', on=['sex', 'eye_colour']) \
    .sort_values('cash_on_account_x', ascending=False) \
    .drop_duplicates(subset='cu_id_x', keep='first') \
    .reset_index(drop=True)

persons3 = persons2[~persons2['cu_id'].isin(m2.cu_id_x.unique())]

m3 = persons3.merge(database, how='inner', on=['sex']) \
    .sort_values('cash_on_account_x', ascending=False) \
    .drop_duplicates(subset='cu_id_x', keep='first') \
    .reset_index(drop=True)

m1['match_type'] = 'match_type_one'
m2['match_type'] = 'match_type_two'
m3['match_type'] = 'match_type_three'

cols = ['cu_id_x', 'match_type']

final_df = pd.concat([m1[cols], m2[cols], m3[cols]])

Upvotes: 2

Related Questions