Reputation: 91
I have two datasets (df_persons and df_database). Both of them have the same structure:
cu_id | sex | eye_colour | favourite_sport | cash_on_account |
---|---|---|---|---|
1 | m | blue | soccer | 15 |
2 | f | green | tennis | 25 |
3 | m | brown | ski | 33 |
(much more rows with various combinations of sex, eye_colour and favourite_sport)
For each individuals of the rows/cu_ids in persons, I'm looking for a similar match in database.
There are certain rules to follow:
Loop through every row/cu_id from persons
In other words, we are looking (in another table) for the most similar user for persons. Every user from the database can only be used once. The order of comparison is important (only if sex+eye_colour+favourite_sport are matching, it's a match - otherwise only if sex+eye_colour or even just sex. Matching sex+favourite_sport is NO valid match).
import pandas as pd
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
# Create the pandas DataFrame
database = pd.DataFrame(db, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(data, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
I simply can't wrap my head around that problem, without using a for loop and extensive comparisons/filtering (especially because I'm just allowed to match every user from database just once).
Is there any guidance you could offer for problems like that?
Best regards, worky
Upvotes: 0
Views: 749
Reputation: 640
Would merging these dataframes on your conditions in the given order and only checking non matched ones in the next merge work for you? Like this:
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
# Create the pandas DataFrame
database = pd.DataFrame(database, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(persons, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
m1 = persons.merge(database, how='inner', on=['sex', 'eye_colour', 'favourite_sport']) \
.sort_values('cash_on_account_x', ascending=False) \
.drop_duplicates(subset='cu_id_x', keep='first') \
.reset_index(drop=True)
persons2 = persons[~persons['cu_id'].isin(m1.cu_id_x.unique())]
m2 = persons2.merge(database, how='inner', on=['sex', 'eye_colour']) \
.sort_values('cash_on_account_x', ascending=False) \
.drop_duplicates(subset='cu_id_x', keep='first') \
.reset_index(drop=True)
persons3 = persons2[~persons2['cu_id'].isin(m2.cu_id_x.unique())]
m3 = persons3.merge(database, how='inner', on=['sex']) \
.sort_values('cash_on_account_x', ascending=False) \
.drop_duplicates(subset='cu_id_x', keep='first') \
.reset_index(drop=True)
m1['match_type'] = 'match_type_one'
m2['match_type'] = 'match_type_two'
m3['match_type'] = 'match_type_three'
cols = ['cu_id_x', 'match_type']
final_df = pd.concat([m1[cols], m2[cols], m3[cols]])
Upvotes: 2