Reputation: 1511
I have a dataframe that shows for every row a human pair (name_1 and name_2) together with the corresponding score. The score is a numeric value and represents how well these two people fit together. The higher the score, the better the match between person 1 (name_1) and person 2 (name_2).
As you can see, some names can be found twice or more. Of course, one person can only be matched once. My goal is, to find as many pairs in the dataframe as possible and write each of them into a second dataframe.
The problem that makes me struggle is this:
I think I can get max. 8 pairs out of the dataframe since I have 8 different names in the first column. Unfortunately, the scores for best matches are not clearly separated. One person can match with multiple other persons whereas other persons only can match to one specific person. I am not much interested in the matching-score. I am interested in not loosing any person due to a bad choice of pair combination.
I a looking for a way to find and extract as many pairs of of the dataframe.
This is the dataframe df:
name_1 name_2 score
27 allen jolly 1.8
23 anna rock 2.8
22 anna christina 1.1
26 christina rock 2.3
24 christina allen 1.4
25 christina jolly 1.4
18 emily rock 3.7
15 emily sabastein 3.3
16 emily anna 2.5
17 emily christina 2.4
4 jacob jolly 3.4
1 jacob rick 2.9
3 jacob allen 2.4
0 jacob mary 2.3
2 jacob christina 2.0
7 mary jolly 1.7
5 mary rick 1.4
6 mary christina 1.3
14 rick rock 2.8
9 rick sabastein 2.8
8 rick emily 2.5
13 rick jolly 2.3
11 rick christina 2.1
10 rick anna 2.0
12 rick allen 1.5
21 sabastein rock 3.6
19 sabastein anna 2.8
20 sabastein christina 1.9
I think the best matching in terms of total maximum score is:
emely rock 3.7
jacob jolly 3.4
sabastein anna 2.8
rick allen 1.5
mary christina 1.3
I am not absolutely sure if this is also the maximum number of pairs I can get. If you know how to get best pairs (see above) or maximum number of pairs, I would be really happy to see.
Upvotes: 0
Views: 300
Reputation: 10860
EDIT
In the meantime I found a very convenient function to create a graph from a dataframe, but you should rename your column score
to weight
for this:
The you could simply write:
G = nx.from_pandas_edgelist(df, 'name_1', 'name_2', 'weight')
mate = nx.max_weight_matching(G)
and that's it.
(Rest is still part of our discussion below, how you process the result further on...)
My approach would be
import pandas as pd
import networkx as nx
df['edges'] = df.apply(lambda r: (r.name_1, r.name_2, {'weight': r.score}), axis=1)
G = nx.Graph()
allnames = set(df.loc[:, ['name_1', 'name_2']].values.flatten())
for s in allnames:
G.add_node(s)
G.add_edges_from(df.edges)
mate = nx.max_weight_matching(G)
Result:
res = pd.DataFrame(list(mate), columns=['name_1', 'name_2'])
res['score'] = res.apply(lambda r: G[r[0]][r[1]]['weight'], axis=1)
print(res)
print(f'\nMatchings: {len(res)}\nTotal Score: {res.score.sum():.1f}')
# name_1 name_2 score
#0 rock emily 3.7
#1 rick christina 2.1
#2 mary jacob 2.3
#3 sabastein anna 2.8
#4 jolly allen 1.8
#Matchings: 5
#Total Score: 12.7
DocSources:
For setting up the Graph you already had the correct link.
For the maximum_matching
function see here https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.matching.max_weight_matching.html#networkx.algorithms.matching.max_weight_matching
Upvotes: 1