Reputation: 57
I have two data frames.
df1 includes 4 men and 4 women with their weight and height (inches).
#df1
John, 236, 76
Jack, 204, 74
Jim, 156, 71
Jared, 182, 72
Suzy, 119, 60
Sally, 149, 66
Sharon, 169, 65
Sammy, 182, 75
df2 includes 4 men and 4 women with their weight and height (inches).
#df2
Aaron, 285, 77
Abe, 236, 75
Alex, 178, 72
Adam, 195, 71
Mary, 148, 66
Maylee, 155, 66
Marilyn, 199, 65
Madison, 160, 73
What I am trying to do is have men from df1 be compared to men from df2 to see who they are most like based on height and weight. Just subtract weight from weight and height from height and return an absolute value for each man in df2. More specifically, return the name of the man most similar.
So in this case John's closest match is Abe so in a new column
df1['doppelganger'] = "Abe"
.
I'm a beginner hobbyist so even pointing me in the right direction would be helpful. I've been looking through stack overflow for about five hours trying to figure out how to go about something like this.
Upvotes: 0
Views: 46
Reputation: 863301
First is necessary distinguish men
and women
, here is used new column with repeat 4 times m
and f
. Then is used DataFrame.merge
with outer join by new column for all combinations and created new columns for differences, last column is sum
of them. then sorting by 3 columns by DataFrame.sort_values
, so first row per groups by A
and g
are filtered by DataFrame.drop_duplicates
:
df = (df1.assign(g = ['m']*4 + ['f']*4)
.merge(df2.assign(g = ['m']*4 + ['f']*4), on='g', how='outer', suffixes=('','_'))
.assign(dif1 = lambda x: x['B'].sub(x['B_']).abs(),
dif2 = lambda x: x['C'].sub(x['C_']).abs(),
sumdiff = lambda x: x['dif1'] + x['dif2'])
.sort_values(['A', 'g','sumdiff'])
.drop_duplicates(['A','g'])
.sort_index()
.rename(columns={'A_':'doppelganger'})
)
print (df)
A B C g doppelganger B_ C_ dif1 dif2 sumdiff
1 John 236 76 m Abe 236 75 0 1 1
7 Jack 204 74 m Adam 195 71 9 3 12
10 Jim 156 71 m Alex 178 72 22 1 23
14 Jared 182 72 m Alex 178 72 4 0 4
16 Suzy 119 60 f Mary 148 66 29 6 35
20 Sally 149 66 f Mary 148 66 1 0 1
25 Sharon 169 65 f Maylee 155 66 14 1 15
31 Sammy 182 75 f Madison 160 73 22 2 24
Input DataFrames:
print (df1)
A B C
0 John 236 76
1 Jack 204 74
2 Jim 156 71
3 Jared 182 72
4 Suzy 119 60
5 Sally 149 66
6 Sharon 169 65
7 Sammy 182 75
print (df2)
A B C
0 Aaron 285 77
1 Abe 236 75
2 Alex 178 72
3 Adam 195 71
4 Mary 148 66
5 Maylee 155 66
6 Marilyn 199 65
7 Madison 160 73
Upvotes: 1