mgadfly
mgadfly

Reputation: 57

Compare each row of Pandas df1 with every row within df2 and return string value from closest matching column

I have two data frames.

df1 includes 4 men and 4 women with their weight and height (inches).

#df1
John, 236, 76
Jack, 204, 74
Jim, 156, 71
Jared, 182, 72
Suzy, 119, 60
Sally, 149, 66
Sharon, 169, 65
Sammy, 182, 75

df2 includes 4 men and 4 women with their weight and height (inches).

#df2
Aaron, 285, 77
Abe, 236, 75
Alex, 178, 72
Adam, 195, 71
Mary, 148, 66
Maylee, 155, 66
Marilyn, 199, 65
Madison, 160, 73

What I am trying to do is have men from df1 be compared to men from df2 to see who they are most like based on height and weight. Just subtract weight from weight and height from height and return an absolute value for each man in df2. More specifically, return the name of the man most similar.

So in this case John's closest match is Abe so in a new column df1['doppelganger'] = "Abe".

I'm a beginner hobbyist so even pointing me in the right direction would be helpful. I've been looking through stack overflow for about five hours trying to figure out how to go about something like this.

Upvotes: 0

Views: 46

Answers (1)

jezrael
jezrael

Reputation: 863301

First is necessary distinguish men and women, here is used new column with repeat 4 times m and f. Then is used DataFrame.merge with outer join by new column for all combinations and created new columns for differences, last column is sum of them. then sorting by 3 columns by DataFrame.sort_values, so first row per groups by A and g are filtered by DataFrame.drop_duplicates:

df = (df1.assign(g = ['m']*4 + ['f']*4)
          .merge(df2.assign(g = ['m']*4 + ['f']*4), on='g', how='outer', suffixes=('','_'))
          .assign(dif1 = lambda x: x['B'].sub(x['B_']).abs(),
                  dif2 = lambda x: x['C'].sub(x['C_']).abs(),
                  sumdiff = lambda x: x['dif1'] + x['dif2'])
          .sort_values(['A', 'g','sumdiff'])
          .drop_duplicates(['A','g'])
          .sort_index()
          .rename(columns={'A_':'doppelganger'})
          )
print (df)
         A    B   C  g doppelganger   B_  C_  dif1  dif2  sumdiff
1     John  236  76  m          Abe  236  75     0     1        1
7     Jack  204  74  m         Adam  195  71     9     3       12
10     Jim  156  71  m         Alex  178  72    22     1       23
14   Jared  182  72  m         Alex  178  72     4     0        4
16    Suzy  119  60  f         Mary  148  66    29     6       35
20   Sally  149  66  f         Mary  148  66     1     0        1
25  Sharon  169  65  f       Maylee  155  66    14     1       15
31   Sammy  182  75  f      Madison  160  73    22     2       24

Input DataFrames:

print (df1)    
        A    B   C
0    John  236  76
1    Jack  204  74
2     Jim  156  71
3   Jared  182  72
4    Suzy  119  60
5   Sally  149  66
6  Sharon  169  65
7   Sammy  182  75

print (df2)
         A    B   C
0    Aaron  285  77
1      Abe  236  75
2     Alex  178  72
3     Adam  195  71
4     Mary  148  66
5   Maylee  155  66
6  Marilyn  199  65
7  Madison  160  73

Upvotes: 1

Related Questions