user12907213

Create columns having similarity index values

How can I create columns that show the respective similarity indices for each row?

This code

def func(name):
    matches = try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name) >= 85), axis=1)
    return [try_test.word[i] for i, x in enumerate(matches) if x]

try_test.apply(lambda row: func(row['name']), axis=1)

returns the words from the rows whose name matches the condition >= 85. However, I would also be interested in having the similarity values themselves, comparing each name to all the others.

The dataset is

try_test = pd.DataFrame({'word': ['apple', 'orange', 'diet', 'energy', 'fire', 'cake'], 
                         'name': ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})

Help would be very appreciated.

Expected output (values are just an example)

    word       name        sim_index1 sim_index2 sim_index3 ...index 6
  apple         dog             100       0
  orange        cat                      100 
 ...           mad cat                   0.6           100

On the diagonal there is a value of 100, as I am comparing dog with dog, and so on. I might also consider another approach if you think it would be better.

Upvotes: 1

Views: 175

Answers (1)

Ben.T

Reputation: 29635

IIUC, you can slightly change your function to get what you want:

def func(name):
    return try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name)), axis=1)

print(try_test.apply(lambda row: func(row['name']), axis=1))
     0    1    2    3    4    5
0  100    0   33  100  100    0
1    0  100  100    0   33   33
2   33  100  100   29   43   14
3  100    0   29  100   71    0
4  100   33   43   71  100    0
5    0   33   14    0    0  100

That said, more than half of the calculation is unnecessary, since the result is a symmetric matrix whose diagonal is 100. So if your data is bigger, you could compute partial_ratio only against the rows before the current row. Then, after a reindex, you can build the full matrix using T (transpose) and np.diag:

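import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz  # assumption: fuzz comes from fuzzywuzzy (thefuzz exposes the same API)

def func_pr(row):
    # compare the current row's name only with the names of the rows above it
    return (try_test.loc[:row.name-1, 'name']
                    .apply(lambda name: fuzz.partial_ratio(name, row['name'])))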

#start at index 1 (second row)
pr = (try_test.loc[1:].apply(func_pr, axis=1)
         .reindex(index=try_test.index, 
                  columns=try_test.index)
         .fillna(0)
         .add_prefix('sim_idx')
     )

#complete the result with transpose and diag
pr += pr.to_numpy().T + np.diag(np.ones(pr.shape[0]))*100

# concat
res = pd.concat([try_test, pr.astype(int)], axis=1)

and you get

print(res)
     word      name  sim_idx0  sim_idx1  sim_idx2  sim_idx3  sim_idx4  \
0   apple       dog       100         0        33       100       100   
1  orange       cat         0       100       100         0        33   
2    diet   mad cat        33       100       100        29        43   
3  energy  good dog       100         0        29       100        71   
4    fire   bad dog       100        33        43        71       100   
5    cake   chicken         0        33        14         0         0   

   sim_idx5  
0         0  
1        33  
2        14  
3         0  
4         0  
5       100  
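
For reference, the "compute one triangle, then mirror it" idea can also be sketched without pandas. This is only an illustration under a few assumptions: `pairwise_scores` and `ratio_stand_in` are hypothetical names, the stand-in scorer uses difflib from the standard library so the snippet runs without extra packages (its values differ from `fuzz.partial_ratio`), and the scorer is assumed to be symmetric and to give 100 for identical strings.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio_stand_in(a, b):
    """Hypothetical stand-in scorer on a 0-100 scale (not fuzz.partial_ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def pairwise_scores(names, scorer=ratio_stand_in):
    """Build the full score matrix while calling scorer once per unordered pair."""
    n = len(names)
    scores = [[0] * n for _ in range(n)]
    # Assumption: comparing a string with itself gives 100, so skip those calls.
    for i in range(n):
        scores[i][i] = 100
    # Score each pair (i, j) with i < j once and mirror it across the diagonal.
    for i, j in combinations(range(n), 2):
        scores[i][j] = scores[j][i] = scorer(names[i], names[j])
    return scores


names = ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']
matrix = pairwise_scores(names)
for name, row in zip(names, matrix):
    print(f"{name:>8}", row)
```

For 6 names this makes 15 scorer calls instead of 36. With fuzzywuzzy installed, passing `scorer=fuzz.partial_ratio` should give a matrix like the one above, and `pd.DataFrame(matrix).add_prefix('sim_idx')` could then be concatenated to `try_test` in the same way.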

Upvotes: 1
