How can I create columns that show the respective similarity indices for each row?
This code
def func(name):
    matches = try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name) >= 85), axis=1)
    return [try_test.word[i] for i, x in enumerate(matches) if x]

try_test.apply(lambda row: func(row['name']), axis=1)
returns, for each row, the words whose name matches with a score >= 85. However, I would also be interested in having the score values themselves, comparing each name with all the others.
The dataset is
try_test = pd.DataFrame({'word': ['apple', 'orange', 'diet', 'energy', 'fire', 'cake'],
                         'name': ['dog', 'cat', 'mad cat', 'good dog', 'bad dog', 'chicken']})
Help would be very appreciated.
Expected output (values are just an example)
word    name     sim_index1  sim_index2  sim_index3  ...  sim_index6
apple   dog      100         0
orange  cat                  100
...     mad cat              0.6         100
On the diagonal there is a value of 100 because I am comparing dog with dog, cat with cat, and so on. I might also consider another approach if you think it would be better.
Upvotes: 1
Views: 175
Reputation: 29635
IIUC, you can slightly change your function to get what you want:
def func(name):
    return try_test.apply(lambda row: (fuzz.partial_ratio(row['name'], name)), axis=1)

print(try_test.apply(lambda row: func(row['name']), axis=1))
0 1 2 3 4 5
0 100 0 33 100 100 0
1 0 100 100 0 33 33
2 33 100 100 29 43 14
3 100 0 29 100 71 0
4 100 33 43 71 100 0
5 0 33 14 0 0 100
That said, more than half of the calculation is unnecessary, because the result is a symmetric matrix and the diagonal is always 100. So if your data is bigger, you could compute partial_ratio only against the rows before the current row. Adding a reindex, and then building the full matrix with T (transpose) and np.diag, you can do:
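import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz  # assumption: the question does not say which package provides fuzz

def func_pr(row):
    # compare the current name only with the names in the rows above it
    # (relies on the default 0..n-1 integer index)
    return (try_test.loc[:row.name-1, 'name']
                    .apply(lambda name: fuzz.partial_ratio(name, row['name'])))

# start at index 1 (second row); the first row has no earlier rows
pr = (try_test.loc[1:].apply(func_pr, axis=1)
              .reindex(index=try_test.index,
                       columns=try_test.index)
              .fillna(0)
              .add_prefix('sim_idx')
     )
# complete the result with the transpose (upper triangle) and a diagonal of 100
pr += pr.to_numpy().T + np.diag(np.ones(pr.shape[0]))*100
# concat
res = pd.concat([try_test, pr.astype(int)], axis=1)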
and you get
print(res)
word name sim_idx0 sim_idx1 sim_idx2 sim_idx3 sim_idx4 \
0 apple dog 100 0 33 100 100
1 orange cat 0 100 100 0 33
2 diet mad cat 33 100 100 29 43
3 energy good dog 100 0 29 100 71
4 fire bad dog 100 33 43 71 100
5 cake chicken 0 33 14 0 0
sim_idx5
0 0
1 33
2 14
3 0
4 0
5 100
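For comparison, the same "score each pair once, then mirror" idea can be sketched with itertools.combinations instead of a row-wise apply. This is only an illustration under a few assumptions: pairwise_scores and stand_in_score are hypothetical names that do not appear in the question, and stand_in_score is a difflib-based placeholder used so the snippet runs without a fuzzy-matching package. It is not equivalent to fuzz.partial_ratio, so its numbers will differ from the table above.

```python
from difflib import SequenceMatcher
from itertools import combinations

import numpy as np
import pandas as pd


def stand_in_score(a, b):
    """Hypothetical placeholder scorer (0-100) built on difflib.

    NOT equivalent to fuzz.partial_ratio; it only keeps the sketch runnable.
    Sorting the pair makes the score independent of argument order.
    """
    a, b = sorted((a, b))
    return round(100 * SequenceMatcher(None, a, b).ratio())


def pairwise_scores(names, scorer=stand_in_score, prefix="sim_idx"):
    """Score every unordered pair once and mirror it across the diagonal.

    Assumes scorer(a, b) == scorer(b, a) and scorer(a, a) == 100.
    """
    index = names.index if isinstance(names, pd.Series) else None
    names = list(names)
    n = len(names)
    # The diagonal is 100: every name compared with itself.
    mat = np.eye(n, dtype=int) * 100
    # n * (n - 1) / 2 scorer calls instead of n * n.
    for i, j in combinations(range(n), 2):
        mat[i, j] = mat[j, i] = scorer(names[i], names[j])
    columns = [f"{prefix}{k}" for k in range(n)]
    return pd.DataFrame(mat, index=index, columns=columns)


try_test = pd.DataFrame({
    "word": ["apple", "orange", "diet", "energy", "fire", "cake"],
    "name": ["dog", "cat", "mad cat", "good dog", "bad dog", "chicken"],
})

sim = pairwise_scores(try_test["name"])
res = pd.concat([try_test, sim], axis=1)
print(res)
```

To get the scores from the answer rather than the placeholder ones, pass the real metric, e.g. pairwise_scores(try_test["name"], scorer=fuzz.partial_ratio), assuming fuzz is imported as in your own code. For the six names here that means 15 scorer calls instead of 36.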
Upvotes: 1