Reputation: 4040
I have the following 2 rows in my dataframe:
[1, 1.1, -19, "kuku", "lulu"]
[2.8, 1.1, -20, "kuku", "lilu"]
I want to calculate their similarity by comparing each dimension (1 if equal, 0 otherwise) and get the following vector: [0, 1, 0, 1, 0]. Is there any function that takes a vector, performs this "similarity" comparison against all rows, and calculates the mean? In our case it would be 2/5 = 0.4.
Upvotes: 0
Views: 95
Reputation: 26906
I would just use a simple == on NumPy arrays, cast to int to get the vector, and numpy.mean() to get the mean of that vector:
import numpy as np
a = [1, 1.1, -19, "kuku", "lulu"]
b = [2.8, 1.1, -20, "kuku", "lilu"]
res = (np.array(a) == np.array(b)).astype(int)
print(res)
# [0 1 0 1 0]
v = res.mean()
print(v)
# 0.4
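The question also asks about comparing one vector against all rows. A minimal sketch of that, relying on broadcasting, is below. The names rows, query, matches and scores are made up for illustration, and dtype=object is my own choice: without it NumPy converts mixed values to strings, so 1 and 1.0 would count as different.

```python
import numpy as np

# dtype=object keeps the original Python values instead of
# converting every element to a string (an assumption of this sketch).
rows = np.array([
    [1, 1.1, -19, "kuku", "lulu"],
    [2.8, 1.1, -20, "kuku", "lilu"],
    [2.8, 1.1, -20, "kuku", "lulu"],
], dtype=object)
query = np.array([1, 1.1, -19, "kuku", "lulu"], dtype=object)

# Broadcasting compares the query against every row, element by element.
matches = rows == query        # boolean array of shape (3, 5)
scores = matches.mean(axis=1)  # fraction of matching columns per row
print(scores)
```

Each entry of scores is the similarity between query and the corresponding row, so the second entry reproduces the 0.4 from the question.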
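To compare every row against every other row, you can broadcast the array against itself, provided you do not mind computing everything twice (the result is symmetric) and can afford the potentially large intermediate temporary array: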
import numpy as np
arr = np.array([
    [1, 1.1, -19, "kuku", "lulu"],
    [2.8, 1.1, -20, "kuku", "lilu"],
    [2.8, 1.1, -20, "kuku", "lulu"]])
corr = arr[None, :, :] == arr[:, None, :]
score = corr.mean(-1)
print(score)
# [[1.  0.4 0.6]
#  [0.4 1.  0.8]
#  [0.6 0.8 1. ]]
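If the 3-D temporary array is too large, one possible alternative is to fill the score matrix one row at a time, so that each step only needs a temporary with the same shape as the array itself. This is only a sketch of that trade-off (a Python loop in exchange for less memory), not a tuned implementation; n and score are names chosen here.

```python
import numpy as np

arr = np.array([
    [1, 1.1, -19, "kuku", "lulu"],
    [2.8, 1.1, -20, "kuku", "lilu"],
    [2.8, 1.1, -20, "kuku", "lulu"]])

n = arr.shape[0]
score = np.empty((n, n))
for i in range(n):
    # Compare row i against all rows at once; the temporary boolean
    # array has the same shape as arr, not (n, n, number of columns).
    score[i] = (arr == arr[i]).mean(axis=1)
print(score)
```

For the three rows above this should give the same matrix as the broadcast version.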
Upvotes: 1