Reputation: 252
I have a sample dataframe, which I uploaded to my GitHub Gist (it has 98 rows, but the original data has millions). It has 4 numerical columns, 1 ID column, and 1 column that indicates the cluster ID. I have written a function which I apply to that dataframe in two ways:
1. group by individual and apply the function
2. group by individual and cluster and apply the function

Here is the function in question:
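import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def vectorized_similarity_filtering2(df, cols = ["scaledPrice", "scaledAirlines", "scaledFlights", "scaledTrip"]):
    arr = df[cols].to_numpy()
    # b has shape (n, d, 1) and c has shape (1, d, n), so comparing them covers every pair of rows
    b = arr[..., None]
    c = arr.T[None, ...]
    # mask[i, j] is True when row i is <= row j in every column and < in at least one
    mask = (((b <= c).all(axis=1)) & ((b < c).any(axis=1)))
    # make the mask symmetric: exclude a pair if either row dominates the other
    mask |= mask.T
    sims = np.where(mask, np.nan, cosine_similarity(arr))
    # per row, count the non-excluded rows (itself included) with cosine similarity >= 0.6
    return np.sum(sims >= 0.6, axis = 1)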
What it does in a few steps:
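Reading the code, the function seems to (1) mark every pair of rows where one row is less than or equal to the other in all columns and strictly less in at least one, (2) blank out the cosine similarity for those pairs, and (3) count, per row, how many remaining similarities are at least 0.6. Here is a minimal sketch on a made-up 3x2 array. The toy data and the helper names dominance_mask and cosine_sim are assumptions for illustration, and cosine_sim is a plain NumPy stand-in for sklearn's cosine_similarity so the snippet has no extra dependency.

```python
import numpy as np

def dominance_mask(arr):
    # Hypothetical helper: same masking logic as in the question's function.
    b = arr[..., None]      # shape (n, d, 1)
    c = arr.T[None, ...]    # shape (1, d, n)
    mask = (b <= c).all(axis=1) & (b < c).any(axis=1)
    return mask | mask.T    # symmetric: either row dominates the other

def cosine_sim(arr):
    # Plain NumPy stand-in for sklearn.metrics.pairwise.cosine_similarity.
    unit = arr / np.linalg.norm(arr, axis=1, keepdims=True)
    return unit @ unit.T

# Made-up data: row 0 is dominated by row 1; row 2 is not comparable to either.
toy = np.array([[1.0, 1.0],
                [2.0, 2.0],
                [3.0, 0.5]])

mask = dominance_mask(toy)
sims = np.where(mask, np.nan, cosine_sim(toy))
counts = np.sum(sims >= 0.6, axis=1)
print(counts)  # [2 2 3]
```

Note that each row counts itself, since its similarity with itself is 1 and the diagonal is never masked.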
By logic, each element of the result of applying the function to all rows for every individual (case A) must be no less than the corresponding element of the result of applying it to all rows for every individual and cluster (case B). However, I see that case B has larger counts than case A for some rows. That does not make sense to me, because case B has fewer elements to compare with each other. I hope somebody can explain what is wrong with the code or with my understanding.
Here are steps to replicate the results:
# df being the dataframe
g = df.groupby("individual")
gc = df.groupby(["individual", "cluster"])
caseA = np.concatenate(g.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseB = np.concatenate(gc.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseA >= caseB
array([ True, True, True, True, True, True, True, False, False,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, False,
False, True, True, True, True, True, True, True, True,
True, True, True, True, False, True, True, True, True,
True, True, True, True, True, True, False, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True])
Upvotes: 0
Views: 106
Reputation: 371
The culprit is the order of the cluster groupby, which currently loops through the clusters in this order: [0, 2, 1, 5, 3, 4, 11, 6, 7, 12, 8, 9, 10]. This means that the elements aren't aligned in the comparison caseA >= caseB, so you are comparing the similarity of different rows to each other.
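To see the misalignment on something small, here is a minimal sketch with a made-up four-row frame. The toy data and the row_id column are assumptions, not taken from the Gist. By default pandas groupby sorts the group keys and keeps the original row order inside each group, so concatenating per-group results can visit the rows in a different order for the two groupings.

```python
import pandas as pd

# Made-up frame: one individual whose clusters are not in sorted order.
toy = pd.DataFrame({
    "individual": [1, 1, 1, 1],
    "cluster":    [0, 2, 1, 2],
    "row_id":     ["a", "b", "c", "d"],
})

# Order in which rows are visited when the per-group results are concatenated.
order_a = [rid for _, grp in toy.groupby("individual")
           for rid in grp["row_id"]]
order_b = [rid for _, grp in toy.groupby(["individual", "cluster"])
           for rid in grp["row_id"]]

print(order_a)  # ['a', 'b', 'c', 'd']
print(order_b)  # ['a', 'c', 'b', 'd']
```

Position 1 holds row b in the first result but row c in the second, so an elementwise comparison of the two arrays mixes up rows.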
One solution is to sort your dataframe first, so that your function on the cluster groupby returns values in the same order as on the individual groupby, like this:
df = df.sort_values(by=['cluster'])
Then it should work!
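If you would rather not depend on row order at all, another option is to attach each group's result to the original index and compare by label. This is only a sketch under assumptions: per_row_counts is a hypothetical helper, the toy frame is made up, and count_close is a simple stand-in for your similarity function (it counts rows in the group whose value is within tol of the current row).

```python
import numpy as np
import pandas as pd

def count_close(grp, col="value", tol=1.0):
    # Stand-in for the real function: per row, count group rows within tol.
    v = grp[col].to_numpy()
    return np.sum(np.abs(v[:, None] - v[None, :]) <= tol, axis=1)

def per_row_counts(df, keys, func):
    # Hypothetical helper: label each group's result with the group's own
    # index, then restore the original row order of df.
    parts = [pd.Series(func(grp), index=grp.index)
             for _, grp in df.groupby(keys)]
    return pd.concat(parts).reindex(df.index)

# Made-up data with clusters in unsorted order.
toy = pd.DataFrame({
    "individual": [1, 1, 1, 1, 2, 2],
    "cluster":    [0, 2, 1, 2, 1, 0],
    "value":      [0.0, 0.5, 5.0, 1.2, 3.0, 3.4],
})

case_a = per_row_counts(toy, "individual", count_close)
case_b = per_row_counts(toy, ["individual", "cluster"], count_close)

# Both Series share toy's index, so this compares each row with itself.
aligned_ok = bool((case_a >= case_b).all())
print(aligned_ok)  # True
```

With your data you would pass vectorized_similarity_filtering2 instead of count_close. The comparison then holds row by row without sorting the dataframe first.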
Upvotes: 2