Francesco Pasa

Reputation: 594

Spark approximate N-nearest neighbor join using cosine similarity

I have two Spark DataFrames, A and B, with the same schema. They contain text and the embedding vector of that text, pre-calculated with a model such as OpenAI ADA v2. Example:

id  text   embedding
1   ...    [0.123, ...] 
2   ...    [0.456, ...]
3   ...    [0.789, ...]
...
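For concreteness, a minimal sketch of how such frames could be built (the literal values and the DOUBLE element type are just placeholders for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both frames share the same schema; embeddings are stored as array columns.
schema = "id LONG, text STRING, embedding ARRAY<DOUBLE>"
A = spark.createDataFrame([(1, "...", [0.123, 0.456]), (2, "...", [0.456, 0.789])], schema)
B = spark.createDataFrame([(12, "...", [0.321, 0.654]), (37, "...", [0.654, 0.987])], schema)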

I would like to do an approximate N-nearest neighbor join using cosine similarity. I want to find, for each row of A, the top N rows of B that match it based on that metric. The output should look like this (for N=3):

A  B    text_A  text_B  rank  cosine_distance
1  12   ...     ...     1     0.164
1  37   ...     ...     2     0.346
1  125  ...     ...     3     0.532
2  12   ...     ...     1     0.016
2  123  ...     ...     2     0.095
2  567  ...     ...     3     0.123
...

Optionally, being able to define a minimum similarity would also be welcome.

The goal is basically to find the most similar pieces of text in B for each of the rows in A.

Since my datasets are large, I would like to avoid doing a cross join, computing all pairwise distances, and filtering them with a window function, as that would be very expensive and require a lot of memory (a sketch of that exact approach is below). I would rather have something approximate based on locality-sensitive hashing, probably something similar to MinHashLSH.
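For reference, here is a minimal sketch of the exact cross join + window approach I want to avoid, assuming the array schema above and Spark 3.1+ for the higher-order array functions:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Cosine distance helpers over array columns.
def dot(u, v):
    return F.aggregate(F.zip_with(u, v, lambda x, y: x * y), F.lit(0.0), lambda acc, x: acc + x)

def norm(u):
    return F.sqrt(dot(u, u))

# Materializes |A| * |B| pairs -- exactly the cost I want to avoid.
pairs = A.alias("a").crossJoin(B.alias("b")).withColumn(
    "cosine_distance",
    1 - dot(F.col("a.embedding"), F.col("b.embedding"))
      / (norm(F.col("a.embedding")) * norm(F.col("b.embedding"))),
)

w = Window.partitionBy(F.col("a.id")).orderBy("cosine_distance")
top_n = (
    pairs
    .withColumn("rank", F.row_number().over(w))
    .where(F.col("rank") <= 3)                # N = 3
    .where(F.col("cosine_distance") <= 0.8)   # optional cutoff: distance <= 0.8 means similarity >= 0.2
    .select(
        F.col("a.id").alias("A"), F.col("b.id").alias("B"),
        F.col("a.text").alias("text_A"), F.col("b.text").alias("text_B"),
        "rank", "cosine_distance",
    )
)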

I couldn't find a way to do this (not for cosine similarity, at least) in Spark ML. Is there any decent way?
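For context, the closest built-ins I can see are Spark ML's LSH estimators, but MinHashLSH computes Jaccard distance over binary vectors and BucketedRandomProjectionLSH computes Euclidean distance, so neither gives cosine directly. A sketch of the API shape I was hoping to find a cosine equivalent of (parameter values are arbitrary; array_to_vector needs Spark 3.1+):

from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.functions import array_to_vector

# Spark ML's LSH estimators expect Vector columns, not plain arrays.
A_vec = A.withColumn("features", array_to_vector("embedding"))
B_vec = B.withColumn("features", array_to_vector("embedding"))

# approxSimilarityJoin exists, but the distance here is Euclidean, not cosine.
brp = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes", bucketLength=2.0, numHashTables=3
)
model = brp.fit(A_vec)
joined = model.approxSimilarityJoin(A_vec, B_vec, threshold=1.0, distCol="euclidean_distance")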

Upvotes: 2

Views: 33

Answers (0)
