I have two Spark DataFrames, `A` and `B`, with the same schema. They contain text and the embedding vector of the text, pre-calculated with a model such as OpenAI ADA v2 or similar. Example:
```
id  text  embedding
1   ...   [0.123, ...]
2   ...   [0.456, ...]
3   ...   [0.789, ...]
...
```
I would like to do an approximate N-nearest-neighbour join using cosine similarity: for each row of `A`, find the top N rows of `B` that best match it under that metric. The output should look like this (for N=3):
```
A  B    text_A  text_B  rank  cosine_distance
1  12   ...     ...     1     0.164
1  37   ...     ...     2     0.346
1  125  ...     ...     3     0.532
2  12   ...     ...     1     0.016
2  123  ...     ...     2     0.095
2  567  ...     ...     3     0.123
...
```
Optionally, being able to define a minimum similarity would also be welcome.
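To make the semantics concrete (including the optional minimum-similarity filter), here is a tiny brute-force version in plain Python over hypothetical toy ids and vectors; at scale this is exactly the kind of all-pairs computation I want to avoid:

```python
import math

# Hypothetical toy stand-ins for the rows of A and B: (id, embedding).
A = [(1, [1.0, 0.0]), (2, [0.6, 0.8])]
B = [(12, [0.9, 0.1]), (37, [0.0, 1.0]), (125, [-1.0, 0.0])]

def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

def top_n(A, B, n=3, min_similarity=None):
    out = []
    for a_id, a_vec in A:
        # All-pairs distances, sorted ascending (most similar first).
        cands = sorted((cosine_distance(a_vec, b_vec), b_id) for b_id, b_vec in B)
        if min_similarity is not None:
            cands = [(d, b) for d, b in cands if 1.0 - d >= min_similarity]
        for rank, (d, b_id) in enumerate(cands[:n], start=1):
            out.append((a_id, b_id, rank, round(d, 3)))
    return out

print(top_n(A, B, n=2, min_similarity=0.0))
```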
The goal is basically to find, for each row of `A`, the most similar pieces of text in `B`.
Since my datasets are large, I would like to avoid doing a cross join, calculating all distances, and filtering with a window function, as that would be very expensive and require a lot of memory. I would rather use something approximate based on locality-sensitive hashing, probably along the lines of `MinHashLSH`.
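One thing I believe is relevant (the identity is standard; whether it is the intended way to use Spark ML is my assumption): for L2-normalised vectors, squared Euclidean distance and cosine distance are related by ||u − v||² = 2(1 − cos(u, v)), so a Euclidean nearest-neighbour search on normalised vectors preserves the cosine ranking. A quick sanity check in plain Python:

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    return sum(x * y for x, y in zip(u, v)) / (
        math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

random.seed(0)
u = [random.gauss(0, 1) for _ in range(8)]
v = [random.gauss(0, 1) for _ in range(8)]
un, vn = normalize(u), normalize(v)

sq_euclid = sum((a - b) ** 2 for a, b in zip(un, vn))
# For unit vectors: ||u - v||^2 == 2 * (1 - cos(u, v)).
assert abs(sq_euclid - 2 * (1 - cosine(u, v))) < 1e-12
print("identity holds")
```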
I couldn't find a way to do this (not for cosine similarity, at least) in Spark ML. Is there any decent way?