Vector Embedding using Spark for compute

Question

I have some large Parquet files stored in Iceberg (written with Spark). My objective now is to read them back with Spark into a DataFrame, compute vector embeddings so that the DataFrame gains a new embedding column, and then store that column in a vector database such as Qdrant.

I have had trouble getting this to work, and online documentation on this specific topic is limited. I tried Spark NLP, but it appears to be incompatible with the qdrant-spark connector I used to make Qdrant a write target for Spark. So I am looking for the conventional way to do the following two things:

Compute vector embeddings on a Spark DataFrame using a model like BERT (Word2Vec is insufficient for my needs), extending the DataFrame with a vector column.
Take the resulting embedding column and store it in a vector database such as Qdrant.
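
To make the two steps concrete, here is a minimal sketch of the shape I have in mind, not something I have working. Everything specific in it is an assumption for illustration: the column names `id` and `text`, the model name, and the helper names `embed_partition`, `make_partition_fn`, `add_embeddings` and `write_to_qdrant` are all hypothetical. The Qdrant write options are the ones I understand the qdrant-spark connector to accept and should be checked against the connector version in use. The idea is to load the model once per partition with `mapPartitions` and encode rows in batches.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def embed_partition(
    rows: Iterable[Tuple[int, str]], encode: Encoder, batch_size: int = 64
) -> Iterator[Tuple[int, str, List[float]]]:
    """Yield (id, text, vector) for each (id, text) row of one partition.

    Rows are encoded in batches so the model is called once per batch,
    not once per row. `encode` is any callable mapping texts to vectors.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        vectors = encode([text for _, text in batch])
        for (row_id, text), vec in zip(batch, vectors):
            # Plain Python floats so Spark can serialize the array column.
            yield row_id, text, [float(x) for x in vec]


def make_partition_fn(model_name: str, batch_size: int = 64):
    """Build the function handed to mapPartitions (model name is an assumption)."""

    def run(rows):
        # Imported on the executor, so the model is loaded once per partition
        # and never pickled from the driver.
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer(model_name)
        return embed_partition(rows, model.encode, batch_size)

    return run


def add_embeddings(spark, df, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Return a DataFrame with columns id, text, embedding (hypothetical names)."""
    from pyspark.sql.types import (
        ArrayType, FloatType, LongType, StringType, StructField, StructType,
    )

    schema = StructType([
        StructField("id", LongType(), False),
        StructField("text", StringType(), True),
        StructField("embedding", ArrayType(FloatType()), False),
    ])
    rdd = df.select("id", "text").rdd.mapPartitions(make_partition_fn(model_name))
    return spark.createDataFrame(rdd, schema)


def write_to_qdrant(embedded_df, qdrant_url: str, collection: str) -> None:
    """Append to an existing Qdrant collection.

    Assumption: format and option names as I understand the qdrant-spark
    connector; verify them against the connector version you run.
    """
    (
        embedded_df.write.format("io.qdrant.spark.Qdrant")
        .option("qdrant_url", qdrant_url)
        .option("collection_name", collection)
        .option("embedding_field", "embedding")
        .option("schema", embedded_df.schema.json())
        .mode("append")
        .save()
    )
```

Only `embed_partition` is free of Spark and model dependencies, so it can be checked locally with a stand-in encoder before running anything on a cluster.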

I feel like the distributed nature of Spark is a big obstacle here.


Answers (0)
