Reputation: 11
I have a database (currently a JSON file) of keywords and their embedding data that I created with OpenAI's embeddings. What I am trying to do is a similarity search against an input keyword. In my current flow, when I enter a keyword it is first embedded and then compared with the embeddings of the data in my database. What I have observed is that once I add tens of thousands of entries, the similarity search takes too much time. What is a good solution I can implement to get fast results even when there are millions of embeddings in my database?
I have seen suggestions about implementing FAISS by Meta, but I am not sure whether that is the proper solution (a rough sketch of what I think it would look like is at the end of this post). Below is the similarity function I currently use:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [cosine_similarity(embedding, embedding_query) for embedding in embeddings]
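For reference, the same computation can also be written as a single matrix operation (a sketch that reuses the embeddings and embedding_query names from above). It avoids the per-item Python loop, but it still compares the query against every stored vector, so it only helps so much:

import numpy as np

# Stack all stored embeddings into one (n_vectors x dim) matrix
matrix = np.array(embeddings)
query = np.array(embedding_query)

# All cosine similarities in one vectorized operation
similarities = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))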
Currently the embedding data is read from a JSON file like below:
import json

with open("data_a_b_j.json", "r") as f:
    embedding_data = json.load(f)

# Extract embeddings and corresponding descriptions from the loaded data
embeddings = [item["embedding"] for item in embedding_data]
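This is roughly what I imagine the FAISS version mentioned above would look like (a minimal sketch, assuming the faiss-cpu package is installed and reusing the embeddings and embedding_query variables from the code above); I am not sure it is the right direction:

import numpy as np
import faiss  # pip install faiss-cpu

# Stack the stored embeddings into a float32 matrix of shape (n_vectors, dim)
xb = np.array(embeddings, dtype="float32")
# Normalize the rows so that inner product equals cosine similarity
faiss.normalize_L2(xb)

# Exact inner-product index; for millions of vectors an approximate index
# (e.g. IndexIVFFlat or IndexHNSWFlat) trades some accuracy for speed
index = faiss.IndexFlatIP(xb.shape[1])
index.add(xb)

# Embed the query keyword as before, then ask for the top 10 matches
xq = np.array([embedding_query], dtype="float32")
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 10)  # ids index into the original embeddings list

With an approximate index the search stays fast even at millions of vectors, at the cost of slightly approximate results.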
Upvotes: 1
Views: 155