Reputation: 5793
So I have a hundred or so indexes with different information from various sources. All the data is embedded using ada-002.
Now I'm trying to iterate over the list of indexes and query each:
import pandas as pd
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import VectorizedQuery

index_client = SearchIndexClient(
    endpoint=pierre_itfunds_endpoint, credential=pierre_itfunds_credential
)
# Keep the listing result; SearchClient expects the index name as a string
indexes = index_client.list_index_names()
rows_list = []
for index in indexes:
    search_client = SearchClient(search_service_endpoint, index, credential)
    vector_query = VectorizedQuery(vector=search_vector, k_nearest_neighbors=3, fields="content_vector")
    # Hybrid query: keyword search plus vector search against the same index
    results = search_client.search(
        search_text=query,
        vector_queries=[vector_query],
        select=["title", "text"],
        top=3,
    )
    for row in results:
        rows_list.append({'index': index, 'score': row['@search.score'], 'title': row['title'], 'text': row['text']})
res = pd.DataFrame(rows_list)
Then I get the average score for each index:
grouped = res.groupby('index')['score'].agg(['mean'])
grouped
However, the resulting average scores don't seem consistent:
index_name | avg_score |
---|---|
wrong-idx | 8.3763725667 |
other1-idx | 4.5701991333 |
other2-idx | 4.2485168 |
other3-idx | 3.5756512667 |
CORRECT-idx | 2.5451367667 |
... | ... |
I had hoped that, using the same embedder for all the content, the cosine distance would be consistent across indexes... which says nothing about the score, but still.
Is there a way to run this search, or to normalize the scores, so that the index with the highest score is the one with the most relevant answers?
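For reference, one idea I'm considering is to skip `@search.score` for the comparison and compute the cosine similarity myself, since the same embedder is used everywhere. This is only a rough sketch and not necessarily the recommended approach: it assumes the `content_vector` field is retrievable and added to `select`, and the row shape and the `rank_indexes_by_cosine` helper are hypothetical names I made up for illustration.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def rank_indexes_by_cosine(query_vector, rows):
    """Rank indexes by mean cosine similarity of their returned documents.

    rows: iterable of dicts with 'index' and 'content_vector' keys
    (a hypothetical shape; assumes the vector field was retrieved).
    Returns [(index_name, mean_similarity), ...], best first.
    """
    sims = defaultdict(list)
    for row in rows:
        sims[row["index"]].append(
            cosine_similarity(query_vector, row["content_vector"])
        )
    means = {name: sum(vals) / len(vals) for name, vals in sims.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Toy data standing in for real search results
rows = [
    {"index": "a-idx", "content_vector": [1.0, 0.0]},
    {"index": "a-idx", "content_vector": [0.0, 1.0]},
    {"index": "b-idx", "content_vector": [1.0, 1.0]},
]
ranking = rank_indexes_by_cosine([1.0, 1.0], rows)
```

With the toy data, `b-idx` ranks first because its only document points in the same direction as the query vector. In my real loop, each appended row would also need to carry `row['content_vector']`.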
Upvotes: 0
Views: 63