I have this mapping:
{
  "properties": {
    "doc_id": {
      "type": "keyword"
    },
    "repo": {
      "type": "keyword"
    },
    "commit_hash": {
      "type": "keyword"
    },
    "path": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        },
        "path": {
          "type": "text",
          "analyzer": "path_analyzer"
        }
      }
    },
    "language": {
      "type": "keyword"
    },
    "last_indexed_at": {
      "type": "date"
    },
    "chunks": {
      "type": "nested",
      "properties": {
        "chunk_id": {
          "type": "keyword"
        },
        "embedding": {
          "type": "knn_vector",
          "dimension": 1024,
          "method": {
            "name": "hnsw",
            "space_type": "cosinesimil",
            "engine": "lucene"
          }
        },
        "text": {
          "type": "text",
          "analyzer": "default_analyzer"
        },
        "code": {
          "type": "text",
          "analyzer": "code_analyzer"
        },
        "titles": {
          "type": "text",
          "analyzer": "title_analyzer"
        },
        "start_line": {
          "type": "integer",
          "index": false
        },
        "end_line": {
          "type": "integer",
          "index": false
        }
      }
    }
  }
}
Search query:
from typing import Any, Dict, List

# Method of my search client class; self.scoring_script holds the Painless source.
def _build_search_query(self, query: str, embedding: List[float], size: int) -> Dict[str, Any]:
    return {
        "size": size,
        "query": {
            "nested": {
                "path": "chunks",
                "query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": self.scoring_script,
                            "params": {
                                "query_vector": embedding
                            },
                            "lang": "painless"
                        }
                    }
                }
            }
        },
        "_source": ["repo", "path", "language", "chunks", "commit_hash", "doc_id"]
    }
How do I access the embedding field inside chunks (which is an array) to compute cosine similarity?
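To make the goal concrete, this is roughly the computation I have in mind, sketched in plain Python on the client side. The names cosine_similarity and score_document are only for illustration and are not part of OpenSearch; the "max" and "avg" modes are assumptions about how the per-chunk scores could be aggregated.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_document(
    query_vector: Sequence[float],
    chunk_embeddings: Sequence[Sequence[float]],
    mode: str = "max",
) -> float:
    """Score one document by aggregating the per-chunk cosine similarities."""
    scores = [cosine_similarity(query_vector, emb) for emb in chunk_embeddings]
    if not scores:
        return 0.0
    if mode == "max":
        return max(scores)
    if mode == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unsupported aggregation mode: {mode}")
```

What I am looking for is the equivalent of score_document, but computed inside the search query rather than after fetching the documents.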
For testing, I was using this Painless script:
if (doc.containsKey("chunks.embedding")) {
    return doc['chunks.embedding'].length;
} else {
    return 0.0;
}
But this always returns 1.0, even though my embeddings have 1024 dimensions. I have tried multiple combinations, such as doc['chunks']['embedding'], but nothing seems to work. Ideally, I want to iterate over the embeddings of all chunks and aggregate the score somehow.
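One direction that might work, as an untested sketch: since the script_score sits inside the nested query, the script presumably runs once per chunk, so doc['chunks.embedding'] would refer to a single chunk's vector (which may be why .length reports 1, the number of values, rather than 1024). Under that assumption, the k-NN plugin's Painless function cosineSimilarity could score each chunk, and the nested query's score_mode parameter could do the aggregation. The helper name build_nested_cosine_query and the constant COSINE_SCRIPT are hypothetical, and I have not verified this against a live cluster.

```python
from typing import Any, Dict, List

# Assumption: the k-NN plugin's Painless extension cosineSimilarity is available.
# Adding 1.0 keeps the score non-negative, which script_score requires.
COSINE_SCRIPT = "1.0 + cosineSimilarity(params.query_vector, doc['chunks.embedding'])"


def build_nested_cosine_query(
    embedding: List[float], size: int, score_mode: str = "max"
) -> Dict[str, Any]:
    """Build a nested script_score query that scores every chunk separately.

    score_mode tells the nested query how to combine the per-chunk scores
    into one document score (for example "max", "avg" or "sum").
    """
    return {
        "size": size,
        "query": {
            "nested": {
                "path": "chunks",
                "score_mode": score_mode,
                "query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "lang": "painless",
                            "source": COSINE_SCRIPT,
                            "params": {"query_vector": embedding},
                        },
                    }
                },
            }
        },
        "_source": ["repo", "path", "language", "chunks", "commit_hash", "doc_id"],
    }
```

If this is the right idea, the iteration over chunks would be handled by the nested query itself rather than by a loop inside the script.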