Aditya Singh

Painless scripting not working with nested field of type array

I have this mapping:

{
  "properties": {
    "doc_id": {
      "type": "keyword"
    },
    "repo": {
      "type": "keyword"
    },
    "commit_hash": {
      "type": "keyword"
    },
    "path": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        },
        "path": {
          "type": "text",
          "analyzer": "path_analyzer"
        }
      }
    },
    "language": {
      "type": "keyword"
    },
    "last_indexed_at": {
      "type": "date"
    },
    "chunks": {
      "type": "nested",
      "properties": {
        "chunk_id": {
          "type": "keyword"
        },
        "embedding": {
          "type": "knn_vector",
          "dimension": 1024,
          "method": {
            "name": "hnsw",
            "space_type": "cosinesimil",
            "engine": "lucene"
          }
        },
        "text": {
          "type": "text",
          "analyzer": "default_analyzer"
        },
        "code": {
          "type": "text",
          "analyzer": "code_analyzer"
        },
        "titles": {
          "type": "text",
          "analyzer": "title_analyzer"
        },
        "start_line": {
          "type": "integer",
          "index": false
        },
        "end_line": {
          "type": "integer",
          "index": false
        }
      }
    }
  }
}

Search query:

def _build_search_query(self, query: str, embedding: List[float], size: int) -> Dict[str, Any]:
        return {
            "size": size,
            "query": {
                "nested": {
                    "path": "chunks",
                    "query": {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": self.scoring_script,
                                "params": {
                                    "query_vector": embedding
                                },
                                "lang": "painless"
                            }
                        }
                    }
                }
            },
            "_source": ["repo", "path", "language", "chunks", "commit_hash", "doc_id"]
        }

How do I access the embedding field inside chunks (which is an array of nested objects) to compute cosine similarity?

For testing, I was using this Painless script:

if (doc.containsKey("chunks.embedding")) {
    return doc['chunks.embedding'].length;
} else {
    return 0.0;
}

But this always returns 1.0, even though my embeddings have 1024 dimensions. I have tried multiple combinations, such as doc['chunks']['embedding'], but nothing seems to work. Ideally, I want to iterate over the embeddings of all chunks and aggregate the scores somehow.
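
To make the intended scoring concrete, here is a minimal Python sketch of the computation, run client-side on a document's chunks rather than inside Painless. The helper names cosine_similarity and score_document are hypothetical, and taking the maximum over chunks is only one assumed choice of aggregation; it is not a claim about how the engine scores nested documents.

```python
import math
from typing import Any, Callable, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_document(
    query_vector: Sequence[float],
    chunks: List[Dict[str, Any]],
    aggregate: Callable[[List[float]], float] = max,  # assumed aggregation
) -> float:
    """Score every chunk that has an embedding, then aggregate into one document score."""
    scores = [
        cosine_similarity(query_vector, chunk["embedding"])
        for chunk in chunks
        if chunk.get("embedding")
    ]
    # A document without any embedded chunks gets a neutral score of 0.0.
    return aggregate(scores) if scores else 0.0
```

Passing a different callable as aggregate (for example, one that averages the list) switches from best-chunk scoring to mean-chunk scoring without touching the similarity code.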
