Computing similarity of multiple fields with Lucene

Question

For context: I'm building a retriever for similar chess positions. Given a position encoding, I extract some attributes of that position and compare them to the other positions in my database to find the most similar ones. To speed this up (the database will eventually be huge), I'm using the Python version of Lucene to index the positions of all games. These are the fields I use to create each document:

doc = Document()
doc.add(Field('FEN', attributes['FEN'], StringField.TYPE_STORED))
doc.add(Field('static_attributes', static_attr, TextField.TYPE_STORED))
doc.add(Field('pawn_structure', pawn_structure, TextField.TYPE_STORED))
doc.add(Field('center', center_types, TextField.TYPE_STORED))
doc.add(Field('dynamic_attributes', dynamic_attr, TextField.TYPE_STORED))
doc.add(Field('game_id', game_id, StringField.TYPE_STORED))
doc.add(Field('id', str(writer.numDocs()), StringField.TYPE_STORED))

When a new position is given, I query its attributes (the four fields stored as TextField). The most similar documents are returned with a similarity score computed by the BM25 similarity function. This is part of the query function:

single_attributes = ['static_attributes', 'pawn_structure', 'center', 'dynamic_attributes']

# Open the reader and searcher once and reuse them for every field
indexReader = DirectoryReader.open(indexDir)
indexSearcher = IndexSearcher(indexReader)
indexSearcher.setSimilarity(BM25Similarity())

for attr in single_attributes:

    # Create a TopScoreDocCollector to retrieve results
    collector = TopScoreDocCollector.create(20)

    analyzer = WhitespaceAnalyzer()
    parser = QueryParser(attr, analyzer)
    queryString = attributes[attr]
    query = parser.parse(queryString)

    # Perform the search and collect the results in the collector
    indexSearcher.search(query, collector)

    # Retrieve the topDocs from the collector
    topDocs = collector.topDocs()
    hits = topDocs.scoreDocs

My problem is this: when I query each attribute, I get the most similar documents and their scores for that attribute alone (which is expected). What I need is a combined similarity over all the queried fields, and I'm not sure how to achieve this.

My first idea was to take the id of the document most similar on the currently queried attribute, fetch that document by id, and compare the other three fields with BM25. The problem is that BM25 only works with a query; I can't just give it two strings and get their similarity. Next, I wanted to move that specific document to another folder and query again with just that one document in the index. That was also a bad idea, because BM25 computes similarity relative to all the other documents in the index, and if there is only one, the similarity is always the same.

My latest idea was to run the query again after obtaining the id, but this time collect all of the documents in the database:

newCollector = TopScoreDocCollector.create(num_of_all_docs)

After that, I filter them to get just the one with the specific id. I haven't been able to make this work yet: instead of returning all of the documents, it returns only some of them. I suspect the one I'm looking for isn't similar enough to be collected. In any case, I don't like this idea, because fetching all of the documents doesn't seem sensible and will probably be time-consuming with a bigger database.
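
For illustration, the kind of combined score being asked for can be sketched in plain Python: compute BM25 per field against that field's corpus statistics, then sum the per-field scores. This is the textbook BM25 formula, not Lucene's `BM25Similarity` implementation, and the attribute tokens, the `corpus` data, and the function names are all hypothetical. It also shows why two bare strings are not enough: the score needs document frequencies and the average field length from the whole collection.

```python
import math
from collections import Counter


def bm25_field_score(query_terms, doc_terms, field_corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenised field against a tokenised query.

    field_corpus is the list of token lists for this field across all
    documents (assumed non-empty); it supplies the corpus statistics.
    """
    n_docs = len(field_corpus)
    avg_len = sum(len(tokens) for tokens in field_corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if tf[term] == 0:
            continue  # a term absent from the document contributes nothing
        # Document frequency: number of documents whose field contains the term
        df = sum(1 for tokens in field_corpus if term in tokens)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
    return score


def combined_score(query, doc, corpus, fields):
    """Sum the per-field BM25 scores (whitespace tokenisation)."""
    total = 0.0
    for field in fields:
        field_corpus = [d[field].split() for d in corpus]
        total += bm25_field_score(query[field].split(), doc[field].split(), field_corpus)
    return total


# Hypothetical attribute strings, only to exercise the functions
fields = ['pawn_structure', 'center']
corpus = [
    {'pawn_structure': 'isolated_d doubled_c', 'center': 'open'},
    {'pawn_structure': 'isolated_d chain', 'center': 'closed'},
    {'pawn_structure': 'chain', 'center': 'closed'},
]
query = {'pawn_structure': 'isolated_d doubled_c', 'center': 'open'}

scores = [combined_score(query, doc, corpus, fields) for doc in corpus]
```

Here the first document matches the query in both fields, the second in one term of one field, and the third in none, so the summed scores rank them in that order.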

I'm having a hard time understanding Lucene's documentation, so I've mostly drawn on other code examples, but I still haven't found a problem similar to mine. If someone could explain how to construct a query in which the similarities for all four fields are calculated and/or combined, or how to do this another way, that would be great. I hope the terminology and everything else makes sense; I can post more code if needed. Thanks for the help!

**Update:**

I've constructed a BooleanQuery in which I combine all of the attributes like this:

static_parser = QueryParser('static_attributes', analyzer)
pawn_parser = QueryParser('pawn_structure', analyzer)
center_parser = QueryParser('center', analyzer)
dynamic_parser = QueryParser('dynamic_attributes', analyzer)

queryString = (
    f"(static_attributes:{attributes['static_attributes']}) AND "
    f"(pawn_structure:{attributes['pawn_structure']}) AND "
    f"(center:{attributes['center']}) AND "
    f"(dynamic_attributes:{attributes['dynamic_attributes']})"
)

combined_query = BooleanQuery.Builder()
static_query = static_parser.parse(attributes['static_attributes'])
pawn_query = pawn_parser.parse(attributes['pawn_structure'])
center_query = center_parser.parse(attributes['center'])
dynamic_query = dynamic_parser.parse(attributes['dynamic_attributes'])

# Add the parsed queries to the BooleanQuery
combined_query.add(static_query, BooleanClause.Occur.MUST)
combined_query.add(pawn_query, BooleanClause.Occur.MUST)
combined_query.add(center_query, BooleanClause.Occur.MUST)
combined_query.add(dynamic_query, BooleanClause.Occur.MUST)

combined_query = combined_query.build()

But now I'm not getting any results at all.
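
One variation that may be worth trying, offered as a sketch and not a confirmed fix: add the per-field queries as `SHOULD` clauses instead of `MUST`. A `BooleanQuery` sums the scores of its matching scoring clauses, so a document no longer has to match all four fields to be returned, and documents that match in more fields tend to score higher. The sketch assumes PyLucene-style imports and the field names from the question; `non_empty_fields` and `build_combined_query` are hypothetical helper names. Whether the `MUST` clauses are what causes the empty result cannot be determined from the code shown.

```python
FIELDS = ['static_attributes', 'pawn_structure', 'center', 'dynamic_attributes']


def non_empty_fields(attributes, fields=FIELDS):
    """Return (field, text) pairs for the fields that actually have query text."""
    pairs = []
    for field in fields:
        text = attributes.get(field, '').strip()
        if text:
            pairs.append((field, text))
    return pairs


def build_combined_query(attributes, analyzer):
    """Build one BooleanQuery with an optional (SHOULD) clause per field."""
    # Imported here so the helper above can be used without a running JVM;
    # assumes lucene.initVM() has already been called.
    from org.apache.lucene.queryparser.classic import QueryParser
    from org.apache.lucene.search import BooleanClause, BooleanQuery

    builder = BooleanQuery.Builder()
    # Empty fields are skipped so the parser is never handed an empty string
    for field, text in non_empty_fields(attributes):
        parser = QueryParser(field, analyzer)
        # escape() neutralises characters the query syntax treats as operators
        sub_query = parser.parse(QueryParser.escape(text))
        builder.add(sub_query, BooleanClause.Occur.SHOULD)
    return builder.build()
```

With the searcher from the question, this could be used as `hits = indexSearcher.search(build_combined_query(attributes, WhitespaceAnalyzer()), 20).scoreDocs`, where each hit's `score` is then the sum over the fields that matched.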
