treyfromthebay
treyfromthebay

Reputation: 1

How to make a Hibernate SearchSession return results with unique attributes?

I am working on using the Hibernate SearchSession class in Java to perform a search against a database, the code I currently have to search a table looks something like this:

SearchSession searchSession = Search.session(entityManagerFactory.unwrap(SessionFactory.class).withOptions()
                .tenantIdentifier("locations").openSession());

SearchResult<Location> result = searchSession.search(Location.class)
                .where(  f -> f.bool()
                        .must( f.match()
                                .field("locationName")
                                .matching((phrase)).fuzzy())
                ).fetch(page * limit, limit);

This search works and properly returns results from the database, but there is no uniqueness constraint on the locationName column and the database holds multiple records with the same value in locationName. As a result, when we try to display them on the UI of the application it looks like there are duplicate values, even though they're unique in the database.

Is there a way to make a SearchSession only return a result if another result with an identical value (such as locationName) has not been returned before? Applying a uniqueness constraint to the database table isn't an option in this scenario, and we were hoping there's a way to handle filtering out duplicate values in the session over taking the results from the search and removing duplicate values separately.

Upvotes: 0

Views: 569

Answers (1)

yrodiere
yrodiere

Reputation: 9977

Is there a way to make a SearchSession only return a result if another result with an identical value (such as locationName) has not been returned before?

Not really, at least not at the moment.

If you're using the Elasticsearch backend and are fine with going native, you can insert native JSON into the Elasticsearch request, in particular collapsing.

I think something like this might work:

SearchResult<Location> result = searchSession.search( Location.class )
        .extension( ElasticsearchExtension.get() ) 
        .where(  f -> f.bool()
                        .must( f.match()
                                .field("locationName")
                                .matching((phrase)).fuzzy())
                )
        .requestTransformer( context -> { 
            JsonObject collapse = new JsonObject();
            collapse.addProperty("field", "locationName_keyword")

            JsonObject body = context.body(); 
            body.add( "collapse", collapse );
        } )
        // You probably need a sort, as well:
        .sort(f -> f.field("id"))
        .fetch( page * limit, limit ); 

You will need to add a locationName_keyword field to your Location entity:

@Indexed
@Entity
public class Location {

    // ...

    @Id
    @GenericField(sortable = Sortable.YES) // Add this
    private Long id;

    // ...

    @FullTextField
    @KeywordField(name = "locationName_keyword", sortable = Sortable.YES) // Add this
    private String locationName;

    // ...

}

(You may need to also assign a custom normalizer to the locationName_keyword field, if the duplicate locations have a slightly different locationName (different case, ...))

Note however that the "total hit count" in the Search result will indicate the number of hits before collapsing. So if there's only one matching locationName, but 5 Location instances with that name, the total hit count will be 5, but users will only see one hit. They'll be confused for sure.


That being said, it might be worth having another look at your situation to determine whether collapsing is really necessary here:

As a result, when we try to display them on the UI of the application it looks like there are duplicate values, even though they're unique in the database.

If you have multiple documents with the same locationName, then surely you have multiple rows in the database with the same locationName? Duplication doesn't appear spontaneously when indexing.

I would say the first thing to do would be to step back, and consider whether you really want to query the Location entity, or if another, related entity wouldn't make more sense. When two locations have the same name, do they have a relationship to another, common entity instance (e.g. of type Shop, ...)?

=> If so, you should probably query that entity type instead (.search(Shop.class)), and take advantage of @IndexedEmbedded to allow filtering based on Location properties (i.e. add @IndexedEmbedded to the location association in the Shop entity type, then use the field location.locationName when adding a predicate that should match the location name).

If there is no such related, common entity instance, then I would try to find out why locations are duplicated exactly, and more importantly why that duplication makes sense in the database, but not to users:

  • Are the users not interested in all the locations? Then maybe you should add another filter to your query (by "type", ...) that would help remove duplicates. If necessary, you could even run multiple search queries: first one with very strict filters, and if there are no hits, fall back to another one with less strict filters.
  • Are you using some kind of versioning or soft deletion? Then maybe you should avoid indexing soft-deleted entities or older versions; you can do that with conditional indexing or, if that doesn't work, with a filter in your search query.

If your data really is duplicated (legacy database, ...) without any way to pick a duplicate over another except by "just picking the first one", you could consider whether you need an aggregation instead of full-blown search. Are you just looking for the top location names, or maybe a count of locations by name? Then aggregations are the right tool.

Upvotes: 1

Related Questions