salvezza

Reputation: 115

Hibernate Search wildcard phrase query

How can I configure Lucene with Hibernate Search and write a wildcard query that matches a field on any exact substring of its value? For instance, suppose the field "title" is indexed and there are only two entries for it: "My first wildcard query." and "My second wildcard query.". A query for "irsT WiLdCaRd q" should then return only the first one. The match should also be case-insensitive.

I've tried something like this:

    FullTextSession ftSession = org.hibernate.search.Search.getFullTextSession((Session) em.getDelegate());
    QueryContextBuilder qbc = ftSession.getSearchFactory().buildQueryBuilder();
    EntityContext entityContext = qbc.forEntity(Book.class);
    QueryBuilder qb = entityContext.get();
    org.apache.lucene.search.Query q = qb.keyword().wildcard().onField("title")
            .ignoreAnalyzer().matching("*" + QueryParser.escape("irsT WiLdCaRd q").toLowerCase() + "*").createQuery();
    FullTextEntityManager ftEm = org.hibernate.search.jpa.Search.getFullTextEntityManager(em);
    final FullTextQuery ftq = ftEm.createFullTextQuery(q, Book.class);
    List list = ftq.getResultList();

It doesn't work, because the wildcard query is keyword-oriented and there is no wildcard equivalent for phrase queries. Using a WildcardQuery directly doesn't work either.

Upvotes: 1

Views: 2813

Answers (1)

femtoRgon

Reputation: 33341

Lucene does not support wildcards in phrase queries. However, depending on how you represent the data in your index, there are strategies that let you achieve the same effect.

It appears you are treating your query as a keyword. In that case, you should also treat the field as a keyword when indexing, so that the whole title is stored as a single term and can be searched as one. Phrases and keywords with spaces are very different things to Lucene, and you can't use them interchangeably.
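To illustrate why a keyword-style field makes this work: if the whole title is kept as one lowercased term, a wildcard of the form *fragment* over that term behaves like a case-insensitive substring test. The following is a minimal plain-Java model of that idea, without Lucene or Hibernate Search; the class and method names are hypothetical, and it assumes the field is lowercased but not tokenized at index time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java model (no Lucene): each title is kept as ONE lowercased term,
// which is roughly what an un-tokenized, lowercased field would hold.
final class KeywordFieldModel {

    // Normalize without splitting on whitespace (assumption: lowercase only).
    static String toKeywordTerm(String title) {
        return title.toLowerCase(Locale.ROOT);
    }

    // "*fragment*" against a single term amounts to a substring test on it.
    static List<String> search(List<String> titles, String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (toKeywordTerm(title).contains(needle)) {
                hits.add(title);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles =
                List.of("My first wildcard query.", "My second wildcard query.");
        System.out.println(search(titles, "irsT WiLdCaRd q"));
    }
}
```

In a real index, the same effect would depend on the field's analyzer leaving the title as a single term, and the leading wildcard would still carry the performance cost of scanning every term.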

The better solution, though, may be to rely on scoring to provide the best match on a set of term queries. If you just use a standard analyzer to reduce the query you indicated to a set of three terms (*irsT, WiLdCaRd, and q*), both of the documents you indicated would be found, but the one you want would be returned first, with a significantly higher score. You could narrow the set of acceptable documents somewhat by searching with required terms, like: +title:*irsT +title:WiLdCaRd +title:q*. That would eliminate any matches that do not contain all three terms, though neither the order of the terms nor the presence of other terms would prevent a match.
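As a rough sketch of how the required-terms query string could be assembled from user input: the helper below splits the fragment on whitespace, puts a leading wildcard on the first word and a trailing wildcard on the last, and marks every clause as required. The class and method names are hypothetical. It assumes non-empty input with no query-syntax special characters, and it lowercases the terms on the assumption that the indexed terms are lowercased.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Builds a query string in Lucene query-parser syntax; it does not run it.
final class RequiredTermsQuery {

    // Turns "irsT WiLdCaRd q" into "+title:*irst +title:wildcard +title:q*".
    static String build(String field, String fragment) {
        String[] terms = fragment.trim().toLowerCase(Locale.ROOT).split("\\s+");
        StringJoiner clauses = new StringJoiner(" ");
        for (int i = 0; i < terms.length; i++) {
            String term = terms[i];
            if (i == 0) {
                term = "*" + term;   // the first word may be cut off at its start
            }
            if (i == terms.length - 1) {
                term = term + "*";   // the last word may be cut off at its end
            }
            clauses.add("+" + field + ":" + term); // "+" marks the clause as required
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("title", "irsT WiLdCaRd q"));
    }
}
```

The resulting string still needs a query parser that accepts leading wildcards, and it would also match a title such as "wildcard queries first", since the clauses carry no ordering.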

Also, another note: queries like *irst are rejected by the query parser unless you enable leading wildcards (for example, with QueryParser.setAllowLeadingWildcard(true)). This is generally discouraged if you can avoid it. Searching with leading wildcards can be expected to be very slow, unless you've optimized your index for them (see Solr's ReversedWildcardFilterFactory, for instance).
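The idea behind a reversed-token index can be shown in plain Java: if each term is also stored reversed, a leading-wildcard query such as *irst becomes a prefix lookup for "tsri", which a sorted term list can answer without scanning every term. This is only a simplified model with hypothetical names, not Solr's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simplified model of a reversed-term index backed by a sorted set.
final class ReversedTermIndex {
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Store the term reversed, e.g. "first" is kept as "tsrif".
    void add(String term) {
        reversedTerms.add(reverse(term));
    }

    // Answers "*suffix" by walking the sorted range that starts at reverse(suffix).
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> matches = new ArrayList<>();
        for (String reversed : reversedTerms.tailSet(prefix, true)) {
            if (!reversed.startsWith(prefix)) {
                break; // past the end of the prefix range
            }
            matches.add(reverse(reversed));
        }
        return matches;
    }

    public static void main(String[] args) {
        ReversedTermIndex index = new ReversedTermIndex();
        for (String term : new String[] {"my", "first", "second", "wildcard", "query"}) {
            index.add(term);
        }
        System.out.println(index.endingWith("irst"));
    }
}
```

The trade-off is a larger index, since each term is stored in both directions, in exchange for leading-wildcard lookups that cost about the same as prefix lookups.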

Upvotes: 4
