Searching with Hibernate Search ignoring UTF-8 chars

Question

I've just implemented a full-text search engine built on Hibernate Search.

I'm looking for a solution to one issue. I have texts with Polish (UTF-8) characters, such as "zażółć gęślą jaźń". When I search for "jaźń", everything works and the result is found. But when I search for "jazn", nothing is found.

I would like to be able to search for any of the variants "jaźń", "jazń", "jaźn", and "jazn" and still find the text "zażółć gęślą jaźń". How can I configure Hibernate Search to do this?

Guillaume Smet · Accepted Answer

You have to define an analyzer to analyze your text before indexing/querying.

See section 1.8 of the Hibernate Search documentation for an introduction to analyzers, and section 4.3 for more complete information on analysis.

To fix your issue, the analyzer you define has to include the ASCIIFoldingFilter, which transforms non-ASCII characters to their nearest ASCII equivalent (and probably the LowerCaseFilter too). See this example.
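For illustration, here is a minimal sketch of such an analyzer definition. It assumes the annotation-based mapping of Hibernate Search 5.x with the Lucene 5 factory classes (older versions ship the factories in different packages). The entity `Article`, the field `content`, and the analyzer name `foldingAnalyzer` are hypothetical names chosen for this example.

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Hypothetical entity; only the analyzer wiring matters here.
@Entity
@Indexed
@AnalyzerDef(
    name = "foldingAnalyzer", // hypothetical analyzer name
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        // Lowercase first, then fold accented letters to ASCII.
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
    }
)
public class Article {

    @Id
    @GeneratedValue
    private Long id;

    // The same analyzer is applied at indexing time and, through the
    // Hibernate Search DSL, at query time.
    @Field
    @Analyzer(definition = "foldingAnalyzer")
    private String content;

    // getters and setters omitted
}
```

Documents indexed before the analyzer was changed still contain the unfolded terms, so the index has to be rebuilt for the new analysis to take effect.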

If you are using the Hibernate Search DSL to build your queries, the analyzer is applied to the query text automatically. If you build your queries with stock Lucene, there is an example here that binds the analyzer to the fields automatically.

Note that wildcard queries are not analyzed by default, so if you use wildcards, you'll need to clean up your string before passing it to the query.

You can see an example of how to sanitize your queries for wildcard here.

It uses ASCIIFoldingFilter underneath with this sort of code.
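To make the clean-up step concrete, here is a small sketch of folding a wildcard term by hand before building the query. It is not the linked code and it does not use Lucene's ASCIIFoldingFilter: it is a standard-library approximation based on `java.text.Normalizer`, which is enough for the Polish examples in the question but covers fewer characters than the Lucene filter. The class name `TermFolding` and the method `fold` are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper; a rough stand-in for lowercasing plus ASCII folding.
final class TermFolding {

    private TermFolding() {
    }

    static String fold(String term) {
        // Split each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // Remove the combining marks, leaving only the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // U+0142 and U+0141 (l with stroke) have no decomposition, so map them by hand.
        String folded = stripped.replace('\u0142', 'l').replace('\u0141', 'L');
        return folded.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "ja\u017a\u0144" is the word from the question, written with escapes.
        System.out.println(fold("ja\u017a\u0144")); // prints jazn
        // Fold the user input first, then append the wildcard.
        System.out.println(fold("Za\u017c\u00f3\u0142\u0107") + "*"); // prints zazolc*
    }
}
```

The folded string can then be passed to the wildcard query, so it matches the terms that the folding analyzer wrote to the index.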
