Piotr Pradzynski

Reputation: 4535

Searching with Hibernate Search ignoring UTF-8 chars

I've just implemented a full-text search engine based on Hibernate Search.

I'm looking for a solution to one issue. I have texts with Polish (UTF-8) characters, like: "zażółć gęślą jaźń". When I search for "jaźń", everything is OK and the result is found. But when I search for "jazn", nothing is found.

I would like every variant of the term ("jaźń", "jazń", "jaźn", and "jazn") to find the "zażółć gęślą jaźń" text. How can I configure Hibernate Search to do so?

Upvotes: 1

Views: 1123

Answers (1)

Guillaume Smet

Reputation: 10539

You have to define an analyzer to analyze your text before indexing/querying.

See section 1.8 of the Hibernate Search documentation, on analyzers, and section 4.3 for more complete information on analysis.

To fix your issue, the analyzer you define has to include the ASCIIFoldingFilter, which transforms non-ASCII characters to their nearest ASCII equivalent (and probably the LowerCaseFilter too). See this example.
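For illustration, an analyzer definition along these lines might look like the sketch below. It assumes the Hibernate Search 5.x annotation mapping and the Lucene analysis factories that ship with it (package names differ in older versions). The `Article` entity, its `text` field, and the analyzer name `folding` are hypothetical.

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Hypothetical entity; only the analyzer wiring matters here.
@Entity
@Indexed
@AnalyzerDef(
    name = "folding", // assumed name, referenced by @Analyzer below
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        // Lowercase first, then fold accented characters to ASCII.
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
    }
)
public class Article {

    @Id
    @GeneratedValue
    private Long id;

    // Indexed with the folding analyzer defined above.
    @Field
    @Analyzer(definition = "folding")
    private String text;
}
```

Existing documents need to be reindexed after changing the analyzer, since the old index still contains the unfolded terms.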

If you are using the Hibernate Search DSL to build your queries, the analyzer is applied to your search terms automatically. If you build your queries with stock Lucene, you have an example here, which binds the analyzer automatically to the fields.

Note that wildcard queries are not analyzed by default, so if you use wildcards, you'll need to clean up your string before passing it to the query.
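A minimal sketch of such a clean-up step, using only the JDK, could look like this. It is an approximation of what LowerCaseFilter plus ASCIIFoldingFilter do, not the Lucene implementation: it only strips combining marks and maps the Polish l with stroke by hand. The class and method names are made up for this example.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: a JDK-only stand-in for lowercasing + ASCII folding.
final class WildcardTermSanitizer {

    // Lowercases the term and removes diacritics; '*' and '?' pass through unchanged.
    static String sanitize(String rawTerm) {
        String lower = rawTerm.toLowerCase(Locale.ROOT);
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        // Drop the combining marks, keeping only the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // U+0142 (l with stroke) has no canonical decomposition, so map it explicitly.
        return stripped.replace('\u0142', 'l');
    }

    public static void main(String[] args) {
        // "Ja" + z with acute + "*"
        System.out.println(sanitize("Ja\u017A*")); // prints "jaz*"
        // The word "zazolc" written with its Polish diacritics
        System.out.println(sanitize("za\u017C\u00F3\u0142\u0107")); // prints "zazolc"
    }
}
```

The sanitized string can then be passed to the wildcard query, so that it matches the folded terms stored in the index.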

You can see an example of how to sanitize wildcard queries here.

It uses ASCIIFoldingFilter underneath with this sort of code.

Upvotes: 7
