Reputation: 1338
We are using Lucene on .NET, and we need a way to implement a search that is "collation agnostic". I do not know if this is the right term, but what we need is this: if I have a user called [Žuf], I want to be able to find him by entering [zuf]. It should also work in the other direction: if the user name is [zuf] and I enter [Žuf], I still want to find him. There is always the manual way of stripping all such characters and creating an index on the result, but I would prefer something smarter.
Any ideas on this?
Thanks, Almir
Upvotes: 1
Views: 589
Reputation: 471
Lucene for Java contains a filter that does the job: ICUFoldingFilter (http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/analysis/icu/ICUFoldingFilter.html), in the Maven package lucene-icu (at least in version 3.6.1). I don't know whether such a library exists for Lucene.NET, but since it is based on ICU, you should be able to port the code to .NET.
What ICUFoldingFilter is:
A TokenFilter that applies search term folding to Unicode text, applying foldings from UTR#30 Character Foldings.
This filter applies the following foldings from the report to unicode text:
- Accent removal
- Case folding
- Canonical duplicates folding
- Dashes folding
- Diacritic removal (including stroke, hook, descender)
- Greek letterforms folding
- Han Radical folding
- Hebrew Alternates folding
- Jamo folding
- Letterforms folding
- Math symbol folding
- Multigraph Expansions: All
- Native digit folding
- No-break folding
- Overline folding
- Positional forms folding
- Small forms folding
- Space folding
- Spacing Accents folding
- Subscript folding
- Superscript folding
- Suzhou Numeral folding
- Symbol folding
- Underline folding
- Vertical forms folding
- Width folding
Additionally, Default Ignorables are removed, and text is normalized to NFKC. All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.
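To give a feel for the NFKC normalization step mentioned above, here is a minimal sketch in plain Java. It uses only the JDK's `java.text.Normalizer`, not ICU or Lucene, and the class name `NfkcDemo` is made up for illustration. It covers only the normalization part, not the full set of foldings the filter applies.

```java
import java.text.Normalizer;

// Hypothetical demo class (not part of Lucene or ICU).
final class NfkcDemo {
    // Apply Unicode NFKC normalization: compatibility characters such as
    // ligatures and fullwidth letters are mapped to their plain equivalents.
    static String nfkc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature; NFKC expands it to the letters f and i.
        System.out.println(nfkc("\uFB01le"));
        // U+FF3A is FULLWIDTH LATIN CAPITAL LETTER Z; NFKC maps it to "Z".
        System.out.println(nfkc("\uFF3Auf"));
    }
}
```

Note that NFKC alone does not remove accents or fold case, so [Žuf] would still not match [zuf] after this step; the other foldings in the list take care of that.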
Upvotes: 2
Reputation: 5246
Take a look at ASCIIFoldingFilter; combined with a LowerCaseFilter, it should do what you need.
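To illustrate what that filter combination achieves, here is a rough sketch in plain Java that does not use Lucene at all. It approximates the effect with the JDK's `java.text.Normalizer`: decompose the text, drop the combining marks, then lowercase. The class name `FoldDemo` and the method `fold` are made up for this example, and the approach is an assumption-laden stand-in rather than what ASCIIFoldingFilter does internally.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class (not part of Lucene).
final class FoldDemo {
    // Fold a term so that accented and unaccented spellings compare equal.
    static String fold(String input) {
        // NFD splits "Ž" (U+017D) into "Z" plus a combining caron (U+030C).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove all combining marks (Unicode category M).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase with a fixed locale so the result is locale independent.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String indexed = fold("\u017Duf"); // "Žuf" as stored in the index
        String query = fold("zuf");        // what the user typed
        System.out.println(indexed.equals(query)); // prints "true"
    }
}
```

The same folding has to be applied both when indexing and when parsing the query, which is why the search works in both directions. Unlike this sketch, a lookup-table approach also covers letters that have no Unicode decomposition, such as "ø" or "đ"; the decomposition trick leaves those unchanged.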
Upvotes: 1