zebra

Reputation: 1338

Lucene collation

We are using Lucene on .NET, and we need a way to implement a search that is "collation agnostic". I do not know if this is the right term, but what we need is this: if I have a user called [Žuf], I want to be able to find him by entering [zuf]; and in the other direction, if the user name is [zuf] and I enter [Žuf], I still want to find him. There is always the manual way of stripping all special characters and creating an index on the result, but I would prefer something smarter.

Any ideas on this?

Thanks, Almir

Upvotes: 1

Views: 589

Answers (2)

Guillaume Vauvert

Reputation: 471

Lucene for Java contains a filter that does the job: ICUFoldingFilter (http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/analysis/icu/ICUFoldingFilter.html), in the Maven package lucene-icu (at least in version 3.6.1). I don't know whether such a library exists for Lucene.NET, but since it is based on ICU, you should be able to port the code to .NET.

What ICUFoldingFilter is:

A TokenFilter that applies search term folding to Unicode text, applying foldings from UTR#30 Character Foldings.

This filter applies the following foldings from the report to Unicode text:

- Accent removal
- Case folding
- Canonical duplicates folding
- Dashes folding
- Diacritic removal (including stroke, hook, descender)
- Greek letterforms folding
- Han Radical folding
- Hebrew Alternates folding
- Jamo folding
- Letterforms folding
- Math symbol folding
- Multigraph Expansions: All
- Native digit folding
- No-break folding
- Overline folding
- Positional forms folding
- Small forms folding
- Space folding
- Spacing Accents folding
- Subscript folding
- Superscript folding
- Suzhou Numeral folding
- Symbol folding
- Underline folding
- Vertical forms folding
- Width folding

Additionally, Default Ignorables are removed, and text is normalized to NFKC. All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.

Upvotes: 2

Jf Beaulac

Reputation: 5246

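Take a look at ASCIIFoldingFilter; combined with a LowerCaseFilter, it should do what you need. Apply the same filter chain at both index time and query time, so that [Žuf] and [zuf] reduce to the same term in either direction.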
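To illustrate the effect being described, here is a minimal sketch in plain Java that uses only the JDK. It is not how ASCIIFoldingFilter is implemented, and the class name FoldDemo and the method fold are hypothetical names chosen for this example. It decomposes the text to NFD, drops the combining marks, and lowercases the result, so both spellings of the user name reduce to the same key.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
final class FoldDemo {

    // Decompose to NFD so that accents become separate combining marks,
    // remove those marks, then lowercase in a locale-independent way.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "\u017D" is the letter Z with caron, as in the user name from the question.
        String indexed = fold("\u017Duf");
        String queried = fold("zuf");
        System.out.println(indexed.equals(queried));
    }
}
```

This approach only covers characters that have a canonical decomposition. Letters such as ø have none and pass through unchanged, which is one reason to prefer the ready-made filters where they are available.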

Upvotes: 1
