membersound

Reputation: 86905

Ignore special characters with Lucene?

I'm trying to create a Lucene search index for a bunch of names. I want to be able to search the names ignoring case, umlauts, special characters, whitespace and so on.

Ideally, querying for Robert or Rober Roberts should match R'obert Röbertson.

Which analyzers or filters do I have to apply in Apache Lucene to achieve this?

So far I'm using new StandardAnalyzer(Version.LUCENE_4_9), but that only gives me exact matches.

Moreover, how can I chain analyzers? The IndexWriter configuration takes only a single analyzer:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
new IndexWriterConfig(Version.LUCENE_4_9, analyzer);

Upvotes: 2

Views: 1890

Answers (1)

epoch

Reputation: 16615

There may be a standard way of doing this, but all I can think of is storing a 'sanitized' version of the name in a separate field, with something like this:

String normalized = Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

or simply removing special characters with a regex (note that this drops whitespace and any non-ASCII letter such as ö outright, instead of folding it to o):

String normalized = string.replaceAll("[^A-Za-z]+", "");

and then adding the normalized field to the index:

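    final Document document = new Document();
    // Field.Index is deprecated in Lucene 4.x; new TextField("fieldName", normalized, Field.Store.YES) is the non-deprecated equivalent
    document.add(new Field("fieldName", normalized, Field.Store.YES, Field.Index.ANALYZED));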

In doing this, your original content stays the same, but Lucene can search the normalized field as well. Queries against that field should be normalized the same way before searching.
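To see how the two normalization options differ on the name from the question, here is a small standalone sketch using only the JDK. The class and method names are made up for illustration, and the umlaut is written as a Unicode escape to sidestep source-encoding issues:

```java
import java.text.Normalizer;

class NormalizationComparison {

    // Option 1: decompose accented letters (NFD), then drop the combining marks.
    static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Option 2: drop every character that is not an ASCII letter.
    static String lettersOnly(String s) {
        return s.replaceAll("[^A-Za-z]+", "");
    }

    public static void main(String[] args) {
        // The escape below is the precomposed "o with umlaut".
        String name = "R'obert R\u00f6bertson";
        System.out.println(stripDiacritics(name)); // R'obert Robertson
        System.out.println(lettersOnly(name));     // RobertRbertson
    }
}
```

The first option keeps the apostrophe but folds the umlaut to a plain o; the second removes the apostrophe and the space, but also loses the accented letter altogether.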

UPDATE

OK, so for the normalisation you are going to need multiple steps: first removing diacritical marks, then special characters:

String normalized = Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
        .replaceAll("[^A-Za-z ]+", ""); // <-- note the space

So for the input R'obert Röbertson, the above returns Robert Robertson.
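As a rough sketch of how this could be used for the matching asked about in the question, the normalisation can be wrapped in a helper that also lowercases, and applied to both the stored name and the query. Everything here is hypothetical: NameNormalizer and matches are illustrative names, and the in-memory prefix check only stands in for what a Lucene prefix query on the normalized field would do; it is not Lucene code.

```java
import java.text.Normalizer;
import java.util.Locale;

class NameNormalizer {

    // Strip diacritics, drop everything except ASCII letters and spaces, lowercase.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                .replaceAll("[^A-Za-z ]+", "")
                .toLowerCase(Locale.ROOT);
    }

    // True if every query term is a prefix of at least one token of the name.
    // Stand-in for a prefix search on the normalized field (assumption, not Lucene).
    static boolean matches(String query, String name) {
        String q = normalize(query).trim();
        if (q.isEmpty()) {
            return false;
        }
        String[] nameTokens = normalize(name).trim().split(" +");
        for (String term : q.split(" +")) {
            boolean found = false;
            for (String token : nameTokens) {
                if (token.startsWith(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The escape below is the precomposed "o with umlaut".
        String name = "R'obert R\u00f6bertson";
        System.out.println(normalize(name));                // robert robertson
        System.out.println(matches("Robert", name));        // true
        System.out.println(matches("Rober Roberts", name)); // true
        System.out.println(matches("Alice", name));         // false
    }
}
```

The point of the design is that the same normalize call runs at index time and at query time, so both sides are compared in the same folded form.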

Upvotes: 1
