Reputation: 2708
I'm using Solr 5.5.2 and Lucene 5.5.2 under the hood.
What I'm trying to do is create a custom tokenizer that splits text on the slash character.
Here is the code sample:
public class SlashSymbolTokenizer extends CharTokenizer {

    public SlashSymbolTokenizer() {
    }

    public SlashSymbolTokenizer(AttributeFactory factory) {
        super(factory);
    }

    @Override
    protected boolean isTokenChar(int c) {
        return c != '/' && c != '\\'; // 47 and 92: forward slash and backslash
    }
}
schema.xml
<fieldType name="string_with_slash_tokenizer" class="solr.TextField" sortMissingLast="true">
<analyzer>
<tokenizer class="tokenizer.SlashSymbolTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
After applying this tokenizer at index time to a field "color":"Black/white",
I assumed I could then match it by querying something like "color":"black white",
but it doesn't work: the field matches only by the initial value "Black/white".
What is wrong with my implementation? Do you have any ideas?
Thanks a lot!
Upvotes: 0
Views: 2758
Reputation: 52902
Since your tokenizer only tokenizes on /, the "black white" query is kept as a single token with the content black white. Seeing as that token matches neither black nor white, no match is found.
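To illustrate, here is a stand-alone simulation of that splitting behavior (this is not the actual Lucene CharTokenizer, just a minimal stand-in using the same isTokenChar predicate):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeDemo {
    // Same predicate as the custom tokenizer: only '/' and '\' end a token.
    static boolean isTokenChar(int c) {
        return c != '/' && c != '\\';
    }

    // Minimal stand-in for CharTokenizer's splitting loop.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Black/white")); // [Black, white]
        System.out.println(tokenize("black white")); // [black white] -- one token, space kept
    }
}
```

As the output shows, "Black/white" becomes two tokens, but "black white" stays one token containing the space, so it can never equal either indexed term.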
If you want to tokenize on whitespace as well as /, you can either handle that in your own code, or possibly use something like a WordDelimiterFilter. You can also use a WhitespaceTokenizer together with a WordDelimiterFilter configured to split on /, or use a PatternTokenizer and supply your own regular expression to split the text (matching, for example, both / and whitespace).
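As one possible configuration (the field type name here is illustrative; the pattern splits on slashes, backslashes, and whitespace), a PatternTokenizer version of your field type might look like:

```
<fieldType name="text_slash_ws" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- split on '/', '\' and any whitespace run -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[/\\\s]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this analyzer, both "Black/white" and "black white" should produce the tokens black and white at index and query time.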
Use the Analysis page in the Solr admin UI to see exactly how your field is processed and tokenized.
Upvotes: 1