Reputation: 3463
I want to use Solr's KeepWordFilterFactory, but I can't find an appropriate tokenizer to go with it. My use case: I have a string such as "hi i am coming, bla-bla go out."
From this string I want to keep only entries such as "hi i", "coming", and "bla-bla". Which tokenizer should I use with the filter factory so that any such combination shows up in facets? I have tried several tokenizers, but none of them gives the exact result. I am using Solr 4.0. Is there a tokenizer that tokenizes based on the keep words themselves?
Upvotes: 2
Views: 874
Reputation: 9789
What are your 'rules' for tokenization (splitting long text into individual tokens)? The example above seems to imply that sometimes you have single-word tokens and sometimes multi-word ones ("hi i"). The multi-word case is the problematic one, but you might be able to handle it by adding ShingleFilterFactory, which emits multi-word tokens alongside the original ones, and then keeping only the items you want.
I am not sure whether the KeepWord filter handles multi-word strings correctly. If it does not, you may want to use a special separator character during the shingle step and then use a regex filter to turn it back into a space as the last step.
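To make the idea concrete, here is a rough sketch in plain Python of what such a chain would do to the example string. It is not Solr code: the function names, the "_" separator, and the punctuation trimming are all assumptions made for illustration. In Solr the corresponding steps would presumably be a whitespace-style tokenizer, ShingleFilterFactory with a custom token separator, KeepWordFilterFactory, and a pattern-replace filter at the end.

```python
import re

# Hypothetical keep list, written with the separator instead of a space
KEEP = {"hi_i", "coming", "bla-bla"}


def shingles(tokens, max_size=2, sep="_"):
    """Return each token plus every run of up to max_size adjacent tokens joined by sep."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


def analyze(text, keep=KEEP, max_size=2, sep="_"):
    """Simulate: whitespace tokenizer -> shingle -> keep-word -> regex replace."""
    # Split on whitespace and trim edge punctuation, so "bla-bla" stays whole
    tokens = [t.strip(",.") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    # Emit single tokens and multi-word shingles joined by the separator
    candidates = shingles(tokens, max_size, sep)
    # Keep only the entries that appear in the keep list
    kept = [t for t in candidates if t in keep]
    # Last step: turn the separator back into a space
    return [re.sub(re.escape(sep), " ", t) for t in kept]


print(analyze("hi i am coming, bla-bla go out."))
# ['hi i', 'coming', 'bla-bla']
```

One caveat with this approach: the separator has to be a character that never appears inside a real token, otherwise the final replace step will also rewrite those tokens.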
Upvotes: 1