Reputation: 9447
I want to tokenize field data based on phrases listed in a .txt file. Normally a facet query returns whitespace-tokenized facets, but I want the result to look like the following.
For example, if the field "test_data" contains "aaa bbb-ccc ddd eee", the facets should be:
<lst name="test_data">
<int name="aaa">1</int>
<int name="bbb-ccc">1</int>
<int name="ddd eee">1</int>
</lst>
and somefile.txt would contain "bbb-ccc" and "ddd eee" as phrases.
Thanks
Upvotes: 1
Views: 791
Reputation: 9447
I just found out that KeepWordFilterFactory can do the job. I added this field type to the schema:
<fieldType name="text_keepword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false" enablePositionIncrements="false"/>
</analyzer>
</fieldType>
and this field
<field name="keep_fld" type="text_keepword" indexed="true" stored="true"/>
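To see which tokens this analyzer chain would let through, here is a rough Python approximation of it, not Solr itself. It assumes keepwords.txt holds one entry per line and that matching is case-sensitive (as with ignoreCase="false"); the function name and the keep-word list are hypothetical. One thing the sketch suggests checking: because the whitespace tokenizer runs first, an entry containing a space, such as "ddd eee", never equals a single token.

```python
def whitespace_then_keepwords(text, keepwords):
    """Approximate WhitespaceTokenizer followed by a keep-word filter."""
    keep = set(keepwords)  # stands in for the entries of keepwords.txt (assumed one per line)
    # Split on runs of whitespace, then keep only tokens that exactly match an entry.
    return [tok for tok in text.split() if tok in keep]


# Hypothetical keep-word list containing the terms from the question.
kept = whitespace_then_keepwords(
    "aaa bbb-ccc ddd eee",
    ["aaa", "bbb-ccc", "ddd eee"],
)
print(kept)
```

In this approximation "aaa" and "bbb-ccc" are kept, while "ddd" and "eee" are dropped because neither token on its own appears in the list.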
Upvotes: 1
Reputation: 731
If you don't want to build your own tokenizer, you could use PatternTokenizerFactory.
For example, suppose you have a list of terms delimited by a semicolon followed by zero or more spaces: mice; kittens; dogs.
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*" />
</analyzer>
</fieldType>
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory
This way you can supply your own regex that accounts for phrases such as bbb-ccc.
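As a rough illustration of both ideas, the Python sketch below mimics the regex behaviour outside Solr. The first function splits on `;\s*`, which is what the tokenizer above does with its default settings. The second builds a pattern that tries the known phrases first and falls back to whitespace-delimited tokens; in Solr this would presumably mean setting group="0" on PatternTokenizerFactory so that the pattern matches tokens rather than delimiters. The function names and the phrase list are hypothetical, and Python's `re` module is standing in for Java's regex engine here.

```python
import re
from collections import Counter


def split_on_delimiter(text, pattern=r";\s*"):
    """Split text on the delimiter pattern, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


def build_phrase_pattern(phrases):
    """Build a regex that matches whole phrases first, then any non-space run."""
    if not phrases:
        return re.compile(r"\S+")
    # Longest phrases first so a longer phrase wins over a shorter prefix.
    alternatives = "|".join(
        re.escape(p) for p in sorted(phrases, key=len, reverse=True)
    )
    # The lookarounds require whitespace (or the text boundary) around a phrase.
    return re.compile(r"(?<!\S)(?:" + alternatives + r")(?!\S)|\S+")


def tokenize_with_phrases(text, phrases):
    """Return tokens in order, keeping listed phrases as single tokens."""
    return build_phrase_pattern(phrases).findall(text)


terms = split_on_delimiter("mice; kittens; dogs")
tokens = tokenize_with_phrases("aaa bbb-ccc ddd eee", ["bbb-ccc", "ddd eee"])
facet_counts = Counter(tokens)  # rough stand-in for per-term facet counts

print(terms)
print(tokens)
print(facet_counts)
```

With the sample data from the question, this yields the three tokens "aaa", "bbb-ccc" and "ddd eee", each with a count of 1. Whether the same pattern behaves identically inside Solr would need to be verified against the analyzer in the admin Analysis screen.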
Upvotes: 0