Salman

Reputation: 9447

How to tokenize phrase in Solr and get facets

I want to tokenize field data based on a list of phrases given in a .txt file. Normally a facet query returns facets tokenized on whitespace, but I want the result to look like the example below.

For example, if the data in the field "test_data" is "aaa bbb-ccc ddd eee", the facets should look like this:

<lst name="test_data">
    <int name="aaa">1</int>
    <int name="bbb-ccc">1</int>
    <int name="ddd eee">1</int>
</lst>

and somefile.txt would contain "bbb-ccc" and "ddd eee" as phrases.

Thanks

Upvotes: 1

Views: 791

Answers (2)

Salman

Reputation: 9447

I just found out that KeepWordFilterFactory can do the job. I added this field type to the schema:

<fieldType name="text_keepword" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="false" enablePositionIncrements="false"/>
    </analyzer>
</fieldType>

and this field:

<field name="keep_fld" type="text_keepword" indexed="true" stored="true"/>

Upvotes: 1

Lukasvan3L

Reputation: 731

If you don't want to build your own tokenizer, you could use PatternTokenizerFactory.

For example, suppose you have a list of terms delimited by a semicolon and zero or more spaces: mice; kittens; dogs.

<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s*" />
  </analyzer>
</fieldType>

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory

In the same way, you can write your own regex that handles phrases such as bbb-ccc.
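
As a rough, untested sketch of that idea: PatternTokenizerFactory also accepts a group attribute, and with group="0" each regex match becomes a token instead of the pattern acting as a delimiter. The field type name text_phrases is made up for illustration, and the phrases are hard-coded in the pattern rather than read from somefile.txt. They are listed before the \S+ fallback so that they are tried first.

```xml
<!-- Sketch only: "text_phrases" is a hypothetical name; phrases are hard-coded, not loaded from a file -->
<fieldType name="text_phrases" class="solr.TextField">
  <analyzer>
    <!-- group="0" emits each whole match as a token; known phrases are tried before the \S+ fallback -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="bbb-ccc|ddd eee|\S+" group="0"/>
  </analyzer>
</fieldType>
```

For "aaa bbb-ccc ddd eee" this pattern should match aaa, bbb-ccc and ddd eee in turn. A longer phrase list means regenerating the pattern, and any phrase containing regex metacharacters would need escaping.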

Upvotes: 0
