Graham S.

Reputation: 1530

Solr - How to tokenize words in a string in a compounding "word-1, word-1 + word-2, word-1 + word-2 ... word-n" manner?

I want to tokenize a string such as Best Beat Makers to generate tokens per word in an almost NGram-like fashion, for example:

IN:  "Best Beat Makers"
OUT: ["Best", "Beat", "Makers", "Best Beat", "Best Beat Makers"]
                                     ^               ^
                                     |               |
                              How can I generate these tokens?

The result should not include "Beat Makers" because I only want to tokenize words in a compounding fashion (e.g. word1, word1 + word2, word1 + word2 + word3, etc.) and not in every adjacent combination (e.g. word1, word1 + word2, word2 + word3, etc.).
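To make the target output concrete, here is a minimal Python sketch of the compounding behaviour described above. It is only an illustration of the desired result, not a Solr component: the function name `compounding_tokens` is hypothetical, and a plain whitespace split stands in for a real tokenizer.

```python
def compounding_tokens(text: str) -> list[str]:
    """Return each word, then every leading phrase of two or more words."""
    # Assumption: a whitespace split stands in for StandardTokenizerFactory.
    words = text.split()
    # Leading phrases only: word1 + word2, word1 + word2 + word3, ...
    prefixes = [" ".join(words[:n]) for n in range(2, len(words) + 1)]
    return words + prefixes


print(compounding_tokens("Best Beat Makers"))
```

Every multi-word token starts at the first word, which is why "Beat Makers" never appears.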

Currently, I can only generate the first three tokens, using StandardTokenizerFactory or ClassicTokenizerFactory. The traditional NGramTokenizerFactory only works on the characters within a word (and is somewhat expensive at index time).

One option I've considered is using StandardTokenizerFactory to get the first three tokens and then creating a copyField to another field that uses a PatternTokenizerFactory with a regex defined to get the last two tokens, but I would prefer to get the tokens I need using only one field if possible.

If you are more familiar with ElasticSearch, I would still like to hear your thoughts since the tokenizers between Solr and ES are more or less similar and might push me in the right direction. Thanks!

Upvotes: 3

Views: 556

Answers (1)

Abhijit Bashetti

Reputation: 8658

Shingle Filter: this filter constructs shingles, which are token n-grams, from the token stream. It combines runs of adjacent tokens into a single token.

You can also set the following property:

maxShingleSize: (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.

Here is the field type I applied:

<fieldType name="text_tokens" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    </analyzer>
</fieldType>

Input: "Welcome to Apache Solr"

Expected output:

Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"

Below is the analysis for the text you shared.

Input: Best Beat Makers

[image: Solr analysis output for "Best Beat Makers"]
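For readers who want to check the token list without a running Solr instance, here is a rough Python simulation of the shingling step. It is a sketch under stated assumptions, not Solr's implementation: the function `shingles` and its parameter names are hypothetical, chosen to mirror the filter's minShingleSize, maxShingleSize and outputUnigrams attributes, and a whitespace split stands in for the tokenizer. The real filter also emits tokens interleaved by position, whereas this sketch groups them by shingle size.

```python
def shingles(text: str, min_size: int = 2, max_size: int = 2,
             output_unigrams: bool = True) -> list[str]:
    """Approximate word-level shingling: join runs of adjacent words."""
    # Assumption: a whitespace split stands in for StandardTokenizerFactory.
    words = text.split()
    out = list(words) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        # Slide a window of `size` words across the token list.
        for start in range(len(words) - size + 1):
            out.append(" ".join(words[start:start + size]))
    return out


# Mirrors maxShingleSize="3" outputUnigrams="true" from the field type above.
print(shingles("Best Beat Makers", max_size=3))
```

With these settings the simulated output contains "Best", "Beat", "Makers", "Best Beat", "Beat Makers" and "Best Beat Makers". Note that "Beat Makers" is included, so shingling alone yields one more token than the list requested in the question.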

Upvotes: 2
