Reputation: 1530
I want to tokenize a string such as "Best Beat Makers" to generate tokens per word in an almost NGram-like fashion, for example:
IN: "Best Beat Makers"
OUT: ["Best", "Beat", "Makers", "Best Beat", "Best Beat Makers"]
How can I generate the last two tokens ("Best Beat" and "Best Beat Makers")?
The result should not include "Beat Makers" because I only want to tokenize words in a compounding fashion (e.g. word1, word1 + word2, word1 + word2 + word3, etc.) and not in combination (e.g. word1, word1 + word2, word2 + word3, etc.).
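To pin down the output I am after, here is a minimal Python sketch of the compounding behaviour. It is only an illustration of the desired tokens, not Solr code: the function name `compound_tokens` is hypothetical, and plain whitespace splitting stands in for a real tokenizer.

```python
def compound_tokens(text):
    """Return each word, then every growing prefix of two or more words."""
    # Assumption: whitespace splitting stands in for a real tokenizer.
    words = text.split()
    # Single-word tokens, in their original order.
    tokens = list(words)
    # Prefix compounds: word1 + word2, word1 + word2 + word3, ...
    for end in range(2, len(words) + 1):
        tokens.append(" ".join(words[:end]))
    return tokens


print(compound_tokens("Best Beat Makers"))
```

Every multi-word token starts at the first word, which is why "Beat Makers" never appears.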
Currently, I am only able to generate the first three tokens by using StandardTokenizerFactory or ClassicTokenizerFactory, and the traditional NGramTokenizerFactory only works for characters of a word (and is a bit expensive on indexing).
One option I've considered is using StandardTokenizerFactory to get the first three tokens and then creating a copyField to another field that uses a PatternTokenizerFactory with a regex defined to get the last two tokens, but I would prefer to get the tokens I need using only one field if possible.
If you are more familiar with ElasticSearch, I would still like to hear your thoughts since the tokenizers between Solr and ES are more or less similar and might push me in the right direction. Thanks!
Upvotes: 3
Views: 556
Reputation: 8658
Shingle Filter:
This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.
You can also set the following property.
maxShingleSize: (integer, must be >= minShingleSize, default 2) The maximum number of tokens per shingle.
Here is a field type that applies it:
<fieldType name="text_tokens" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
Input: "Welcome to Apache Solr"
Expected output:
Unigram: "Welcome", "to", "Apache", "Solr"
Bigram: "Welcome to", "to Apache", "Apache Solr"
Trigram: "Welcome to Apache", "to Apache Solr"
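To make the shingle grouping above concrete, here is a rough Python simulation of word-level shingling. It is a sketch only, not Solr's implementation: the function name `shingles` and its parameters are hypothetical stand-ins for minShingleSize, maxShingleSize and outputUnigrams, whitespace splitting replaces the real tokenizer, and the result is grouped by shingle size rather than by the position order Solr emits.

```python
def shingles(text, min_size=2, max_size=3, output_unigrams=True):
    """Group word n-grams (shingles) by size for a whitespace-split string."""
    # Assumption: whitespace splitting stands in for StandardTokenizerFactory.
    words = text.split()
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams and 1 not in sizes:
        sizes.insert(0, 1)
    grouped = {}
    for n in sizes:
        # Slide a window of n words across the token list.
        grouped[n] = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grouped


for size, tokens in shingles("Welcome to Apache Solr").items():
    print(size, tokens)
```

Because the window slides across every start position, running the same function on "Best Beat Makers" gives the bigrams "Best Beat" and "Beat Makers" and the trigram "Best Beat Makers".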
Below is the analysis for the text you shared.
Input: Best Beat Makers
Upvotes: 2