Reputation: 1556
I realised that one can get the top terms from Solr using the following API:
localhost:8983/solr/admin/luke?fl=text&numTerms=5000&wt=json
But this only returns the top unigrams (e.g. "David"), not bigrams (e.g. "David Beckham"), trigrams, etc.
Is there a way to fetch a list of the top bigrams, trigrams, etc. from Solr?
Upvotes: 1
Views: 1397
Reputation:
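Ion has the right idea, but you should use a shingle filter: NGramFilterFactory builds n-grams out of the characters inside each token, whereas ShingleFilterFactory builds them out of adjacent tokens (words). For example: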
<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5" outputUnigrams="true"
            outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
  </analyzer>
</fieldType>
<field name="ngrams" type="ngram" indexed="true" stored="false" required="false" multiValued="true" />
Then use the terms component against this field:
http://localhost:8983/solr/sample/terms?terms.fl=ngrams
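Because outputUnigrams="true" puts single words into the same field, you may want to filter the result by word count on the client side. Here is a rough Python sketch of that, not a definitive client: the helper names are made up for illustration, and it assumes the JSON response uses the flat [term, count, term, count, ...] layout and that terms are separated by a single space (as with tokenSeparator=" " above).

```python
import json
from urllib.parse import urlencode


def terms_url(base, field, limit=20):
    """Build a terms-component URL for one field, most frequent terms first."""
    params = {
        "terms.fl": field,
        "terms.limit": limit,
        "terms.sort": "count",
        "wt": "json",
    }
    return base + "/terms?" + urlencode(params)


def parse_terms(body, field):
    """Turn the (assumed) flat [term, count, ...] list into (term, count) pairs."""
    flat = json.loads(body)["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


def only_ngrams(pairs, n):
    """Keep only terms made of exactly n space-separated words."""
    return [(term, count) for term, count in pairs if len(term.split(" ")) == n]


# Hypothetical response body, for illustration only (not real Solr output).
sample_body = json.dumps(
    {"terms": {"ngrams": ["david", 12, "david beckham", 7, "beckham scores", 3]}}
)

pairs = parse_terms(sample_body, "ngrams")
bigrams = only_ngrams(pairs, 2)
print(terms_url("http://localhost:8983/solr/sample", "ngrams", limit=5))
print(bigrams)
```

To run it against a live core you would fetch terms_url(...) with urllib.request.urlopen and pass the decoded body to parse_terms. If you never need the single words, setting outputUnigrams="false" in the field type avoids the client-side filter.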
Upvotes: 1
Reputation: 2583
One can declare a field type with the n-gram filter like this:
<!-- TextField rather than StrField: StrField is not analyzed, so the filter chain would not run -->
<fieldType
    name="myNGram"
    stored="false"
    class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5"/>
  </analyzer>
</fieldType>
Then declare a field of type myNGram and populate it with a copyField:
<field name="ngrams" type="myNGram" indexed="true" stored="false" required="false" />
<copyField source="doc_text" dest="ngrams"/>
assuming that the document text is located in the doc_text field.
localhost:8983/solr/admin/luke?fl=ngrams&numTerms=5000&wt=json
This will give you the top n-grams of length 2 to 5. If you want just the bigrams, you can set the maxGramSize
parameter of the NGramFilterFactory to 2.
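One caveat worth checking before relying on this: as far as I know, NGramFilterFactory works on the characters of each token, while word-level n-grams come from a shingle filter. A small Python sketch of the difference follows. These functions are simplified stand-ins written for illustration, not Solr's actual implementation, and the order in which Solr emits the tokens may differ.

```python
def char_ngrams(token, min_size=2, max_size=5):
    """Character n-grams of a single token, roughly what an n-gram filter emits."""
    return [
        token[i:i + n]
        for n in range(min_size, max_size + 1)
        for i in range(len(token) - n + 1)
    ]


def shingles(tokens, min_size=2, max_size=5, sep=" "):
    """Word n-grams over adjacent tokens, roughly what a shingle filter emits."""
    return [
        sep.join(tokens[i:i + n])
        for i in range(len(tokens))
        for n in range(min_size, max_size + 1)
        if i + n <= len(tokens)
    ]


print(char_ngrams("david", 2, 3))
print(shingles(["david", "beckham", "scores"], 2, 3))
```

With the first function, the "top n-grams" are fragments such as "da" and "av"; with the second they are phrases such as "david beckham", which is what the question asks for.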
Upvotes: 2