Retrieve Ngram list with frequencies from Solr

I realised that one can get the top terms from Solr using the following API:
localhost:8983/solr/admin/luke?fl=text&numTerms=5000&wt=json
But this only gives a list of the top unigrams (e.g. "David"), not bigrams (e.g. "David Beckham"), trigrams, etc.
Is there a way to fetch a list of the top bigrams, trigrams, etc. from Solr?

Upvotes: 1

Answers (2)

user404345

Ion has the right idea, but you should use a shingle filter, which builds n-grams from whole words rather than from characters. For example:

<fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5" outputUnigrams="true"
                outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
    </analyzer>
</fieldType>

<field name="ngrams" type="ngram" indexed="true" stored="false" required="false" multiValued="true" />
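To get a feel for what ends up in the index, here is a rough Python sketch of the tokens this analyzer chain should emit. It is only an illustration, not Solr's implementation: the function name `shingles` and its parameters are made up for this example, and it assumes simple whitespace splitting and lowercasing as in the config above.

```python
def shingles(text, min_size=2, max_size=5, output_unigrams=True, sep=" "):
    """Approximate the word shingles produced for one field value.

    Hypothetical helper for illustration only; it mimics whitespace
    tokenizing, lowercasing and shingling, not Solr's actual code.
    """
    tokens = text.lower().split()
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Emit every shingle of min_size..max_size words starting at token i.
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


example = shingles("David Beckham plays football", max_size=3)
print(example)
```

With outputUnigrams="true" the single words are indexed alongside the bigrams, trigrams and so on, so the term list for this field will contain all of them mixed together.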

Then use the terms component against this field:

http://localhost:8983/solr/sample/terms?terms.fl=ngrams
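Since the field mixes unigrams with longer shingles, you will probably want to split the result by n-gram size on the client side. Below is a minimal Python sketch of that. It assumes the default JSON layout of the terms component, where each field maps to a flat list of alternating term and count values; the helper names `fetch_terms` and `group_by_size` are hypothetical, and the core name `sample` is just the one from the URL above. As far as I know the counts are document frequencies unless you ask for something else.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_terms(base_url, field, limit=100):
    """Fetch the flat [term, count, term, count, ...] list for one field.

    Assumes a /terms handler is configured and returns the flat JSON form.
    """
    params = urlencode({
        "terms.fl": field,
        "terms.limit": limit,
        "terms.sort": "count",
        "wt": "json",
    })
    with urlopen(f"{base_url}?{params}") as resp:
        return json.load(resp)["terms"][field]


def group_by_size(flat):
    """Group (term, count) pairs by the number of words in the term."""
    grouped = defaultdict(list)
    for term, count in zip(flat[0::2], flat[1::2]):
        grouped[len(term.split(" "))].append((term, count))
    return dict(grouped)


# Made-up sample data in the assumed response shape, for illustration only.
sample = ["david", 12, "david beckham", 7, "beckham", 7, "david beckham plays", 2]
by_size = group_by_size(sample)
print(by_size[2])
```

Against a live server you would call something like `group_by_size(fetch_terms("http://localhost:8983/solr/sample/terms", "ngrams"))` and read the bigrams from key 2, the trigrams from key 3, and so on.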

Upvotes: 1

Ion Cojocaru

Reputation: 2583

One can declare a field type with the NGram filter like this:

<!-- Analyzers only apply to solr.TextField; solr.StrField is not analyzed -->
<fieldType name="myNGram" stored="false" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="5"/>
    </analyzer>
</fieldType>

and then declare a copy field of type myNGram:

<field name="ngrams" type="myNGram" indexed="true" stored="false" required="false" />

<copyField source="doc_text" dest="ngrams"/>

assuming that the document text is located in the doc_text field. The top terms of the new field can then be requested with:

localhost:8983/solr/admin/luke?fl=ngrams&numTerms=5000&wt=json

This will give you the top n-grams of length 2 to 5. If you want just the bigrams, you can restrict the maxGramSize parameter of the NGramFilterFactory to 2.
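One caveat worth checking before relying on this: NGramFilterFactory builds its n-grams from the characters inside each token, not from sequences of words. The small Python sketch below shows roughly what that means for a single token. The function `char_ngrams` is a made-up illustration, not Solr code, and the order in which Solr emits the grams may differ.

```python
def char_ngrams(token, min_size=2, max_size=5):
    """Approximate the character n-grams generated from one token.

    Hypothetical helper for illustration; only the set of grams is
    meant to match, not the emission order.
    """
    return [
        token[i:i + n]
        for i in range(len(token))
        for n in range(min_size, max_size + 1)
        if i + n <= len(token)
    ]


grams = char_ngrams("david", min_size=2, max_size=3)
print(grams)
```

So with this filter a "bigram" is a two-character fragment such as "da", which is probably not what the question is after when it asks for "David Beckham".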

Upvotes: 2
