Reputation: 63

Solr stemming, stop words and shingles not giving expected outputs

I am trying to remove unwanted words, apply stemming, and finally create shingles. However, after the stop words are removed, I get shingles with "_" in place of the stop words. I tried using PatternReplaceFilterFactory to replace the _, but it's not working. My field type is defined as follows:

<fieldType name="common_shingle" class="solr.TextField">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
    </analyzer>
</fieldType>

When I analyse "A brown fox quickly jumps over the lazy dog", it gives me the following result:

_ brown fox
brown fox quickli
fox quickli jump
quickli jump _
jump _ _
_ _ lazi
_ lazi dog

How do I remove the _ from the shingle tokens? Also, is there a way to create shingles only from stop words?

Upvotes: 0

Answers (3)

Edans Sandes

Reputation: 11

Solr's Jira has an improvement request with a patch available: https://issues.apache.org/jira/browse/SOLR-11604

Compile a new lucene-analyzers-common.jar with this patch and use the skipFillerTokens="true" option in your schema.xml:

<filter class="solr.ShingleFilterFactory" ... skipFillerTokens="true"/>

If you want this patch to be included in the next Solr version, vote for this Jira issue.

Upvotes: 1

Balu

Reputation: 11

That's because of the stop words. Set enablePositionIncrements to false and luceneMatchVersion to 4.3.

Replace your StopFilterFactory with this:

  <filter class="solr.StopFilterFactory" luceneMatchVersion="4.3" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>

Upvotes: 1

MatsLindh

Reputation: 52832

The _ is inserted by the ShingleFilter, as it replaces empty position increments with the token _.

If you want to remove the value, you'll have to perform the PatternReplace after the ShingleFilter, as it doesn't exist in the token stream before that.
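
A minimal sketch of that ordering, reusing the analyzer from the question: the PatternReplaceFilterFactory is moved below the ShingleFilterFactory, so the pattern sees shingles that already contain the filler. Because the question's pattern blanks the whole token, a LengthFilterFactory is added here as an assumption (it is not in the original config) to drop the emptied tokens; the max value of 255 is arbitrary.

```xml
<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
    <!-- Runs after shingling, so the "_" filler is now present in the tokens -->
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
    <!-- Assumption, not in the original config: drop the tokens emptied above -->
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
</analyzer>
```

For the sample sentence in the question, this would be expected to keep only the shingles that span no removed stop word (brown fox quickli, fox quickli jump). It would also discard any shingle containing a genuine underscore from the input text, so check the result on the Analysis screen before relying on it.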

Elasticsearch exposes an option to select the replacement token as "filler_token", but Solr's implementation seems to use the Lucene implementation directly, so you should be able to use fillerToken to set this yourself. Try setting fillerToken="" in your ShingleFilter definition instead of using the PatternReplaceFilter.
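
As a rough sketch of that suggestion, the shingle filter line from the question could look like the following. Whether fillerToken is accepted depends on the Lucene/Solr version in use, so treat it as something to verify. The TrimFilterFactory line is an assumption added for illustration: the token separator is still emitted around an empty filler, so shingles may otherwise start or end with spaces.

```xml
<!-- fillerToken="" replaces the default "_" filler with an empty string -->
<filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3" fillerToken=""/>
<!-- Assumption, not part of the original answer: strip leading/trailing spaces left by the empty filler -->
<filter class="solr.TrimFilterFactory"/>
```

Note that this only changes how the gap is rendered. Shingles that span a removed stop word are still produced, just with fewer visible words, and a gap in the middle of a shingle can still leave a doubled space.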

Upvotes: 0
