Reputation: 63
I am trying to remove unwanted words, apply stemming, and finally create shingles. However, after removing stop words, it's giving me shingles with "_" in place of the stop words. I tried using PatternReplaceFilterFactory to replace the _, but it's not working. My field type is defined as below:
<fieldType name="common_shingle" class="solr.TextField">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
<filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
</analyzer>
</fieldType>
When I analyse "A brown fox quickly jumps over the lazy dog", it gives me the following result:
How do I remove the _ from the shingle tokens? Also, is there a way to create shingles only from stop words?
Upvotes: 0
Views: 783
Reputation: 11
In Solr's Jira there is an improvement request with an available patch: https://issues.apache.org/jira/browse/SOLR-11604
Compile a new lucene-analyzers-common.jar with this patch and use the skipFillerTokens="true" option in your schema.xml:
<filter class="solr.ShingleFilterFactory" ... skipFillerTokens="true"/>
If you want this patch to be included in the next Solr version, vote for this Jira issue.
Upvotes: 1
Reputation: 11
That's because of the stop words. Set enablePositionIncrements to false and luceneMatchVersion to 4.3.
Replace your StopFilterFactory with this:
<filter class="solr.StopFilterFactory" luceneMatchVersion="4.3" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
Upvotes: 1
Reputation: 52832
The _ is inserted by the ShingleFilter, as it replaces empty position increments with the token _.
If you want to remove the value, you'll have to perform the PatternReplace after the ShingleFilter, as it doesn't exist in the token stream before that.
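A minimal sketch of that reordering, reusing the filters from the question. The LengthFilterFactory line is an assumption added for illustration and is not part of the question: the pattern .*_.* blanks the whole shingle rather than just the underscore, so something has to drop the resulting empty tokens, and the min/max values are arbitrary.

```xml
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <!-- shingles are built first, so the "_" filler now exists in the token stream -->
  <filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3"/>
  <!-- the question's pattern, moved after the ShingleFilter: empties any shingle containing "_" -->
  <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
  <!-- assumed extra step: discard the now-empty tokens (min/max are illustrative) -->
  <filter class="solr.LengthFilterFactory" min="1" max="255"/>
</analyzer>
```

Checking the output on the Analysis screen of the Solr admin UI should show whether the remaining shingles are the ones you expect.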
Elasticsearch exposes an option to select the replacement character as "filler_token", and Solr's implementation seems to use the Lucene implementation directly, so you should be able to use fillerToken to set this yourself. Try setting fillerToken="" in your ShingleFilter definition instead of using the PatternReplaceFilter.
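As a sketch, the shingle filter from the question with the filler overridden might look like the following. This has not been verified against a running Solr. As far as I understand, the token separator (a space by default) is still emitted around the empty filler, so shingles may come out with doubled, leading, or trailing spaces. The PatternReplaceFilterFactory and TrimFilterFactory lines are an assumed cleanup step for that, not something the question used.

```xml
<!-- empty filler instead of the default "_" -->
<filter class="solr.ShingleFilterFactory" outputUnigrams="false" minShingleSize="3" maxShingleSize="3" fillerToken=""/>
<!-- assumed cleanup: collapse whitespace runs left where the filler used to be -->
<filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
<!-- assumed cleanup: strip leading and trailing whitespace from each shingle -->
<filter class="solr.TrimFilterFactory"/>
```

Note that this only hides the filler. The shingles still span the positions of the removed stop words, so a shingle of size 3 may contain fewer than three visible words.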
Upvotes: 0