Reputation: 135
Here is my problem.
I have to normalize address data by stripping out ordinal suffixes such as th or st. Example string: 35 West 15th Street
I can't just use synonyms, because the th/st is part of the "word" 15th, so I need to use solr.PatternReplaceCharFilterFactory.
Here are my schema entries:
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9]{1,})(st |th |ST |TH )" replacement="$1 " />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15" />
    <!--filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="lang/stopwords_en.txt"
        enablePositionIncrements="true"
        /-->
    <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/-->
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9]{1,})(st |th |ST |TH )" replacement="$1 " />
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /-->
    <!--filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/-->
  </analyzer>
</fieldType>
<field name="building_search_text" type="text_ngram" indexed="true" stored="true" multiValued="true"/>
My field is multivalued because I also include the building_name and other text.
The PatternReplaceCharFilterFactory seems to work when I try it in the admin interface (Analysis), because I get this result when I test with "35 West 15th Street":
PRCF text 35 West 15 Street
for both query and index.
But when I query, I get this output: "building_search_text": [ "259 West 15th Street, 259 West 15th Street", "259 West 15th Street" ],
At query time it also doesn't work as expected. Query: item_type:Building AND building_search_text:(35 West 15th Street)
Here is the debug output of the query (the th is not stripped):
"debug": {
  "rawquerystring": "item_type:Building AND building_search_text:(35 West 15th Street)",
  "querystring": "item_type:Building AND building_search_text:(35 West 15th Street)",
  "parsedquery": "+item_type:Building +(building_search_text:35 building_search_text:west building_search_text:15th building_search_text:street)",
  "parsedquery_toString": "+item_type:Building +(building_search_text:35 building_search_text:west building_search_text:15th building_search_text:street)",
I'm not sure whether this is a bug related to the multivalued field or whether I'm doing something wrong.
Does anyone have an idea?
Upvotes: 0
Views: 898
Reputation: 135
Here is the answer to my own problem.
I was using the wrong tokenizer.
Here is the new fieldType definition:
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([0-9]{1,})(st|th)\s?" replacement="$1 " replace="all" />
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" />
    <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" /-->
    <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/-->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([0-9]{1,})(st|th)\s?" replacement="$1 " replace="all" />
    <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /-->
    <!--filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/-->
  </analyzer>
</fieldType>
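The pattern above only covers st and th. Here is a minimal sketch of a variant, assuming the filter runs after StandardTokenizerFactory so that each token (for example 15th) reaches it without surrounding whitespace. The pattern is anchored to the whole token and also covers nd and rd. The field type name text_address_sketch is hypothetical, and I have not tested this chain against a specific Solr version.

```xml
<!-- Hypothetical field type name; adjust to your own schema. -->
<fieldType name="text_address_sketch" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Applied per token after lowercasing: "15th" -> "15", "22nd" -> "22", "3rd" -> "3". -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9]+)(st|nd|rd|th)$" replacement="$1" replace="all"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Same normalization at query time so query terms line up with the indexed terms. -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9]+)(st|nd|rd|th)$" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>
```

Note that analysis only changes the indexed terms. The stored value returned in search results will still show 15th.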
Upvotes: 0
Reputation: 2549
Why not use a WordDelimiterFilterFactory (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory) with splitOnNumerics="1", so that street names like 22nd and 3rd are also split into a number part and a letter part?
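A rough sketch of what that could look like, assuming a whitespace tokenizer in front of the filter. The field type name text_address_wdf is hypothetical and the attribute values are only one plausible combination, not a tested configuration.

```xml
<!-- Hypothetical field type name; attribute values are illustrative. -->
<fieldType name="text_address_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splitOnNumerics="1" splits at letter/digit transitions: "15th" -> "15", "th". -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnNumerics="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this approach the suffix is not removed. It stays in the token stream as its own token (th, nd, rd), so a search for 15 can match the number part on its own. Newer Solr releases provide WordDelimiterGraphFilterFactory as the successor to this filter.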
Upvotes: 1