Solr Analyzer PatternReplaceCharFilterFactory is not taken into consideration (maybe because of ngram or multivalued)

Question

Here is my problem.

I have to normalize address data by stripping out ordinal suffixes such as "th" or "st". Example string: 35 West 15th Street

I cannot just use a synonym, because the th/st is part of the "word" 15th, so I need to use solr.PatternReplaceCharFilterFactory.
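To make the intended normalization concrete, here is a minimal Python sketch of the kind of regex replacement involved. The pattern is an assumption for illustration only (the pattern actually used in the schema is not shown here), and the function name is hypothetical. In Solr the same idea would be expressed through the charFilter's `pattern` and `replacement` attributes rather than in Python.

```python
import re

# Assumed pattern: a run of digits followed by an ordinal suffix that ends
# at a word boundary. Only the digits (group 1) are kept.
ORDINAL_SUFFIX = re.compile(r"(\d+)(?:st|nd|rd|th)\b", re.IGNORECASE)


def strip_ordinal_suffix(text: str) -> str:
    """Remove st/nd/rd/th when it directly follows a number."""
    return ORDINAL_SUFFIX.sub(r"\1", text)


if __name__ == "__main__":
    # "West" and "Street" are untouched because their "st" does not follow a digit.
    print(strip_ordinal_suffix("35 West 15th Street"))
```

Requiring digits before the suffix is what keeps words such as "West" and "Street" intact, which a plain synonym or a bare `th|st` replacement would not guarantee.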

Here are my schema entries:

My field is multivalued because I also include the building_name and other text.

It seems that the PatternReplaceCharFilterFactory works when I try it with the admin interface -> analyze, because I get this result when I test with "35 West 15th Street":

PRCF text 35 West 15 Street

for both query and index.

But when I query, I get this output: "building_search_text": [ "259 West 15th Street, 259 West 15th Street", "259 West 15th Street" ],

At query time it also doesn't work as expected. Query: item_type:Building AND building_search_text:(35 West 15th Street)

Here is the output of the query debug: (the th is not stripped)

"debug": {
  "rawquerystring": "item_type:Building AND building_search_text:(35 West 15th Street)",
  "querystring": "item_type:Building AND building_search_text:(35 West 15th Street)",
  "parsedquery": "+item_type:Building +(building_search_text:35 building_search_text:west building_search_text:15th building_search_text:street)",
  "parsedquery_toString": "+item_type:Building +(building_search_text:35 building_search_text:west building_search_text:15th building_search_text:street)",

I'm not sure if it's a bug that could be related to the multivalued field or if I'm doing something wrong.

Does anyone have an idea?

Okke Klein · Accepted Answer

Why not use the WordDelimiterFilterFactory, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory (splitOnNumerics="1"), so that street names like 22nd and 3rd are also split into a number part and a letter part?
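As a rough illustration of what splitting on letter/number transitions does to such tokens, here is a small Python sketch. It only emulates that one behaviour; it is not the Solr filter, which has further options that are ignored here, and the function names are hypothetical.

```python
import re


def split_on_numerics(token: str) -> list[str]:
    """Split a token into maximal runs of digits and runs of letters."""
    return re.findall(r"\d+|[^\W\d_]+", token)


def tokenize(text: str) -> list[str]:
    """Whitespace-tokenize, lowercase, then split each token at letter/digit boundaries."""
    parts: list[str] = []
    for token in text.lower().split():
        parts.extend(split_on_numerics(token))
    return parts


if __name__ == "__main__":
    print(tokenize("35 West 15th Street"))
    print(tokenize("22nd and 3rd"))
```

With this kind of splitting, "15th" and "15" share the token "15", so a query for either form can match the same indexed address without a hand-written suffix pattern.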
