user1370289

Reputation: 135

Solr analyzer: PatternReplaceCharFilterFactory is not taken into consideration (maybe because of n-grams or a multivalued field)

Here is my problem.

I have to normalize address data by stripping out ordinal suffixes such as "th" or "st". Example string: 35 West 15th Street

I cannot just use a synonym filter because the th/st is part of the "word" 15th, so I need to use solr.PatternReplaceCharFilterFactory.

Here are my schema entries:

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9]{1,})(st |th |ST |TH )" replacement="$1 " />
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" />
            <!--filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
                enablePositionIncrements="true"
            /-->
            <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/-->
        </analyzer>
        <analyzer type="query">
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9]{1,})(st |th |ST |TH )" replacement="$1 " />
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /-->
            <!--filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/-->
        </analyzer>
    </fieldType>

<field name="building_search_text" type="text_ngram" indexed="true" stored="true" multiValued="true"/>

My field is multivalued because I also include the building_name and other text.

It seems that the PatternReplaceCharFilterFactory works when I try it in the admin interface (Analysis screen), because I get this result when I test with "35 West 15th Street":

PRCF text 35 West 15 Street

for both query and index.

But when I query, I get this output:

"building_search_text": [
  "259 West 15th Street, 259 West 15th Street",
  "259 West 15th Street"
],

At query time it also doesn't work as expected. Query: item_type:Building AND building_search_text:(35 West 15th Street)

Here is the debug output of the query (the "th" is not stripped):

"debug": {
  "rawquerystring": "item_type:Building AND building_search_text:(35 West 15th Street)",
  "querystring": "item_type:Building AND building_search_text:(35 West 15th Street)",
  "parsedquery": "+item_type:Building +(building_search_text:35 building_search_text:west building_search_text:15th building_search_text:street)",
  "parsedquery_toString": "+item_type:Building +(building_search_text:35 building_search_text:west building_search_text:15th building_search_text:street)",

I'm not sure whether it's a bug related to the multivalued field or whether I'm doing something wrong.

Does anyone have an idea?

Upvotes: 0

Views: 898

Answers (2)

user1370289

Reputation: 135

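Here is the answer to my own problem.

I used the wrong tokenizer. I also replaced the char filter with a PatternReplaceFilterFactory token filter whose pattern no longer requires a trailing space after the suffix.

Here is the new fieldType definition: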

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.PatternReplaceFilterFactory" pattern="([0-9]{1,})(st|th)\s?" replacement="$1 " replace="all" />
            <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15" />
            <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" /-->
            <!--filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/-->
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.PatternReplaceFilterFactory" pattern="([0-9]{1,})(st|th)\s?" replacement="$1 " replace="all" />
            <!--filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /-->
            <!--filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/-->
        </analyzer>
</fieldType>
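
A possible variant, offered as a sketch rather than a tested configuration: the original char filter can probably be kept if its pattern stops requiring a trailing space. The query debug output in the question suggests that the query parser hands the analyzer one whitespace-separated term at a time (`15th` on its own), so a pattern ending in `th ` has nothing to match. The version below assumes a word boundary (`\b`) is an acceptable stand-in for the space. The `(?i)` flag, the `nd`/`rd` alternatives, and the field type name `text_ngram_ordinal` are illustrative additions, not part of the original schema.

```xml
<!-- Sketch only: "text_ngram_ordinal" is a hypothetical name. -->
<fieldType name="text_ngram_ordinal" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- \b instead of a trailing space, so "15th" matches even when it is the whole input -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(?i)([0-9]+)(st|nd|rd|th)\b" replacement="$1"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
        <!-- Same char filter at query time so query terms are normalized the same way -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(?i)([0-9]+)(st|nd|rd|th)\b" replacement="$1"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

Either way, the stored value returned in search results stays "259 West 15th Street": analysis changes only the indexed terms, not what a `stored="true"` field hands back. Documents also need to be reindexed after the index-time analyzer changes.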

Upvotes: 0

Okke Klein

Reputation: 2549

Why not use a WordDelimiterFilterFactory (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory) with splitOnNumerics="1", so that street names like 22nd and 3rd are also split into a number part and a letter part?
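
The WordDelimiterFilterFactory idea could look roughly like the following. This is a sketch under assumptions: the field type name `text_address` and the stop word file `ordinal_suffixes.txt` (one suffix per line: st, nd, rd, th) are hypothetical, and dropping the suffix tokens with a stop filter is one possible follow-up step that the suggestion itself does not spell out. N-gram generation is left out to keep the example short.

```xml
<!-- Sketch only: "text_address" and "ordinal_suffixes.txt" are hypothetical names. -->
<fieldType name="text_address" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- splitOnNumerics="1" splits "15th" into the tokens "15" and "th" -->
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnNumerics="1"
                generateWordParts="1"
                generateNumberParts="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Drops the leftover suffix tokens listed in the (assumed) stop word file -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="ordinal_suffixes.txt"/>
    </analyzer>
</fieldType>
```

One trade-off to keep in mind: a stop filter containing "st" would also remove a standalone "St" used as an abbreviation for "Street", which may or may not be acceptable for address search.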

Upvotes: 1
