Reputation: 61
I have been trying to use the synonyms.txt file and the SynonymFilterFactory that ship out of the box with Solr/Lucene for Indian languages (Hindi, for a POC), but they don't work the way they do for English.
I found a similar question here on Stack Overflow, but it has no resolution yet.
I have already taken care of the following to support Indian-language search with Solr:
1. Changed the browser encoding to UTF-8.
2. Added URIEncoding="UTF-8" to the Connector in server.xml of the Apache Tomcat server.
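For reference, the Tomcat setting from item 2 usually looks something like the sketch below. The port, protocol, and timeout values are assumptions based on Tomcat's stock server.xml, not taken from my actual setup; only the URIEncoding attribute matters here.

```xml
<!-- conf/server.xml: HTTP connector (port/timeouts are assumed stock values) -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

With this in place, Tomcat decodes percent-encoded bytes in the request URI as UTF-8, so a query containing भारत should reach Solr intact.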
For the POC, I tried the following:
1. Created a new field type to support Hindi indexing:
<fieldType name="text_hi" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- normalizes unicode representation -->
<filter class="solr.IndicNormalizationFilterFactory"/>
<!-- normalizes variation in spelling -->
<filter class="solr.HindiNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_hi.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.HindiStemFilterFactory"/>
</analyzer>
</fieldType>
UPDATE: I also tried removing the stemming, after going through the responses by @Mysterion and @Alexandre Rafalovitch on this post:
<fieldType name="text_hi_rev" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
Defined a new field based on the created field type.
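A field definition of this kind would look roughly like the sketch below. The field name content_hi and the indexed/stored flags are hypothetical, chosen only for illustration; the type attribute points at the text_hi field type shown earlier (text_hi_rev could be substituted when testing the reduced analyzer).

```xml
<!-- schema.xml: hypothetical field name; type refers to the text_hi fieldType above -->
<field name="content_hi" type="text_hi" indexed="true" stored="true"/>
```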
Added the following line to the synonyms.txt file:
india,bharat,भारत , हिन्दुस्तान ,hindustan
Indexed the following strings as separate documents:
मैं भारत का रहने वाला हूँ
मैं हिसंदुस्तान का रहने वाला हूँ
मैं india का रहने वाला हूँ
मैं hindustan का रहने हूँ
मैं bharat का रहने हूँ
Expected Behaviour:
When I search for any of the keywords india, bharat, भारत, हिन्दुस्तान, or hindustan, I should get all the documents indexed above.
Actual Behaviour:
1. When searching with the keywords india, hindustan, or bharat, I get just the following results:
मैं india का रहने वाला हूँ
मैं hindustan का रहने हूँ
मैं bharat का रहने हूँ
Any pointers on whether what I am trying is even possible? If it is, what could I be doing wrong here?
Thanks.
Upvotes: 1
Views: 1073
Reputation: 61
After a lot of frustrating hours and help from @Mysterion, I accidentally stumbled upon the solution. Here are the two steps that led to the resolution,
Upvotes: 1