makio

Reputation: 29

Sunspot/rails configuration for multi-core (for different language docs) Solr 5 in one environment

I created two cores in Solr 5.1, one for English documents and one for Japanese documents, and I am wondering how to set up Sunspot/Rails to choose a core based on the locale selected in my Rails app.

The default sunspot.yml configures one core for each of the production, development, and test environments, but in my case there are two cores in one environment.

Is it possible to handle multiple cores under one environment with Sunspot?

By URL, I can query each core in its own language as shown below, so I am still looking for a configuration that selects the core by the user's locale.

server:port/solr/#/EN_core/query?q=text

server:port/solr/#/JP_core/query?q='テキスト'

Upvotes: 2

Views: 743

Answers (1)

makio

Reputation: 29

I figured out how to index multilingual documents in a single Solr instance and search them by a specified language from Sunspot/Rails. This method uses different fields instead of different cores for each language, so it is not a direct answer to my question, but it is a working example of handling multilingual documents with Sunspot/Solr/Rails.

For example, the field to index and search is “description” on the Entry model. Some entries have descriptions in English and the others in Japanese. I use Solr's language detection during indexing (https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing) together with copyField to work around Sunspot's behavior of appending “_text” to searchable field names.

  1. Add empty string fields “description_en” and “description_ja” to the Entry model with a Rails migration. It may sound strange, but these empty fields let Sunspot search the documents in either English or Japanese. The commands look like the following, but the migration took a long time for more than 10 million records. I should consider other methods here - https://www.onehub.com/blog/2009/09/15/adding-columns-to-large-mysql-tables-quickly/

     rails generate migration AddLanguageHolderToEntry description_en:string description_ja:string
     rake db:migrate
    
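  2. Add a searchable block to the Entry model

    class Entry < ActiveRecord::Base
      # The text fields become description_text, description_en_text and
      # description_ja_text in Solr.
      searchable do
        text :description, :description_en, :description_ja
      end
    end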
    
  3. Configure solrconfig.xml to enable Solr's language detection during indexing.

Add the following updateRequestProcessorChain. langid.fl uses “description_text” instead of “description” because Sunspot appends “_text” to the field name.

 <updateRequestProcessorChain name="langid">
   <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
     <bool name="langid">true</bool>
     <str name="langid.fl">description_text</str>
     <str name="langid.whitelist">en,ja</str>
     <bool name="langid.map">true</bool>
     <str name="langid.langField">language</str>
     <str name="langid.fallback">en</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
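
The naming behind langid.fl can be sketched in plain Ruby. This is only an illustration, not Sunspot or Solr code: the helper names `sunspot_text_field` and `langid_mapped_field` are hypothetical, and the second helper assumes Solr's default langid.map behavior of appending “_<language code>” to the field name.

```ruby
# Hypothetical helpers that only mimic the field naming described above.

# Sunspot appends "_text" to a `text` field declared in `searchable`.
def sunspot_text_field(name)
  "#{name}_text"
end

# With langid.map enabled, the detected language code is appended to the
# field listed in langid.fl (assumed default mapping).
def langid_mapped_field(solr_field, lang)
  "#{solr_field}_#{lang}"
end

source = sunspot_text_field(:description)
puts source                             # the name to put in langid.fl
puts langid_mapped_field(source, "ja")  # where a Japanese description ends up
```

The second name is the one that the schema.xml step has to declare and copy from.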

I also added the langid chain to the “/update” and “/update/extract” requestHandlers as follows.

<requestHandler name="/update" class="solr.UpdateRequestHandler">
 <lst name="defaults">
   <str name="update.chain">langid</str>
 </lst>
</requestHandler>

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
 <lst name="defaults">
   <str name="lowernames">true</str>
   <str name="uprefix">ignored_</str>
   <str name="captureAttr">true</str>
   <str name="fmap.a">links</str>
   <str name="fmap.div">ignored_</str>
   <str name="update.chain">langid</str>
 </lst>
</requestHandler>

Check the paths to the libraries.

  <lib dir="/path to/contrib/langid/lib/" regex=".*\.jar" />
  <lib dir="/path to/dist/" regex="solr-langid-\d.*\.jar" />

  4. Configure schema.xml

Add fields for “description”. “_text_en” and “_text_ja” hold the output of Solr's language detection. “_en_text” and “_ja_text” are for indexing and searching by Sunspot.

   <field name="description_text_en" type="text_en" indexed="false" stored="true"/>
   <field name="description_en_text" type="text_en" indexed="true" stored="false"/>

   <field name="description_text_ja" type="text_ja" indexed="false" stored="true"/>
   <field name="description_ja_text" type="text_ja" indexed="true" stored="false"/>

A field is also needed for the detected language (“language” in langid.langField above).

These copyField rules are set for searching.

<copyField source="description_text_en" dest="description_en_text" />
<copyField source="description_text_ja" dest="description_ja_text" />
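
The two copyField lines follow one pattern per language, so they can be generated rather than typed by hand. A minimal sketch in plain Ruby: `copy_field_xml` is a hypothetical helper, not part of Sunspot or Solr, and it assumes the naming used in this answer (“<field>_text_<lang>” from language detection, “<field>_<lang>_text” for Sunspot).

```ruby
# Hypothetical generator for the copyField rules used in this answer.
def copy_field_xml(field, langs)
  langs.map do |lang|
    source = "#{field}_text_#{lang}"  # written by language detection
    dest   = "#{field}_#{lang}_text"  # searched by Sunspot as :<field>_<lang>
    %(<copyField source="#{source}" dest="#{dest}" />)
  end
end

# Prints one copyField line per language.
puts copy_field_xml("description", %w[en ja])
```

Generating the lines this way keeps the source and dest names consistent if another language is added to langid.whitelist later.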

The “text_en” and “text_ja” fieldTypes are needed in schema.xml. I omit their detailed configuration here, but they use standard analyzers.

<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">.....
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">.....

  5. Reindex from Sunspot

    bundle exec rake sunspot:reindex
    
  6. Search documents (for testing)

    rails console
    

For English documents:

@search = Entry.search do
  fulltext 'keyword_en' do
    fields(:description_en)
  end
end

For Japanese documents:

@search = Entry.search do
  fulltext 'キーワード' do
    fields(:description_ja)
  end
end

@search.results
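
To tie this back to choosing by the user's locale, the field can be picked from the locale before running the search. A minimal sketch in plain Ruby: `description_field_for` and the `LANGUAGE_FIELDS` table are hypothetical names, and falling back to English is an assumption that mirrors langid.fallback.

```ruby
# Hypothetical lookup from a locale to the Sunspot field declared in `searchable`.
LANGUAGE_FIELDS = {
  en: :description_en,
  ja: :description_ja
}.freeze

# Returns the field for the locale, falling back to English (assumption).
# Region suffixes such as "ja-JP" are reduced to the language part.
def description_field_for(locale)
  lang = locale.to_s.split(/[-_]/).first.to_s.downcase.to_sym
  LANGUAGE_FIELDS.fetch(lang, LANGUAGE_FIELDS[:en])
end

p description_field_for(:ja)
p description_field_for("en-US")
p description_field_for(:fr)
```

In the Rails app, the result could be stored in a local variable first, for example `field = description_field_for(I18n.locale)`, and then passed as `fields(field)` inside the fulltext block.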

As you can see, this is an ad-hoc method, and I welcome any comments on it.

Upvotes: 1
