Steven Matthews

Reputation: 11275

Tokenizing content of a field differently depending on language - Apache Solr

I've got a field in Apache Solr called "content" that is currently indexed and tokenized as English. That isn't always correct: sometimes it contains Japanese.

Is there any way to process this field differently depending on the language? Perhaps with something like fq="language:japanese" (pseudocode)?

What's the best way to handle multiple languages in a single field?

We currently have a second field with the same content that is configured for Japanese, but we'd really like the processing to happen on this one field.

Upvotes: 0

Views: 228

Answers (1)

EricLavault

Reputation: 16035

Have a look at Solr's language detection feature. It supports automatically renaming/mapping fields according to the detected language, along with other advanced parameters.

In your case, one idea would be to map content to content_en or content_ja according to the language detected in content. Here is an example of the UpdateRequestProcessor definition in solrconfig.xml:

 <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <bool name="langid">true</bool>  <!-- enable language identification -->
   <str name="langid.fl">content</str>  <!-- list of fields to run detection on -->
   <str name="langid.langField">language</str>  <!-- field that stores the detected language code -->
   <str name="langid.whitelist">en,ja</str>  <!-- languages allowed as detection results -->
   <bool name="langid.map">true</bool>  <!-- map fields to <field>_<langcode>, e.g. content_ja -->
   <str name="langid.map.lcmap">en_GB:en en_US:en</str>  <!-- custom mapping of language codes to suffixes -->
 </processor>
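
Note that the processor only takes effect as part of an update request processor chain. A minimal sketch, assuming a chain named langid (a hypothetical name chosen for illustration) that ends with the usual log and run processors:

```xml
<!-- solrconfig.xml: "langid" is an illustrative chain name, not a required one -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <!-- same langid.* parameters as in the definition above -->
    <bool name="langid">true</bool>
    <str name="langid.fl">content</str>
    <str name="langid.langField">language</str>
    <str name="langid.whitelist">en,ja</str>
    <bool name="langid.map">true</bool>
  </processor>
  <!-- these two must come last so that documents are actually logged and indexed -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain can then be selected per request with the update.chain=langid parameter, or made the default by adding default="true" to the chain element.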

You will have to update schema.xml so that both content_en and content_ja are defined, and ensure that each is bound to the corresponding field type for indexing.
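
As a rough sketch of what those schema.xml entries could look like, assuming your schema already defines text_en and text_ja field types (as Solr's example schemas do; adjust the type names if yours differ):

```xml
<!-- schema.xml: the type names text_en / text_ja are assumptions taken from Solr's example schemas -->
<field name="content_en" type="text_en" indexed="true" stored="true"/>
<field name="content_ja" type="text_ja" indexed="true" stored="true"/>

<!-- receives the detected language code written by langid.langField -->
<field name="language" type="string" indexed="true" stored="true"/>
```

At query time you would then search both fields, for example with the edismax parser and qf=content_en content_ja, or restrict results with fq=language:ja.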

Upvotes: 1
