Reputation: 11275
I have a field in Apache Solr called "content". It is currently indexed/tokenized as an English-language field, which isn't always correct: sometimes the content is Japanese.
Is there any way to process this field differently depending on the language? Perhaps with something like fq=language:japanese (pseudocode)?
What's the best way to handle multiple languages in a single field?
We currently have a second field with the same content that is configured for Japanese, but we'd really like all the processing to happen on this one field.
Upvotes: 0
Views: 228
Reputation: 16035
Have a look at Solr's language detection feature. It supports automatically renaming/mapping fields according to the detected language, along with other advanced parameters.
In your case, one idea would be to map content to content_en or content_ja according to the language detected in content. Here is an example of the UpdateRequestProcessor definition in solrconfig.xml:
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
  <bool name="langid">true</bool>                             <!-- enable language detection -->
  <str name="langid.fl">content</str>                         <!-- fields to run detection on -->
  <str name="langid.langField">language</str>                 <!-- field that stores the detected language code -->
  <str name="langid.whitelist">en,ja</str>                    <!-- languages allowed as a detection result -->
  <bool name="langid.map">true</bool>                         <!-- map fields to <name>_<langcode>, e.g. content_ja -->
  <str name="langid.map.lcmap">en_GB:en en_US:en</str>        <!-- custom language code to suffix mapping -->
</processor>
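A processor only runs as part of an update chain, so the definition above has to sit inside an updateRequestProcessorChain. The following is a minimal sketch of that wiring rather than a definitive setup: the chain name "langid" is an assumption made for illustration, and it also assumes the language identification contrib jars are already on Solr's classpath.

```xml
<!-- solrconfig.xml: hypothetical chain named "langid" (the name is an assumption) -->
<updateRequestProcessorChain name="langid">
  <!-- language detection must come before the document is actually indexed -->
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">content</str>
    <str name="langid.langField">language</str>
    <str name="langid.whitelist">en,ja</str>
    <bool name="langid.map">true</bool>
  </processor>
  <!-- standard processors that log and then execute the update -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

With a named chain like this, update requests would select it through the update.chain request parameter (here, update.chain=langid).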
You will have to update schema.xml so that both content_en and content_ja are defined, and make sure each one is bound to the field type that does the appropriate analysis for its language.
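As a rough sketch, the schema.xml entries could look like the following. It assumes field types named text_en and text_ja exist, as they do in Solr's example schema; substitute whatever English and Japanese field types your schema actually defines. The indexed/stored settings are illustrative only.

```xml
<!-- schema.xml: target fields for the language mapping (type names are assumptions) -->
<field name="language"   type="string"  indexed="true" stored="true"/>  <!-- detected language code -->
<field name="content_en" type="text_en" indexed="true" stored="true"/>  <!-- English analysis -->
<field name="content_ja" type="text_ja" indexed="true" stored="true"/>  <!-- Japanese analysis -->
```

Each document then ends up in only one of the two content fields, so queries need to search both (or pick one based on the language field).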
Upvotes: 1