Reputation: 45
I am working on a project to perform multilingual full-text search using Elasticsearch. The historical training dataset I am using is also multilingual, and I am now trying to configure text analysis with language analyzers and language detection.
1) I am using the following link as a guide, and as written in its first paragraph, I need to install an Inference Ingest Processor. How can I install it? (I am not familiar with Java and am new to Elasticsearch.) https://www.elastic.co/de/blog/multilingual-search-using-language-identification-in-elasticsearch
2) Elasticsearch offers language analyzers for many languages, and I will need to configure analyzers for 8 languages. If I follow this link https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html I will have to create 8 different custom analyzers, which is quite long. Is there a shorter way to write one setting for 8 languages?
Upvotes: 2
Views: 1986
Reputation: 32376
First, as mentioned in the blog, an Inference Ingest Processor is a machine learning (ML) feature, and unless you have a use case for it, you don't need it. It is also part of X-Pack and not of core Elasticsearch, so you might have to enable the X-Pack module, and buy a license if the feature is not included in the basic tier of X-Pack.
Coming to your second question: as mentioned in the blog, there are two approaches. The first is to have a separate index for each language, so that you don't have to define all the language-specific fields in one mapping. The second, which we use, is to have one field for each language, with all the languages being part of the same index.
There is no overhead in maintaining 8 custom analyzers, as most of the analyzers are built in; you can check the Elasticsearch language analyzers to see which of the languages in your use case are supported. Any others that you have to create will be just a one-time effort and will be part of your settings and mapping.
Below is an example index mapping for the per-field approach, where I am using the built-in analyzers of some of the most common languages.
{
  "mappings": {
    "properties": {
      "en": {
        "type": "text",
        "analyzer": "english"
      },
      "russian": {
        "type": "text",
        "analyzer": "russian"
      },
      "spanish": {
        "type": "text",
        "analyzer": "spanish"
      },
      "swedish": {
        "type": "text",
        "analyzer": "swedish"
      }
    }
  }
}
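If writing one block per language by hand feels long, the same mapping can be generated from a list of analyzer names. The following is only a minimal sketch under a few assumptions: the helper names build_mapping and build_query are hypothetical, each field is named after its analyzer (so "english" rather than "en" as in the mapping above), and the multi_match query is just one possible way to search all the per-language fields at once.

```python
import json

# Built-in Elasticsearch language analyzers used in the mapping above.
# Extend this list with the other languages you need, after checking that
# a built-in analyzer with that name exists.
LANGUAGES = ["english", "russian", "spanish", "swedish"]


def build_mapping(languages):
    """Return an index body with one text field per language analyzer.

    Assumption: each field is named after its analyzer.
    """
    return {
        "mappings": {
            "properties": {
                lang: {"type": "text", "analyzer": lang} for lang in languages
            }
        }
    }


def build_query(text, languages):
    """Return a multi_match query body that searches every language field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(languages),
                "type": "most_fields",
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON bodies so they can be sent with any HTTP client.
    print(json.dumps(build_mapping(LANGUAGES), indent=2))
    print(json.dumps(build_query("hello", LANGUAGES), indent=2))
```

The printed JSON can be used as the body of the index creation request and of a search request respectively; adding a language then only means adding one name to the list.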
Upvotes: 1