Pravin Thokal

Reputation: 51

Lucene Indexing strategy with MultiLingual Support

We are using Lucene.net for search in our application, and it works well. Now we need to support multiple languages, so I would like to ask which indexing strategy we should use. For example: indexing different languages in different index folders with different analyzers, or keeping a single index folder whose documents have fields for English and for the other languages (we would end up with too many fields, because each field is repeated per language). Or is there any other alternative?

Upvotes: 2

Views: 976

Answers (1)

aditrip

Reputation: 314

The ideal strategy would be to add a language field and let the other existing fields hold content in any language. The value of the language field then dynamically selects the language analyzer for the multilingual fields. In essence, though, a single field will hold content in many languages, which affects the term statistics.
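As a rough illustration of letting the language field pick the analyzer, here is a minimal sketch in plain Python. It is not Lucene.net code: the `STOP_WORDS` table, `analyze`, and `index_document` are hypothetical names, and the "analyzer" is reduced to lowercasing, whitespace splitting, and stop word removal.

```python
# Hypothetical per-language stop word lists; real language analyzers
# also handle stemming, tokenization rules, and more.
STOP_WORDS = {
    "en": {"the", "a", "of"},
    "fr": {"le", "la", "de"},
}


def analyze(text, language):
    """Toy analyzer: lowercase, split on whitespace, drop stop words."""
    stop = STOP_WORDS.get(language, set())
    return [tok for tok in text.lower().split() if tok not in stop]


def index_document(doc):
    """Analyze every field except 'language' with the analyzer
    selected by the document's own language field."""
    lang = doc["language"]
    return {
        field: analyze(value, lang)
        for field, value in doc.items()
        if field != "language"
    }


english = index_document({"language": "en", "title": "The history of Lucene"})
french = index_document({"language": "fr", "title": "La vie de Lucene"})
```

Both documents end up in the same `title` field, which is exactly why the term statistics of that field mix languages.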

Since a term in Lucene is identified by field:term, term statistics become a concern for languages that share words, especially if a term is frequently used in one language and uncommon in another. The worst case is a word that is a stop word in one language and an important term in another. If that applies to you, this strategy is a no-go. However, for your set of languages it is possible that the vocabularies are mutually exclusive and the term statistics are not affected. In that case you can expect TFIDFSimilarity to work. If you are using other Similarity classes, they should mostly work well whenever TF-IDF works.
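The effect on term statistics can be seen with a toy corpus. The sketch below is plain Python, not Lucene; the `idf` function is a simplified inverse document frequency chosen for illustration and is not claimed to match any particular Similarity class. The word "die" is an article in German and a content word in English.

```python
import math


def idf(num_docs, doc_freq):
    # Simplified IDF for illustration only (an assumption, not
    # necessarily the formula used by a given Lucene Similarity).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def doc_freq(term, texts):
    """Number of texts that contain the term as a whole token."""
    return sum(1 for text in texts if term in text.split())


# Toy corpus: (language, text).
docs = [
    ("de", "die katze schnurrt"),
    ("de", "die sonne scheint"),
    ("de", "die stadt ist alt"),
    ("en", "batteries die quickly"),
    ("en", "the cat sleeps"),
    ("en", "the sun shines"),
]

shared_field = [text for _, text in docs]
english_only = [text for lang, text in docs if lang == "en"]

# One shared field: German documents inflate the document frequency of "die".
idf_shared = idf(len(shared_field), doc_freq("die", shared_field))
# English-only field: "die" is rare and therefore weighted higher.
idf_english = idf(len(english_only), doc_freq("die", english_only))
```

In the shared field the English term "die" gets a lower weight than it would in an English-only field, because the German article dominates the statistics.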

For other strategies:

It depends on (a) the number of languages to support (say m) and (b) the number of fields that need to be multilingual (say n).

If both m and n are small, you can go for a multi-field approach:

(en = English, jp = Japanese, fr = French)
field1_en, field1_jp, field1_fr,
field2_en, field2_jp, field2_fr

Unless m*n exceeds roughly 1000 fields, this is a safe strategy. Lucene's performance degrades when the number of fields is very large.
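The per-language field naming above can be sketched as follows. This is plain Python with hypothetical helper names (`multilingual_field_names`, `field_for`); it only shows how the m*n field names are derived and how a query would be routed to the field for its language.

```python
from itertools import product


def multilingual_field_names(fields, languages):
    """Build one field name per (field, language) pair: n * m names."""
    return [f"{field}_{lang}" for field, lang in product(fields, languages)]


def field_for(field, language):
    """Pick the field to index into or query for a given language."""
    return f"{field}_{language}"


fields = ["field1", "field2"]        # n = 2 multilingual fields
languages = ["en", "jp", "fr"]       # m = 3 languages

names = multilingual_field_names(fields, languages)
total_fields = len(names)            # m * n
```

Because each language has its own field, each field keeps its own term statistics and can be given its own analyzer.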

If the number of languages is very small, separate index folders (with separate schemas) can work. Note, however, that returning results across different languages is then a concern in many search engines. Elasticsearch handles this well, though.
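A minimal sketch of the separate-index idea, again in plain Python rather than Lucene.net: `PerLanguageIndexes` is a hypothetical in-memory stand-in for one index folder per language. It returns hits grouped by language instead of one merged ranking, since combining results across indexes is the part flagged above as a concern.

```python
from collections import defaultdict


class PerLanguageIndexes:
    """Toy stand-in for one index folder per language (not a Lucene API)."""

    def __init__(self):
        # language code -> list of (doc_id, tokens)
        self.indexes = defaultdict(list)

    def add(self, language, doc_id, text):
        """Route the document to the index for its language."""
        self.indexes[language].append((doc_id, text.lower().split()))

    def search(self, term, languages):
        """Search each requested language index separately and
        return the matching document ids grouped by language."""
        term = term.lower()
        return {
            lang: [
                doc_id
                for doc_id, tokens in self.indexes.get(lang, [])
                if term in tokens
            ]
            for lang in languages
        }


idx = PerLanguageIndexes()
idx.add("en", 1, "Lucene search")
idx.add("fr", 2, "recherche Lucene")
idx.add("fr", 3, "bonjour")
hits = idx.search("lucene", ["en", "fr"])
```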

Upvotes: 3
