Continuation

Reputation: 13040

Lucene / Solr: what request handlers to use for query strings in Chinese or Japanese?

For my Solr server, some of the query strings will be in Asian languages such as Chinese or Japanese.

For such query strings, would the Standard or Dismax request handler work? My understanding is that both the Standard and the Dismax handler tokenize the query string by whitespace. And that wouldn't work for Chinese or Japanese, right?

In that case, what request handler should I use? And if I need to set up custom request handlers for those languages, how do I do it?

Thanks.

Upvotes: 0

Views: 1525

Answers (2)

Nick Zadrozny

Reputation: 7944

Your queries will be parsed according to the analyzers of the fields you're querying, whether you're using the standard Solr query parser or DisMax query parser.

So in this case, as Mauricio says, the question is about how your strings of text are analyzed into tokens.

For Chinese and Korean, there is CJK, which performs basic n-gram analysis, breaking text down into overlapping character pairs (bigrams). It's not the best approach in terms of relevance and index size, but it works.
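To make the bigram idea concrete, here is a minimal Python sketch of what that kind of analysis produces. It is an illustration only, not Lucene's actual implementation: the function name `cjk_bigrams` is hypothetical, and the real analyzer also handles mixed scripts, punctuation, and token offsets.

```python
def cjk_bigrams(text: str) -> list[str]:
    """Split a run of CJK text into overlapping two-character tokens.

    Simplifying assumption: whitespace is dropped and every remaining
    character is treated as part of one continuous CJK run.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Zero or one character: nothing to pair up.
        return ["".join(chars)] if chars else []
    # Slide a window of width 2 across the characters.
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


tokens = cjk_bigrams("中华人民")
print(tokens)
```

Because every adjacent pair becomes a term, the index grows quickly and some pairs are not real words, which is the relevance and size trade-off mentioned above.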

For Japanese, I highly recommend the new Kuromoji morphological analyzer, available in Solr and Lucene as of 3.6.0. It uses a dictionary and some other statistics to tokenize text into real terms. That lets you do all sorts of really excellent quality

Docs are sparse at the moment, so check out these links…

Upvotes: 1

Mauricio Scheffer

Reputation: 99720

It's not about the request handler but the language analyzers.

Lucene has a CJK package for this purpose. See here for info on using it in Solr.
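As a rough illustration of why the analyzer matters more than the request handler, the Python sketch below applies the same analyzer to both the document and the query, which mirrors how a field's analyzer is used at index time and query time. Everything here is a toy model under stated assumptions: `whitespace_analyzer`, `bigram_analyzer`, and `matches` are hypothetical names, not Solr or Lucene APIs.

```python
def whitespace_analyzer(text: str) -> list[str]:
    """Toy analyzer: split on whitespace only."""
    return text.split()


def bigram_analyzer(text: str) -> list[str]:
    """Toy analyzer: overlapping character pairs, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [a + b for a, b in zip(chars, chars[1:])]


def matches(analyzer, document: str, query: str) -> bool:
    """True if every query term also appears among the document's terms."""
    doc_terms = set(analyzer(document))
    query_terms = analyzer(query)
    return bool(query_terms) and all(t in doc_terms for t in query_terms)


document = "我喜欢搜索引擎"  # no spaces, as is normal for Chinese text
query = "搜索"

# Whitespace splitting sees the whole sentence as one term, so the query misses.
print(matches(whitespace_analyzer, document, query))
# Bigram analysis produces the term "搜索" on both sides, so the query hits.
print(matches(bigram_analyzer, document, query))
```

The request handler is the same in both cases; only the analysis step changes whether the query finds the document.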

See also this thread for alternatives.

Upvotes: 1
