Reputation: 13040
For my Solr server, some of the query strings will be in Asian languages such as Chinese or Japanese.
For such query strings, would the Standard or Dismax request handler work? My understanding is that both the Standard and the Dismax handler tokenize the query string by whitespace. And that wouldn't work for Chinese or Japanese, right?
In that case, what request handler should I use? And if I need to set up custom request handlers for those languages, how do I do it?
Thanks.
Upvotes: 0
Views: 1525
Reputation: 7944
Your queries will be parsed according to the analyzers of the fields you're querying, whether you're using the standard Solr query parser or DisMax query parser.
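To make that concrete, here is a toy Python sketch of the idea that the field's analyzer, not the request handler, decides how a query string becomes tokens. Nothing here is a Solr API: the field names `title_en` and `title_zh`, the two analyzer functions, and `analyze_query` are all hypothetical and only illustrate the principle.

```python
# Toy model only: every name below is hypothetical, not part of Solr.

def whitespace_analyzer(text):
    """Split on whitespace, as an analyzer for space-delimited languages might."""
    return text.split()

def per_character_analyzer(text):
    """Emit one token per non-space character (a crude stand-in for a CJK analyzer)."""
    return [ch for ch in text if not ch.isspace()]

# Each field carries its own analyzer, like a field type in a schema.
FIELD_ANALYZERS = {
    "title_en": whitespace_analyzer,
    "title_zh": per_character_analyzer,
}

def analyze_query(field, query):
    """Tokenize the query with the analyzer of the field being searched."""
    return FIELD_ANALYZERS[field](query)

print(analyze_query("title_en", "solr search"))  # two tokens
print(analyze_query("title_zh", "全文检索"))       # four single-character tokens
print(analyze_query("title_en", "全文检索"))       # one unsplit token: wrong analyzer for this text
```

The last call shows the problem from the question: a whitespace-based analyzer leaves the Chinese string as a single token, so the fix belongs in the field's analysis chain.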
So in this case, as Mauricio says, the question is about how your strings of text are analyzed into tokens.
For Chinese and Korean, there is the CJK analyzer, which performs basic n-gram analysis, breaking text into overlapping pairs of adjacent characters (bigrams). It's not the best way to analyze in terms of relevance and index size, but it works.
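As a rough illustration of bigram analysis, the Python sketch below splits a run of CJK characters into overlapping two-character tokens. It is not Lucene's implementation; the function name `cjk_bigrams` is made up for this example, and it assumes the input is a single contiguous run of CJK text with no spaces or punctuation.

```python
def cjk_bigrams(text):
    """Return overlapping character bigrams for a contiguous run of CJK text.

    Simplified sketch: a real analyzer also handles mixed scripts,
    punctuation, and width normalization, none of which is modeled here.
    """
    if len(text) < 2:
        # A lone character is kept as a single token; empty input gives no tokens.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("東京都庁"))
```

For the four-character input this yields three tokens, one of which is 京都 (Kyoto) even though the text is about 東京 (Tokyo). That kind of spurious match, plus roughly one token per character, is why bigrams cost relevance and index size.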
For Japanese, I highly recommend the new Kuromoji morphological analyzer in Solr and Lucene 3.6.0. It uses a dictionary and some other statistics to tokenize text into real terms. That lets you do all sorts of really excellent quality
Docs are sparse at the moment, so check out these links…
Upvotes: 1
Reputation: 99720
It's not about the request handler but the language analyzers.
Lucene has a CJK package for this purpose. See here for info on using it in Solr.
See also this thread for alternatives.
Upvotes: 1