iGEL
iGEL

Reputation: 17392

Does Sphinx handle content in asian languages well?

I'm thinking about using Sphinx as a search engine for my site. But since I have a lot of Korean content, and other languages like Chinese and Thai may follow, I wonder how well Sphinx can handle this type of content.

Upvotes: 2

Views: 1199

Answers (3)

Shamsul Haque
Shamsul Haque

Reputation: 2451

In thinking sphinx 3:-

Create a thinking_sphinx.yml file inside config folder and put these lines as :-

development:
  enable_star: 1
  min_infix_len: 3
  ngram_len: 1
  ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
  charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
test:
  enable_star: 1
  min_infix_len: 1
production:
  enable_star: 1
  min_infix_len: 3
  ngram_len: 1
  enable_star: true
  ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
  charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027

See Unicode Character Set Tables for more.

Upvotes: 0

tszming
tszming

Reputation: 2084

I am using Sphinx to search CJK charcters (Chinese, Japanese, and Korean), what you need to do is to add the following lines in your index block of your configuration file.

index test {
  ...
  charset_type = utf-8
  ngram_len = 1
  ngram_chars = U+3000..U+2FA1F
}

Upvotes: 4

Tom Ilsinszki
Tom Ilsinszki

Reputation: 623

Sphinx works well for UTF-8 characters (which includes Korean I believe), but you'll have to include a list of UTF-8 characters codes to index in your sphinx config file.

This is how my charset_table variable looks like in sphinx config, to add all kinds of characters from European languages:

charset_table       = 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027

Upvotes: 2

Related Questions