Reputation: 17392
I'm thinking about using Sphinx as a search engine for my site. But since I have a lot of Korean content, and other languages like Chinese and Thai may follow, I wonder how well Sphinx can handle this type of content.
Upvotes: 2
Views: 1199
Reputation: 2451
In Thinking Sphinx 3, create a thinking_sphinx.yml file inside the config folder and add these lines:
development:
  enable_star: 1
  min_infix_len: 3
  ngram_len: 1
  ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
  charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
test:
  enable_star: 1
  min_infix_len: 1
production:
  enable_star: 1
  min_infix_len: 3
  ngram_len: 1
  ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
  charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
See Unicode Character Set Tables for more.
Upvotes: 0
Reputation: 2084
I am using Sphinx to search CJK characters (Chinese, Japanese, and Korean). What you need to do is add the following lines to the index block of your configuration file.
index test {
    ...
    charset_type = utf-8
    ngram_len = 1
    ngram_chars = U+3000..U+2FA1F
}
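To get a feel for what ngram_len = 1 does with that range, here is a rough Python sketch. It is not Sphinx's tokenizer, only an approximation of the idea: every character inside the ngram_chars range becomes its own token, and everything else is grouped into whitespace-separated words. The function names are made up for illustration.

```python
# Range taken from the ngram_chars line in the config above.
NGRAM_LO, NGRAM_HI = 0x3000, 0x2FA1F


def is_ngram_char(ch):
    """True if the character's code point falls inside the ngram_chars range."""
    return NGRAM_LO <= ord(ch) <= NGRAM_HI


def split_unigrams(text):
    """Approximate ngram_len = 1 (hypothetical helper, not a Sphinx API).

    Characters in the ngram range are emitted one per token; all other
    non-space characters are collected into ordinary words.
    """
    tokens = []
    word = []
    for ch in text:
        if is_ngram_char(ch):
            if word:  # close the word collected so far
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append("".join(word))
                word = []
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


if __name__ == "__main__":
    # Korean syllables are in range, so each one is split out on its own.
    print(split_unigrams("Sphinx \uac80\uc0c9"))
    # Thai (U+0E00..U+0E7F) lies below U+3000, so it stays one word here.
    print(split_unigrams("\u0e04\u0e49\u0e19\u0e2b\u0e32"))
```

Note that Thai code points fall outside U+3000..U+2FA1F, so if Thai content follows later it would need its own entries.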
Upvotes: 4
Reputation: 623
Sphinx works well with UTF-8 text (which includes Korean, I believe), but you'll have to list the character codes to index in your Sphinx config file.
This is what my charset_table setting looks like in the Sphinx config. It covers all kinds of characters from European languages:
charset_table = 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
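If you want to check which characters a table like this covers, a small script can expand the entries into code point ranges and test membership. This is only a sketch: the parser handles plain entries and `..` ranges as they appear above (not `->` mappings), the helper names are invented, and EXCERPT is a shortened subset of the table, not the full list.

```python
def parse_charset(spec):
    """Parse 'a..z, U+00C0..U+00DE, U+0027' into (lo, hi) code point pairs."""

    def code_point(token):
        token = token.strip()
        if token.upper().startswith("U+"):
            return int(token[2:], 16)  # U+XXXX form, hexadecimal
        if len(token) != 1:
            raise ValueError("unsupported entry: %r" % token)
        return ord(token)  # literal character such as 'a' or '0'

    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        lo, sep, hi = item.partition("..")
        start = code_point(lo)
        end = code_point(hi) if sep else start  # single entry: lo == hi
        ranges.append((start, end))
    return ranges


def is_covered(ch, ranges):
    """True if the character falls inside any of the parsed ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


# Shortened excerpt of the charset_table above (assumption: subset only).
EXCERPT = "0..9, A..Z, U+00C0..U+00DE, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0027"

if __name__ == "__main__":
    ranges = parse_charset(EXCERPT)
    for ch in ["a", "\u00e9", "\ud55c"]:  # 'a', e-acute, a Hangul syllable
        print("U+%04X covered: %s" % (ord(ch), is_covered(ch, ranges)))
```

The full table above stops at U+017F (plus U+0027), so Hangul syllables (U+AC00..U+D7A3) are not covered by it and would have to be added separately for Korean content.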
Upvotes: 2