Sphinx n-gram & charset_table

Question

I'm modifying the charset mapping for a Sphinx cluster and have run into an oddity that the documentation does not cover. The previous author of the charset_table and ngram_chars definitions put the CJK Unicode ranges in both settings.

Is this necessary?

If not, what is the purpose of this duplication?

Sébastien Renauld · Accepted Answer

I am going to answer my own question after doing some extensive testing. As it turns out, charset_table and ngram_chars complement each other rather than one being a subset of the other.

Testing run

Docset

Just charset_table

using config file 'sphinx.conf'...
index 'i_blah': query 'ぇ ': returned 0 matches of 0 total in 0.000 sec

using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec

displaying matches:
1. document=123, weight=1500

words:
1. 'ぇえぉおかがきぎく': 1 documents, 1 hits

Just ngram_chars

using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec

displaying matches:
1. document=123, weight=9500

words:
1. 'ぇ': 1 documents, 1 hits
2. 'え': 1 documents, 1 hits
3. 'ぉ': 1 documents, 1 hits
4. 'お': 1 documents, 1 hits
5. 'か': 1 documents, 1 hits
6. 'が': 1 documents, 1 hits
7. 'き': 1 documents, 1 hits
8. 'ぎ': 1 documents, 1 hits
9. 'く': 1 documents, 1 hits

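So a character's presence in charset_table has no effect on indexing when that character is also present in ngram_chars; the two settings do not depend on one another. With only charset_table, the whole run of hiragana is indexed as a single keyword, which is why the query for 'ぇ' alone returns nothing. With only ngram_chars, each character is indexed as its own keyword.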
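To make the difference between the two runs concrete, here is a small Python sketch that mimics the keyword splitting seen in the output above. It is a toy model, not Sphinx's actual tokenizer: the `tokenize` function, the `HIRAGANA` set, and the rule that ngram_chars wins when a character is in both sets are assumptions taken from the test results, with ngram_len assumed to be 1.

```python
# Toy model of the keyword splitting observed in the test runs.
# This is NOT Sphinx's real tokenizer; names and rules are illustrative.

# Hiragana block, U+3040..U+309F (covers every character in the test document).
HIRAGANA = {chr(cp) for cp in range(0x3040, 0x30A0)}


def tokenize(text, charset, ngram_chars):
    """Split text into keywords.

    - A character in ngram_chars becomes a keyword on its own
      (assumes ngram_len = 1) and takes precedence over charset.
    - A run of characters in charset becomes one keyword.
    - Any other character acts as a separator.
    """
    tokens = []
    run = []
    for ch in text:
        if ch in ngram_chars:
            if run:  # close any pending charset run first
                tokens.append("".join(run))
                run = []
            tokens.append(ch)
        elif ch in charset:
            run.append(ch)
        elif run:  # separator: close the pending run
            tokens.append("".join(run))
            run = []
    if run:
        tokens.append("".join(run))
    return tokens


doc = "ぇえぉおかがきぎく"

# "Just charset_table": the whole run is a single keyword.
only_charset = tokenize(doc, charset=HIRAGANA, ngram_chars=set())

# "Just ngram_chars": every character is its own keyword.
only_ngram = tokenize(doc, charset=set(), ngram_chars=HIRAGANA)

print(only_charset)
print(only_ngram)
```

In this model, listing the range in both sets gives the same keywords as listing it in ngram_chars alone, which matches the conclusion above.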
