Reputation: 19662
I'm modifying the charset mapping for a Sphinx cluster and I've run into an oddity that the documentation does not cover. The previous author of the charset_table and ngram_chars definitions put the CJK Unicode ranges in both the charset mapping and the n-gram character list.
Is this necessary?
If not, what is the purpose of this duplication?
Upvotes: 0
Views: 1063
Reputation: 19662
I am going to answer my own question after doing some extensive testing. As it turns out, charset_table and ngram_chars complement each other rather than one being a subset of the other.
Docset
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="foo"/>
</sphinx:schema>
<sphinx:document id="123">
<foo><![CDATA[ぇえぉおかがきぎく]]></foo>
</sphinx:document>
</sphinx:docset>
Just charset_table
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇ ': returned 0 matches of 0 total in 0.000 sec
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=123, weight=1500
words:
1. 'ぇえぉおかがきぎく': 1 documents, 1 hits
Just ngram_chars
using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=123, weight=9500
words:
1. 'ぇ': 1 documents, 1 hits
2. 'え': 1 documents, 1 hits
3. 'ぉ': 1 documents, 1 hits
4. 'お': 1 documents, 1 hits
5. 'か': 1 documents, 1 hits
6. 'が': 1 documents, 1 hits
7. 'き': 1 documents, 1 hits
8. 'ぎ': 1 documents, 1 hits
9. 'く': 1 documents, 1 hits
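So, the presence of a character in charset_table does not in any way affect indexing if the character is also present in ngram_chars. With only charset_table, the whole run of characters is indexed as a single word (which is why the query for 'ぇ' alone returns nothing); with only ngram_chars, each character is indexed as its own keyword. The two settings do not depend on one another.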
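To make the difference between the two test runs concrete, here is a minimal Python sketch of the behaviour I observed. It is a toy model, not Sphinx's actual tokenizer: the function name `toy_tokenize`, the use of the Hiragana block U+3040..U+309F as a stand-in for the CJK ranges, and the rule that n-gram characters take priority when a character is in both lists are all assumptions made for illustration.

```python
# Toy model only: this is NOT Sphinx's tokenizer, just an illustration
# of the behaviour seen in the two test runs above.

# Assumption: the Hiragana block U+3040..U+309F stands in for the CJK ranges.
HIRAGANA = {chr(cp) for cp in range(0x3040, 0x30A0)}


def toy_tokenize(text, charset_table, ngram_chars):
    """Split text into keywords under a simplified model.

    - Characters in ngram_chars become one-character keywords (1-grams).
    - Characters only in charset_table accumulate into ordinary words.
    - Any other character acts as a separator.
    """
    tokens = []
    word = []

    def flush():
        # Emit the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if ch in ngram_chars:
            # Assumed priority: n-gram characters win over charset_table.
            flush()
            tokens.append(ch)
        elif ch in charset_table:
            word.append(ch)
        else:
            flush()
    flush()
    return tokens


TEXT = "ぇえぉおかがきぎく"

# "Just charset_table": the whole run is a single keyword.
only_charset = toy_tokenize(TEXT, HIRAGANA, set())

# "Just ngram_chars": every character is its own keyword.
only_ngram = toy_tokenize(TEXT, set(), HIRAGANA)

# Both lists contain the range: same result as ngram_chars alone (assumed).
both = toy_tokenize(TEXT, HIRAGANA, HIRAGANA)
```

Under this model, `only_charset` holds one nine-character keyword, so a search for a single character cannot match, while `only_ngram` holds nine one-character keywords, mirroring the "words:" lists in the output above. The `both` case reflects my reading of the results rather than anything checked against the Sphinx source.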
Upvotes: 1
Reputation: 21091
I admit I have never used ngram_chars, but I think characters listed in ngram_chars also need to be in charset_table.
charset_table defines all characters that get indexed; ngram_chars then defines which of those get segmented.
If a character is only in charset_table, it will be indexed as part of normal words.
If a character is only in ngram_chars, it has no effect.
Upvotes: 0