Sébastien Renauld

Reputation: 19662

Sphinx n-gram & charset_table

I'm modifying the charset mapping for a Sphinx cluster and I've run into an oddity that the documentation does not cover. The previous author of the charset_table and ngram_chars definitions put the CJK Unicode ranges in both settings.

Is this necessary?

If not, what is the purpose of this duplication?

Upvotes: 0

Views: 1063

Answers (2)

Sébastien Renauld

Reputation: 19662

I am going to answer my own question after some extensive testing. As it turns out, charset_table and ngram_chars complement each other; neither needs to be a subset of the other.

Testing run

Docset

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
    <sphinx:schema>
        <sphinx:field name="foo"/>
    </sphinx:schema>
    <sphinx:document id="123">
        <foo><![CDATA[ぇえぉおかがきぎく]]></foo>
    </sphinx:document>
</sphinx:docset>

Just charset_table

using config file 'sphinx.conf'...
index 'i_blah': query 'ぇ ': returned 0 matches of 0 total in 0.000 sec

using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec

displaying matches:
1. document=123, weight=1500

words:
1. 'ぇえぉおかがきぎく': 1 documents, 1 hits

Just ngram_chars

using config file 'sphinx.conf'...
index 'i_blah': query 'ぇえぉおかがきぎく ': returned 1 matches of 1 total in 0.000 sec

displaying matches:
1. document=123, weight=9500

words:
1. 'ぇ': 1 documents, 1 hits
2. 'え': 1 documents, 1 hits
3. 'ぉ': 1 documents, 1 hits
4. 'お': 1 documents, 1 hits
5. 'か': 1 documents, 1 hits
6. 'が': 1 documents, 1 hits
7. 'き': 1 documents, 1 hits
8. 'ぎ': 1 documents, 1 hits
9. 'く': 1 documents, 1 hits

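With only charset_table, the whole string is indexed as a single word, which is why the single-character query 'ぇ' returns nothing. With only ngram_chars, each character is indexed as its own token. So a character listed in ngram_chars is indexed and segmented whether or not it also appears in charset_table. The two settings do not depend on one another.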
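
For anyone wanting to repeat the comparison, here is a rough sketch of what the two index variants could look like. Only the index name i_blah and the field foo come from the test above. The source name, the paths, the non-CJK part of charset_table, and the choice of the Hiragana block (U+3040..U+309F, which covers the test string) are assumptions for illustration, not the exact configuration used in the test.

```
source src_blah
{
    type            = xmlpipe2
    # hypothetical path to the docset shown above
    xmlpipe_command = cat /path/to/docset.xml
}

index i_blah
{
    source = src_blah
    # hypothetical index path
    path   = /path/to/data/i_blah

    # older Sphinx releases may also need: charset_type = utf-8 (assumption)

    # Variant A ("Just charset_table"):
    # Hiragana is listed as ordinary word characters, no n-gram settings
    charset_table = 0..9, A..Z->a..z, _, a..z, U+3040..U+309F

    # Variant B ("Just ngram_chars"):
    # remove U+3040..U+309F from charset_table above and uncomment
    # the two lines below, so each Hiragana character becomes its own token
    # ngram_len   = 1
    # ngram_chars = U+3040..U+309F
}
```

Reindex after switching variants, then rerun the same queries against i_blah to compare the "words" section of the output.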

Upvotes: 1

barryhunter

Reputation: 21091

I admit I have never used ngram_chars, but I think characters listed in ngram_chars also need to be in charset_table.

charset_table defines all the characters that get indexed; ngram_chars then defines which of those get segmented.

If a character is only in charset_table, it is indexed as part of normal words.

If a character is only in ngram_chars, it has no effect.

Upvotes: 0
