Reputation: 1086
I noticed a strange behavior in the sort order using Sphinx 2.2.8 (same result with v2.3.1-beta).
I expect exact matches to appear in first position (I set index_exact_words and expand_keywords for that).
That works well in my first example below, with two rows. But if I add more rows, the weights change and my exact-match result (id=1) gets a lower rank than an approximate match!
For example, indexing these 2 words (French words, with morphology libstemmer_fr):
source nptest
{
    type = pgsql
    sql_host = localhost
    sql_user = myuser
    sql_pass = mypassword
    sql_db = mydb
    sql_port = 5432
    sql_query = \
        SELECT 1, 'chien' AS val \
        UNION \
        SELECT 2, 'chienne'
    sql_field_string = val
}
index nptest
{
    type = plain
    mlock = 1
    source = nptest
    path = /var/lib/sphinx/data/nptest
    morphology = libstemmer_fr
    index_exact_words = 1
    expand_keywords = 1
}
After indexing (indexer --rotate nptest):
mysql> SELECT id, val, weight() FROM nptest WHERE match('chien');
+------+---------+----------+
| id   | val     | weight() |
+------+---------+----------+
|    1 | chien   |     1500 |
|    2 | chienne |     1428 |
+------+---------+----------+
2 rows in set (0.00 sec)
The word "chien" has a higher rank than "chienne" => that's what I expected.
Now I add more rows to my db:
source nptest
{
    type = pgsql
    sql_host = localhost
    sql_user = myuser
    sql_pass = mypassword
    sql_db = mydb
    sql_port = 5432
    sql_query = \
        SELECT 1, 'chien' AS val \
        UNION \
        SELECT 2, 'chienne' \
        UNION \
        SELECT 3, 'un beau chien' \
        UNION \
        SELECT 4, 'chien-loup'
    sql_field_string = val
}
mysql> SELECT id, val, weight() FROM nptest WHERE match('chien');
+------+---------------+----------+
| id   | val           | weight() |
+------+---------------+----------+
|    2 | chienne       |     1402 |
|    1 | chien         |     1373 |
|    3 | un beau chien |     1373 |
|    4 | chien-loup    |     1373 |
+------+---------------+----------+
4 rows in set (0.00 sec)
After reindexing, the highest rank now goes to "chienne"!
Is this normal behaviour (and if so, why?) or is it a bug? If it is not a bug, how can I ensure that exact matches get the highest rank?
Upvotes: 0
Views: 1098
Reputation: 1086
This is expected behaviour.
BM25-based rankers take keyword rarity into account: the more documents a keyword occurs in, the less it contributes to the weight.
In the example above, the word "chienne" is rarer than the word "chien", so it ranks higher.
On a real data set, it may work better than in this small example.
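To make the rarity effect concrete, here is a rough Python sketch (not Sphinx's actual code) that happens to reproduce the weights shown in the question. It assumes that the default ranker computes lcs*1000 + bm25, that expand_keywords turns the query 'chien' into two keywords (the stemmed form and the exact form =chien), and that IDF follows the normalized formula log((N-n+1)/n). The scaling by 2*log(N+1), the division by the number of query keywords, k1 = 1.2 and the mapping onto 0..1000 are assumptions on my side, and idf() and weight() are hypothetical helper names.

```python
import math


def idf(n_docs, n_matching):
    """Normalized IDF: log((N - n + 1) / n), scaled into a small range.

    The 2*log(N + 1) scaling is an assumption, not taken from the thread.
    It turns negative once a keyword occurs in more than about half the docs.
    """
    return math.log((n_docs - n_matching + 1) / n_matching) / (2 * math.log(n_docs + 1))


def weight(n_docs, doc_freqs, lcs=1, k1=1.2, n_query_keywords=2):
    """Approximate lcs*1000 + bm25 for one document.

    doc_freqs lists the document frequency of every expanded query keyword
    that this document matches; each keyword is assumed to occur once (tf=1).
    """
    tf_part = 1 / (1 + k1)  # tf / (tf + k1) with tf = 1
    tfidf = sum(tf_part * idf(n_docs, n) / n_query_keywords for n in doc_freqs)
    bm25 = round((0.5 + tfidf) * 1000)  # assumed mapping onto 0..1000
    return lcs * 1000 + bm25


# Two rows: the stem is in 2 of 2 docs, the exact form =chien in 1 of 2.
print(weight(2, [2, 1]))  # 'chien'   (stem + exact form) -> 1500
print(weight(2, [2]))     # 'chienne' (stem only)         -> 1428

# Four rows: the stem is in 4 of 4 docs, the exact form =chien in 3 of 4.
print(weight(4, [4]))     # 'chienne' (stem only)         -> 1402
print(weight(4, [4, 3]))  # 'chien'   (stem + exact form) -> 1373
```

Under these assumptions, the exact form =chien occurs in 3 of the 4 documents, so its IDF becomes negative and matching it lowers the weight, whereas "chienne" matches only the stem. With two rows, =chien occurs in just 1 of 2 documents, its IDF is positive, and the exact match comes first.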
Further reading is available from this post on sphinxsearch.com: http://sphinxsearch.com/forum/view.html?id=13348
Upvotes: 0
Reputation:
You probably need to check what the default ranker does in your version and decide whether you should use a different one. See the "So how do I rank exact field matches higher?" question.
Upvotes: 1