SphinxSearch exact match ranking

Question

I noticed a strange behavior in the sort order using Sphinx 2.2.8 (same result with v2.3.1-beta).

I expect exact matches to appear in first position (I set index_exact_words and expand_keywords for that).

That works well in my first example below, with two rows. But if I add more rows, the weights change and my exact match (id=1) gets a lower rank than an approximate one!

For example, indexing these 2 words (French words, with morphology libstemmer_fr):

source nptest
{
        type                    = pgsql
        sql_host                = localhost
        sql_user                = myuser
        sql_pass                = mypassword
        sql_db                  = mydb
        sql_port                = 5432

        sql_query               = \
                                  SELECT 1, 'chien' AS val \
                                  UNION \
                                  SELECT 2, 'chienne'

        sql_field_string = val
}

index nptest
{
        type                    = plain
        mlock                   = 1
        source                  = nptest
        path                    = /var/lib/sphinx/data/nptest
        morphology              = libstemmer_fr
        index_exact_words       = 1
        expand_keywords         = 1
}

After indexing (indexer --rotate nptest):

mysql> SELECT id, val, weight() FROM nptest WHERE match('chien');
+------+---------+----------+
| id   | val     | weight() |
+------+---------+----------+
|    1 | chien   |     1500 |
|    2 | chienne |     1428 |
+------+---------+----------+
2 rows in set (0.00 sec)

The word "chien" has a higher rank than "chienne", which is what I expected.

Now I add more rows to my db:

source nptest
{
        type                    = pgsql
        sql_host                = localhost
        sql_user                = myuser
        sql_pass                = mypassword
        sql_db                  = mydb
        sql_port                = 5432

        sql_query               = \
                SELECT 1, 'chien' AS val \
                UNION \
                SELECT 2, 'chienne' \
                UNION \
                SELECT 3, 'un beau chien' \
                UNION \
                SELECT 4, 'chien-loup'

        sql_field_string = val
}


mysql> SELECT id, val, weight() FROM nptest WHERE match('chien');
+------+---------------+----------+
| id   | val           | weight() |
+------+---------------+----------+
|    2 | chienne       |     1402 |
|    1 | chien         |     1373 |
|    3 | un beau chien |     1373 |
|    4 | chien-loup    |     1373 |
+------+---------------+----------+
4 rows in set (0.00 sec)

After reindexing, the highest rank is now on "chienne"!

Is this normal behaviour (if so, why?) or is it a bug? If it is not a bug, how can I ensure that an exact match gets the highest rank?

Nicolas Payart · Accepted Answer

This is expected behaviour.

BM25-based rankers take the rarity of each keyword into account: a keyword that occurs in fewer documents contributes more to the weight.

In the example above, the word "chienne" is rarer than the word "chien", so it ranks higher.
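
To make the rarity effect concrete, here is a minimal Python sketch of an inverse document frequency (IDF) term, the part of BM25 that rewards rare keywords. The formula log((N - n + 1) / n) is a common Robertson-style variant and is an assumption here, not something taken from the Sphinx source; the function name idf and the document counts are hypothetical and chosen only to resemble a four-row index.

```python
import math

def idf(total_docs: int, docs_with_term: int) -> float:
    """Robertson-style IDF (assumed variant, for illustration only)."""
    return math.log((total_docs - docs_with_term + 1) / docs_with_term)

# Hypothetical counts for a tiny 4-row index.
total = 4
common = idf(total, 3)  # a keyword form found in 3 of the 4 rows
rare = idf(total, 1)    # a keyword form found in only 1 of the 4 rows

print(f"common: {common:.3f}, rare: {rare:.3f}")
```

Under this assumed formula, a keyword form present in most rows gets a small or even negative IDF, while one present in a single row gets a large positive IDF. In a very small index, adding a couple of rows can therefore be enough to reorder the results.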

On a real data set, the ranking may work better than in this small example.

Further reading is available from this post on sphinxsearch.com: http://sphinxsearch.com/forum/view.html?id=13348
