jnr
jnr

Reputation: 151

Why does MariaDB stop using an index for larger date ranges?

I'm scratching my head over a query optimization problem in MariaDB. I have a table with 300 000 000 rows and with an index on the SEND_DATE column, and the following query runs very fast:

    SELECT  
        *  
    FROM  
        MESSAGE  
    WHERE  
        SEND_DATE >= '2024-01-01 00:00:00'  
        AND SEND_DATE < '2024-06-01 00:00:00';  

Execution time: 0.044 seconds. 🚀 (plus fetch time ~150s)

However, when I increase the date range by just one month, like this:

    SELECT  
        *  
    FROM  
        MESSAGE  
    WHERE  
        SEND_DATE >= '2024-01-01 00:00:00'  
        AND SEND_DATE < '2024-07-01 00:00:00';  

The query suddenly slows down dramatically – execution time jumps to 466 seconds. 😱 (plus fetch time ~150s)

It seems that MariaDB stops using the index and switches to a full table scan instead.

Interestingly, when I force the index using FORCE INDEX, the query runs fast again, even for much larger date ranges:

    SELECT  
        *  
    FROM  
        MESSAGE FORCE INDEX (INDEX_SEND_DATE)  
    WHERE  
        SEND_DATE >= '2024-01-01 00:00:00'  
        AND SEND_DATE < '2024-07-01 00:00:00'; 

With FORCE INDEX, the query performs as expected. 🚀

What I’ve Tried:

My Question:

How can I configure MariaDB to always prefer the index for this type of query, even for larger date ranges? I’d prefer not to manually force the index with FORCE INDEX in every query.

Any suggestions or insights are greatly appreciated! 😊

Edit: 1

DB version: 10.4.12

I made ANALYZE FORMAT=JSON For narrower date range:

{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 68531,
    "table": {
      "table_name": "MESSAGE",
      "access_type": "range",
      "possible_keys": ["INDEX_SEND_DATE"],
      "key": "INDEX_SEND_DATE",
      "key_length": "5",
      "used_key_parts": ["SEND_DATE"],
      "r_loops": 1,
      "rows": 24805704,
      "r_rows": 1.38e7,
      "r_total_time_ms": 65458,
      "filtered": 100,
      "r_filtered": 100,
      "index_condition": "MESSAGE.SEND_DATE >= '2024-01-01 00:00:00' and MESSAGE.SEND_DATE < '2024-05-01 00:00:00'"
    }
  }
}

For wider date range:

{
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 632820,
    "table": {
      "table_name": "MESSAGE",
      "access_type": "ALL",
      "possible_keys": ["INDEX_SEND_DATE"],
      "r_loops": 1,
      "rows": 241889492,
      "r_rows": 2.99e8,
      "r_total_time_ms": 582567,
      "filtered": 24.389,
      "r_filtered": 8.9888,
      "attached_condition": "MESSAGE.SEND_DATE >= '2024-01-01 00:00:00' and MESSAGE.SEND_DATE < '2024-08-01 00:00:00'"
    }
  }
}

Upvotes: 3

Views: 84

Answers (1)

Rick James
Rick James

Reputation: 142503

This is a design consideration built into the Optimizer.

The Optimizer looks in the index's BTree to guestimate the number of rows matching the range. If that is more than some threshold (often about 20%), it will decide that a table scan will be faster. Note: There is no practical way to get a better cutoff.

Why does that matter? Let me explain how an index works (at least in your example).

  1. Drill down into the index's B+Tree to the first value in the range ('2024-01-01 00:00:00').
  2. Walk forward through the index until the end value.
  3. For each index entry found, drill into the data's BTree to find the desired row (SELECT *).

Note the back and forth and the many BTree lookups. These cost something; it is hard to know how much.

The alternative to doing a table scan is to look at every row in the table, essentially a sequential read through the B+Tree. No back and forth; no drilling down BTrees.

Conclusion: Any of these cases will make using the index slower than a table scan:

  • Lots of I/O to fetch the data blocks.
  • The desired rows are scattered around the table.
  • innodb_buffer_pool_size os too small for the task.
  • And, of course, "too many" rows are needed.

See if

SELECT * FROM Messages
    WHERE SEND_DATE >= '2024-01-01'  
      AND SEND_DATE <  '2024-08-01'
    ORDER BY SEND_DATE;  

will work faster. Note that the Optimizer may decide that the ORDER BY will avoid a sort, thereby changing the decision process it uses.

Upvotes: 0

Related Questions