Range query with non-clustered index, including all of the queried fields, runs much faster than the same query with clustered index

Question

We are loading big amounts of data (~200 million rows, ~10GB in size) from SQL Server. The server is SQL Server 2019, running on a Linux machine, with plenty of memory and very fast network. The data is later processed by Spark application, but that doesn't matter, we get the same results with a test application.

The table we're querying is very wide, it has about 100 columns, and the query selects some of those columns, based on the time range. And, since we're running it from Spark, it is distributed among executors and additional time filter is added.

Something like:

select * 
from 
    (select time, col1, col5, col38 
     from table 
     where time > '2020-01-01 00:00:00.000' 
       and time <= '2020-02-01 00:00:00.000') subq 
where time > '2020-01-05 00:00:00.000' 
  and time <= '2020-01-06 00:00:00.000'

Initially, the table had only the unique clustered index (same as PK) on id (not used in our queries, but essential for other usages) and time, in that specific order. That, of course, gave us unacceptable results, like hours of loading time.

After we've added non-clustered index on the time column, the timings significantly improved, the loading started to take about 30-40 minutes.

Then we added all the queried columns to the non-clustered index, as 'included' columns. The performance improved dramatically, now that specific query completes in 20-30 seconds!

However, the problem here is that we don't know the exact columns in advance, there can be (almost) any combination of them. And building a new index takes hours.

So, we thought that non-clustered index on the time column with all included columns is essentially the same as the clustered index on time column, we've created a copy of the table with clustered index on (time, id), i.e. changed the order of columns in the clustered index. We were expecting that the performance will be the same as for the non-clustered index, but not - it still takes a lot of time.

When we run the single query in the Query Analyzer, it returns immediately, and the query plan shows that clustered index seek is used.

So, the questions are - why the clustered index behaves so badly, and are there any other ways, except of creating non-clustered index with all columns included, to get the good performance?

EDIT: I've checked the SQL Profiler output - the queries use the indexes exactly as they are supposed to do - the fast one uses non-clustered seek, the slow one uses clustered seek. There are no other steps in the plan - the query is really simple. Both of the queries show approximately the same CPU counter, but the slow (clustered) one makes 10 times more reads.

Range query with non-clustered index, including all of the queried fields, runs much faster than the same query with clustered index

Answers (1)

Related Questions