Reputation: 386
We are loading big amounts of data (~200 million rows, ~10GB in size) from SQL Server. The server is SQL Server 2019, running on a Linux machine, with plenty of memory and very fast network. The data is later processed by Spark application, but that doesn't matter, we get the same results with a test application.
The table we're querying is very wide, it has about 100 columns, and the query selects some of those columns, based on the time range. And, since we're running it from Spark, it is distributed among executors and additional time filter is added.
Something like:
select *
from
(select time, col1, col5, col38
from table
where time > '2020-01-01 00:00:00.000'
and time <= '2020-02-01 00:00:00.000') subq
where time > '2020-01-05 00:00:00.000'
and time <= '2020-01-06 00:00:00.000'
Initially, the table had only the unique clustered index (same as PK) on id
(not used in our queries, but essential for other usages) and time
, in that specific order. That, of course, gave us unacceptable results, like hours of loading time.
After we've added non-clustered index on the time column, the timings significantly improved, the loading started to take about 30-40 minutes.
Then we added all the queried columns to the non-clustered index, as 'included' columns. The performance improved dramatically, now that specific query completes in 20-30 seconds!
However, the problem here is that we don't know the exact columns in advance, there can be (almost) any combination of them. And building a new index takes hours.
So, we thought that non-clustered index on the time column with all included columns is essentially the same as the clustered index on time column, we've created a copy of the table with clustered index on (time, id), i.e. changed the order of columns in the clustered index. We were expecting that the performance will be the same as for the non-clustered index, but not - it still takes a lot of time.
When we run the single query in the Query Analyzer, it returns immediately, and the query plan shows that clustered index seek is used.
So, the questions are - why the clustered index behaves so badly, and are there any other ways, except of creating non-clustered index with all columns included, to get the good performance?
EDIT: I've checked the SQL Profiler output - the queries use the indexes exactly as they are supposed to do - the fast one uses non-clustered seek, the slow one uses clustered seek. There are no other steps in the plan - the query is really simple. Both of the queries show approximately the same CPU counter, but the slow (clustered) one makes 10 times more reads.
Upvotes: 0
Views: 417
Reputation: 6685
Short version - Something doesn't sound correct.
I suggest
OPTION (RECOMPILE)
to the query in case there is a poor cached query planAs an addendum, it is worth reviewing/posting the actual query plans and/or the runs with SET STATISTICS TIME, IO ON;
- and comparing good runs to bad ones. I often find the statistics run points me to problematic areas quickly, though the plan tells me what it's doing with those.
Broader thoughts
The fastest time will be the relevant covering non-clustered index (e.g., the one on Time and has all the fields you need, but no more).
The version with just a non-clustered index on Time (no other fields) would need to do a Seek-lookup process (a nested loop join) to get the rest of the data from the clustered index, one row at a time. This is probably why your performance is better with the Time non-clustered index, but is still slow (e.g., if the filtering finds 100,000 rows, it needs to go back to the clustered index 100,000 times to get the extra columns).
The non-clustered index including all the relevant fields overcomes the issue about the nested loop join - and is therefore a lot faster.
The thing that surprises me is that a clustered index read would be so much slower. If you have a useful clustered index (Time) so it can do a clustered index seek, then it shouldn't be that much slower. The extra time it needs would be to read all the other columns (e.g., the clustered index reads will need to read all the columns as well) but the number of rows read should be the same.
That's why I would suggest there seems to be something wrong in the process - it doesn't sound like it's doing a clustered index seek.
It is worth looking at the execution plan of the process - it may be that the insert is then taking a lot of time as it needs to sort the data before the insert (e.g., as the clustered index is not on Time, but the insert target's clustered index is ID, it would need to re-sort the data before inserting it). However, this would be the same whether it was getting the data from your covering non-clustered index, or new clustered index.
Upvotes: 1