SBFrancies
SBFrancies

Reputation: 4240

Azure SQL query performance significantly degrades when WHERE clause returns a high proportion of rows

I have been trying to get to bottom of a performance bottle neck in a web app I am developing. I have managed to identify the SQL query which is causing the problem but I am uncertain as to how to resolve it. The basic query is:

SELECT *
FROM Table
WHERE ColumnA = 0
ORDER BY AnotherColumn
OFFSET 0 ROWS FETCH NEXT 20 ROWS ONLY

Column A is of type BIT, is nullable and holds no default value. At the moment every row (around 290,000 in the table) holds a value of 0. Currently the query takes around 1 minute and 50 seconds to complete.

What I find odd is that by changing a small proportion of the values of ColumnA in the database to 1 the performance dramatically increases.

Simply by running:

UPDATE Table SET ColumnA = 1 WHERE ID % 100 = 0

Which switches the value in about 1% of the rows, the query time is reduced to 7 seconds - over 90% quicker.

I don't understand why there is such a dramatic difference and can think of no way to optimise the query to resolve the issue. Removing the WHERE clause entirely results in the same ~7 second query time so I do not think it has to do with the data which is being returned.

I am using AzureSQL with EFCore but have been running the above queries in SSMS to try to get to the bottom of the issue.

Upvotes: 4

Views: 617

Answers (2)

Thom A
Thom A

Reputation: 95564

The problem here was the lack of indexing. From your data, you only had 1 index on your table, on the Primary key, nothing else. This means, if you look at the values of the primary key, things are nice and quick, however, for other things, not so much.

When you start querying things, especially in the WHERE, ORDER, ON, etc clauses, on tables with a lot of data, and without indexes on those columns, things start to slow down. Why? Because the SQL Server doesn't know where to look for that data, so it has to check every row.

Consider your data, with it's ID column and Column A. Column A is actually a derived value of ID (Let's just go with ID % 100), however, your column is a persisted value but not calculated from ID. Then you ask the SQL Server "Can I have all the rows where the value of Column A is 0, please?" SQL Server has no idea what those rows contain, and it has no INDEX to help it, thus off it goes, checking every single one of them as it goes through all the IDs.

Now, imagine you have an INDEX on that column. Now, when you ask the Server the same question, it can look at the index. The index, will have an ordered list of Column A, and tell the Server its corresponding ID (which is where the data is stored). The SQL Server can then look at that and see that all the Column As with the value 0 have nicely been placed together in its list (Index); then it simply goes to each ID it needs. It doesn't end up checking the value of Column A for every row.

This is, of course, a very "simplistic" way of looking at Indexes. They are, in actuality, far more complex. Indexes, normally speed up getting data from your Server, but it's worth noting that they SLOW DOWN some tasks, such as INSERT. This is because when it writes the data, it also has to update the indexes. This means more IO as well, so slower discs will have performance issues as well (although they will for a SELECT as well). UPDATE commands can be faster, depending on what you're doing.

As I said, this is a very basic description; but might help you understand a little more. This is, in no way, me saying you should put an INDEX on every column. Knowing which columns to Index and how is a very important thing, but it can in no way be taught by a single Answer on SO.

Upvotes: 4

SBFrancies
SBFrancies

Reputation: 4240

Adding an index on (ColumnA, AnotherColumn) solved the problem. Now the query takes less than 1 second. Thanks to Larnu for pointing me in the right direction. I would still appreciate an answer than clarifies why the performance was so bad in the first place.

Upvotes: 2

Related Questions