20s of PAGEIOLATCH_SH on Azure SQL Database with execution plan of 0.06 subtree cost

(Azure support is giving vague answers, so I'm turning to Stack Overflow with hope! :))

Situation

User complains of timeout errors on a query. I run the same query twice (same parameters) from MSMS. First run takes looong (23s or sometimes 50s), second, third, etc. runs take <1s. Execution Plans are identical and have a subtree cost of 0.0671..

The difference I do notice is a WaitStats section in the first Execution Plan with the following values:

WaitCount: 2751
WaitTimeMS: 6
WaitType: MEMORY_ALLOCATION_EXT

WaitCount: 2751
WaitTimeMS: 6
WaitType: IO_QUEUE_LIMIT

WaitCount: 669
WaitTimeMS: 20360
WaitType: PAGEIOLATCH_SH

Azure SQL DTU is averaging around 5% max.

Azure Support says it's probably the Execution Plan compilation time. I doubt that since clearing the Proc Cache does not reintroduce the long waits after the first run.

The execution plan "leafs" are Index Seek (NonClustedred), Key Lookup (Clustered) and RID Lookup (Heap). The RID Lookup being 39% (of the 0.0671). A single row is returned (i.e. TOP 1).

3 tables are in the query with CROSS APPLY. The biggest one has 8 million rows and contains a ~40KB VARBINARY column (not referenced nor returned anywhere in the query).

Query

DECLARE @p0 VARCHAR(50); SET @p0 = '<GUID1>'

SELECT TOP 1 p.Id, p.DateCreatedUtc, p.PreviousOwnerId
FROM (
    -- last project save
    SELECT ps.Id AS psId, p.*
    FROM Projects p
    CROSS APPLY (
      SELECT TOP 1 *
      FROM ProjectSaves
      WHERE ProjectId = p.Id
      ORDER BY LastModifiedUtc DESC
    ) AS ps
    WHERE p.OwnerId = @p0
) p
CROSS APPLY (
    SELECT TOP 1 *
    FROM ProjectSavePhotos
    WHERE ProjectSaveId = p.psId AND (name LIKE 'uploads%')
) ps
WHERE P.IsDeleted = 0 AND p.Id NOT IN ('<GUID2>')
ORDER BY p.DateCreatedUtc DESC

p.OwnerId is indexed and Azure automatically created two more indexes:

on OwnerId and Id and
on IsDeleted, OwnerId and Id

ps.ProjectId is indexed and includes LastModifiedUtc

psp.ProjectSaveId is indexed and includes name

Question

How can I diagnose the root cause of the 20s PAGEIOLATCH_SH? Could it be the mere presence of the VARBINARY column? If so, how can I confirm that?

References read

https://www.sqlshack.com/handling-excessive-sql-server-pageiolatch_sh-wait-types/

https://sqlperformance.com/2014/06/io-subsystem/knee-jerk-waits-pageiolatch-sh

Upvotes: 2

Answers (3)

influent

Reputation: 1377

Unless I'm missing something this sounds normal, but I'd have to see the execution plan. You probably want to get rid of that Key Lookup. Is the Key Lookup grabbing a column that's missing from the index used in the Index Seek? If yes, add it to the index in the Index Seek and see what happens. A cross apply with such a large table can eat up a lot of buffer space, depending on how wide it is, but without seeing the query I can't say if there's a better way to get your data. This might help as well: https://stackoverflow.com/a/17572392/163072

Adding a clustered PK to ProjectSaves could also make a big difference.

Upvotes: 1

hokkaidi

Reputation: 910

Any type of data is stored as a page in SQL Server. There are different types of page depending on what the page contains (data, index, etc). See documentation page.

The fundamental unit of data storage in SQL Server is the page. The disk space allocated to a data file (.mdf or .ndf) in a database is logically divided into pages numbered contiguously from 0 to n. Disk I/O operations are performed at the page level. That is, SQL Server reads or writes whole data pages.

When you run a query against your database, the SQL Engine will look in the system memory (buffer) to see if it has all the pages required to execute your query. If some pages are missing, the SQL Engine will load them in memory from the disk.

The PAGEIOLATCH_SH wait corresponds to pages being pulled from disk to memory (buffer). Once pages have been loaded in system memory, they remain there until evicted. This is the reason why the first run of your query takes more time than the subsequent runs. During the first run, the SQL Engine needs to retrieve data from disk. For subsequent run, this is no longer the case.

To reduce the first query wait, there are different strategies. As mentioned by Alberto, if you run this query often, then the pages are less likely to be evicted from the buffer. As mentioned by influent, you can rewrite your query or create new indexes so that the SQL Engine does not have to load so many pages. Please post the query plan for further investigation.

Upvotes: 2

Alberto Morillo

Reputation: 15688

What you are seeing is the effect of SQL Azure Database shrinking memory allocation after the database has not been used for some time or the database tier has been scaled. The behavior is exactly as you mention, the first execution or first few executions run with poor performance until memory allocation goes back to normal. You won't see that on a database continuously used.

That behavior on memory allocation creates those waits you are seeing during the first executions of the query, and you don't see that behavior with Microsoft SQL Server. For things like these I commonly say that Azure SQL Database and SQL Server are not the same, they are different in many ways.