Reputation: 2888
The DBs corresponding to the different environments (e.g. production, staging) of our web app are on the same Azure SQL database server. While I read everywhere that an Azure SQL server is just a logical container (and the DBs on it may not even be on the same physical machine) we see signs of DBs behaving as noisy neighbors, i.e. operations made on one of them affecting the performance of others.
We've seen the below operations and metrics happening correlate between DBs. Let's call one DB "prod" and the other "stage"; in all cases stage was created by copying prod with the Start-AzureWebAppSqlDatabaseCopy
PowerShell commandlet.
With both DBs we use separate DB-level user accounts (on why, see this SO post), but the prod and stage user accounts both exist under both DBs (i.e. we use the stage user to connect to the stage DB, but the stage user also exists under the prod DB, and the prod user also exists under the stage DB). We dropped the stage user from the prod DB to see if that makes a difference, but it didn't.
It may be worth noting that when the Web/Business Azure SQL tiers were phase out these DBs were migrated from Web to their current S1 tier. We see the same issue with DBs on another server too. The DBs are NOT part of an elastic pool.
Our findings are inconclusive and these events don't correlate 100% of the time either. We're out of ideas what to investigate, as we are sure that the stage app doesn't connect to the prod DB. We tried to find evidence of the stage app somehow affecting the prod DB but we couldn't. Any input would be appreciated.
Update 1
Using Grant's sys.dm_os_wait_stats
tip, as well as sys.dm_os_performance_counters
it is evident that yes, if you make a copy of the database on the same logical server it will be created on the same physical SQL Server too. Server name in object_name
is the same, wait values are exactly the same.
This doesn't explain however, why operations on the copy affect the original DB. Since it seems that the noisy neighbor effect doesn't happen all the time (the scale up does affect the original DB most of the time, the perf-heavy operations less so, but the correlation is still pronounced) it might be some random Azure problem.
We'll see if using a different logical server fixes the issue. What's sure is that in that case the physical server will also be different, we've checked that.
Update 2
We're monitoring the situation but whether this indeed solves the issue will most possibly be apparent only after several months. For now we have put all DBs on separate servers.
We did notice timeouts on the prod DB always in the same time interval after all operations on the stage DB completed. These timeouts however seem to only happen for table creations. It's like after copying the prod DB to the stage DB the prod DB is somewhat "locked" for a period of time (about 45-60 minutes) and you can't create tables (but you can drop them, those work). Funnily enough this didn't happen today, so maybe it has resolved itself...
Upvotes: 3
Views: 1437
Reputation: 15694
From the information you provide I suspect the issue is the workload of your databases is occasionally I/O intensive, is hitting the tier limits and Azure SQL starts to throttling. That throttling may be behind those timeouts.
Please monitor resource consumption using below query:
SELECT
(COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats
Service level objective (SLO) of 99.9% <= go to next tier.
Measure DTU consumption over time. Are you getting timeouts when the following query shows high DTU usage.
SELECT start_time, end_time,
(SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent),
(avg_log_write_percent)) AS value(v)) as [avg_DTU_percent]
FROM sys.resource_stats where database_name = 'AdventureWorksLT' order by end_time desc
Compare DTU usage vs DTU limit.
SELECT
end_time AS [EndTime]
, (SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v)) AS [AvgDTU_Percent]
, ((dtu_limit)*((SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v))/100.00)) AS [AvgDTUsUsed]
, dtu_limit AS [DTULimit]
FROM sys.dm_db_resource_stats
Upvotes: 2
Reputation: 2805
The way I would go about determining if this was the case would be to use sys.dm_os_wait_stats along with sys.dm_db_wait_stats. The OS wait stats are for the "server" that your database is running on and the db wait stats are for the database. Gather the db waits for both databases in question and the os waits for both databases. First, straight up compare the os waits. if they're the same (with some margin, I wouldn't expect them to be exactly the same, although, if they are, there's your answer), you may be seeing everything on the same server. If they're not really the same, but sort of similar, then compare the db wait stats for each database to the OS wait stats and see if you can see a direct correlation.
For management purposes only, I would probably separate these on to different servers anyway, even if this wasn't normally an issue. However, if you can find a correlation, then the best bet is probably to break apart the servers. It won't cost you anything. You pay for the database, not the server.
Upvotes: 1