Chris

Reputation: 1229

Microsoft SQL sys.dm_os_wait_stats add up to more milliseconds than the time period

The point of this post is ultimately to determine the CPU and IO utilization of a SQL Server box. Traditionally we'd use @@cpu_busy, @@io_busy and @@idle to determine that; however, on MSSQL those counters quit working after 28 days. We already have the CPU utilization from a different source on the box, but we still need to determine how IO bound it is.
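
For context, this is roughly how those counters are read; a minimal sketch, using @@TIMETICKS to convert the cumulative tick counts to milliseconds:

-- Sketch: the classic cumulative counters, converted from ticks to ms.
-- The underlying int counters wrap, which is why they quit after ~28 days.
SELECT CAST(@@CPU_BUSY AS bigint) * @@TIMETICKS / 1000 AS cpu_busy_ms,
       CAST(@@IO_BUSY  AS bigint) * @@TIMETICKS / 1000 AS io_busy_ms,
       CAST(@@IDLE     AS bigint) * @@TIMETICKS / 1000 AS idle_ms;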

When looking at the data in sys.dm_os_wait_stats and calculating deltas every ten minutes, the total wait time can exceed ten minutes. I also tried dividing by waiting_tasks_count, but the data still doesn't make sense.
Basically, we want to turn each wait type into a percentage wait for the ten-minute period. But if the total wait time exceeds ten minutes, one cannot simply divide the wait time by ten minutes to get the percentage utilized. The delta collection looks roughly like the sketch below.
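
A minimal sketch of that delta collection (the temp table name #snap1 is just for illustration):

-- Sketch: snapshot the cumulative wait stats, wait ten minutes, then diff.
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO #snap1
FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:10:00';

SELECT s2.wait_type,
       s2.wait_time_ms        - s1.wait_time_ms        AS delta_wait_ms,
       s2.signal_wait_time_ms - s1.signal_wait_time_ms AS delta_signal_ms,
       s2.waiting_tasks_count - s1.waiting_tasks_count AS delta_tasks
FROM sys.dm_os_wait_stats AS s2
JOIN #snap1 AS s1
  ON s1.wait_type = s2.wait_type
ORDER BY delta_wait_ms DESC;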

We are trying to determine a metric to show how IO bound a box is.

https://msdn.microsoft.com/en-us/library/ms179984.aspx

wait_type = Name of the wait type. For more information, see Types of Waits, later in this topic.

waiting_tasks_count = Number of waits on this wait type. This counter is incremented at the start of each wait.

wait_time_ms = Total wait time for this wait type in milliseconds.

EDIT

The first answer is on the right track, but not quite it. What that statistic shows, for a given time interval, is the percentage of the total wait that is attributable to any one particular wait type. See the following graph.

[Graph of deltas compared to @@io_busy]

EDIT

Correlation Matrix based on deltas over 10 minute intervals:

                 wait_time_ms wait.NO.signal signal_wait_time @@io_busy @@cpu_busy ioPct cpuPct
wait_time_ms              100            100               70      74           58    71     58
wait.NO.signal            100            100               64      72           53    69     53
signal_wait_time           70             64              100      71           89    67     89
@@io_busy                  74             72               71     100           77    99     77
@@cpu_busy                 58             53               89      77          100    75    100
ioPct                      71             69               67      99           75   100     75
cpuPct                     58             53               89      77          100    75    100

On the above chart, one can see that the signal wait time is most highly correlated with the @@cpu_busy tick counter deltas, while the wait time is most correlated with the @@io_busy counter deltas.

According to the @@ counters, this SQL box is CPU bound (cpu% a lot higher than io%), while according to the wait stats it is IO bound. According to sys.dm_os_ring_buffers, the box is CPU bound. I believe what the SystemHealth/SystemIdle records say.
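
For reference, a sketch of reading those SystemHealth records from the scheduler-monitor ring buffer:

-- Sketch: SystemIdle and SQL process CPU from the scheduler monitor
-- ring buffer (one record per minute, most recent first).
SELECT TOP (10)
       rb.timestamp,
       rb.record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_pct,
       rb.record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_process_cpu_pct
FROM (SELECT timestamp, CAST(record AS xml) AS record
      FROM sys.dm_os_ring_buffers
      WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
        AND record LIKE N'%<SystemHealth>%') AS rb
ORDER BY rb.timestamp DESC;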

This article suggests that one can use the signal wait time vs. the total wait time to get a CPU pressure %. However, compared to the @@cpu_busy data, I strongly suspect his conclusion is only partially correct: if his cpuPressure% is high, then yes, adding CPU power will help, but it is not the whole story (see the sketch after the table). http://blogs.msdn.com/b/sqlcat/archive/2005/09/05/461199.aspx

                 wait_time_ms cpuPress wait.NO.signal signal_wait_time   @@io_busy   @@cpu_busy ioPct cpuPct
cpuPress                  -50      100            -56               25         -11           25   -11     25
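
The cpuPress metric from that article reduces to roughly this; a sketch over the cumulative stats (the same ratio can be taken over the 10-minute deltas):

-- Sketch: share of all wait time that is signal wait, i.e. time spent
-- runnable and waiting only for a CPU.
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS decimal(5, 2)) AS cpu_pressure_pct
FROM sys.dm_os_wait_stats;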

EDIT

The following works for one of the chosen boxes; however, given the different core counts across boxes, we will have to factor that in.


summary(m)

Call:
lm(formula = ioPct ~ cpuPct + signal_wait_time + wait_time_ms, 
    data = rd)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.13921 -0.75004 -0.07748  0.60897  2.14655 

Coefficients:
                        Estimate      Std. Error t value       Pr(>|t|)    
(Intercept)      -0.442311370934  0.085652949324  -5.164 0.000000286383 ***
cpuPct            0.123717691895  0.004503995334  27.468        < 2e-16 ***
signal_wait_time -0.000000302969  0.000000046933  -6.455 0.000000000161 ***
wait_time_ms      0.000000022240  0.000000002534   8.777        < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9959 on 1109 degrees of freedom
Multiple R-squared:  0.7566,    Adjusted R-squared:  0.7559 
F-statistic:  1149 on 3 and 1109 DF,  p-value: < 2.2e-16
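
To factor in the core count mentioned above, something like this could be joined in per box (a sketch):

-- Sketch: scheduler/core counts, to normalize the model across boxes.
SELECT cpu_count, hyperthread_ratio
FROM sys.dm_os_sys_info;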

Upvotes: 1

Views: 508

Answers (2)

Remus Rusanu

Reputation: 294387

The wait stats are registered by a worker (i.e. a thread). If, during a single second, 10 threads each suspend for 0.9 seconds and execute for 0.1 seconds, you have 10 x 0.9 = 9 seconds of total wait time and 10 x 0.1 = 1 second of total CPU time (the elapsed 1 second). That is 100% CPU and high wait time (for whatever wait type).

A high signal wait time indicates CPU contention: after the wait is signaled, the thread has to wait longer than needed before it gets a chance to run.
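
A sketch of inspecting that split per wait type (resource wait vs. signal wait):

-- Sketch: split each wait into its resource portion and its signal
-- (runnable, waiting for a CPU) portion.
SELECT wait_type,
       wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY signal_wait_time_ms DESC;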

Upvotes: 1

Scott Ivey

Reputation: 41568

On a multi-core machine, each CPU has its own scheduler and can register its own waits. For example, if you have a query that runs in parallel across multiple CPUs, each branch of the parallel query can register its own CXPACKET waits while it waits for the query to complete.

To get your percentage utilized, just divide by your total wait times, rather than the 10 minutes.

-- Each wait type as a percentage of all wait time, not of wall-clock time.
SELECT wait_type,
       wait_time_ms * 100.00 / SUM(wait_time_ms) OVER() AS pct_of_total_wait
FROM sys.dm_os_wait_stats

Upvotes: 1
