Microsoft – SQL Server – Wait Stats – Latch I/O

Microsoft – SQL Server – Wait Stats – Latch I/O

Reviewing MS SQL Server Performance Metrics and one of the things that I am seeing are longs wait for Page IO Latches.

Per-using Paul Randal’s wait stats investigation sql.  That sql is available @
http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/


/*
Wait statistics, or please tell me where it hurts
http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/
*/
WITH [Waits]
AS (
	SELECT [wait_type]
		,[wait_time_ms] / 1000.0 AS [WaitS]
		,([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS]
		,[signal_wait_time_ms] / 1000.0 AS [SignalS]
		,[waiting_tasks_count] AS [WaitCount]
		,100.0 * [wait_time_ms] / SUM([wait_time_ms]) OVER () AS [Percentage]
		,ROW_NUMBER() OVER (
			ORDER BY [wait_time_ms] DESC
			) AS [RowNum]
	FROM sys.dm_os_wait_stats
	WHERE [wait_type] NOT IN (
			N'CLR_SEMAPHORE'
			,N'LAZYWRITER_SLEEP'
			,N'RESOURCE_QUEUE'
			,N'SQLTRACE_BUFFER_FLUSH'
			,N'SLEEP_TASK'
			,N'SLEEP_SYSTEMTASK'
			,N'WAITFOR'
			,N'HADR_FILESTREAM_IOMGR_IOCOMPLETION'
			,N'CHECKPOINT_QUEUE'
			,N'REQUEST_FOR_DEADLOCK_SEARCH'
			,N'XE_TIMER_EVENT'
			,N'XE_DISPATCHER_JOIN'
			,N'LOGMGR_QUEUE'
			,N'FT_IFTS_SCHEDULER_IDLE_WAIT'
			,N'BROKER_TASK_STOP'
			,N'CLR_MANUAL_EVENT'
			,N'CLR_AUTO_EVENT'
			,N'DISPATCHER_QUEUE_SEMAPHORE'
			,N'TRACEWRITE'
			,N'XE_DISPATCHER_WAIT'
			,N'BROKER_TO_FLUSH'
			,N'BROKER_EVENTHANDLER'
			,N'FT_IFTSHC_MUTEX'
			,N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP'
			,N'DIRTY_PAGE_POLL'
			, N 'SP_SERVER_DIAGNOSTICS_SLEEP'
			)
	)
SELECT [W1].[wait_type] AS [WaitType]
	,CAST([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S]
	,CAST([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S]
	,CAST([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S]
	,[W1].[WaitCount] AS [WaitCount]
	,CAST([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage]
	,CAST(([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL(14, 4)) AS [AvgWait_S]
	,CAST(([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL(14, 4)) AS [AvgRes_S]
	,CAST(([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL(14, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2] 
      ON [W2].[RowNum] = [W1].[RowNum]

GROUP BY [W1].[RowNum]
	,[W1].[wait_type]
	,[W1].[WaitS]
	,[W1].[ResourceS]
	,[W1].[SignalS]
	,[W1].[WaitCount]
	,[W1].[Percentage]
HAVING SUM([W2].[Percentage]) - [W1].[Percentage] > 95;-- percentage threshold
GO



And, ended up with the result pasted below.

Wait Stats - Categorized by wait type

So interesting enough, our biggest factors for Wait Stats are:

  • PageIOLatch_SH –> 75%
  • PageIOLatch_EX –> 12%
  • DBMIRRORING_CMD –> 3%

Having such high numbers for PageIOLatch is a bit “on the outer edge”.

And, so decided to dig a bit better.

Having previously read “Diagnosing and Resolving Latch Contention on SQL Server” and noted a couple of SQL. Per-used those sql to get Latch Stats ordered by wait_time_ms in descending order.


select *
from   sys.dm_os_latch_stats
where  latch_class != 'BUFFER'
order by wait_time_ms desc
;

And, ended up with the result pasted below:

Latch Wait Stats - grouped by latch class

 

Here is Paul Randal’s SQL Script that best aggregates latches and shows them as percentiles :
http://www.sqlskills.com/blogs/paul/most-common-latch-classes-and-what-they-mean/

 


WITH Latches AS
    (
        SELECT
           latch_class,
           wait_time_ms / 1000.0 AS WaitS,
           waiting_requests_count AS WaitCount,
           100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
           ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
        FROM sys.dm_os_latch_stats
        WHERE latch_class NOT IN ( 'BUFFER')
        AND wait_time_ms > 0
    )
SELECT
    W1.latch_class AS LatchClass, 
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(14, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S
FROM Latches AS W1
INNER JOIN Latches AS W2
    ON W2.RowNum = W1.RowNum 
WHERE W1.WaitCount > 0
GROUP BY W1.RowNum, W1.latch_class, W1.WaitS, W1.WaitCount, W1.Percentage
--HAVING SUM (W2.Percentage) – W1.Percentage > 95; -— percentage threshold
GO

Latch Wait Stats - grouped by latch class (percentile)

Obviously, ALLOC_IAM_PAGE_RANGE_CACHE is the worst LATCH CLASS.

What to do — Google for it.  And, unfortunately this time, not much community work.

So please help!

Also, to see if latch wait stats rests principally on few specific tables and indexes, please issue the following:


/*
        source : Diagnosing and Resolving Latch Contention on SQL Server
	Isolating the Object Causing Latch Contention
	The script below uses the resource_description column to 
	isolate which index was causing the PAGELATCH_EX contention

*/
SELECT 
		  wt.session_id
		, wt.wait_type
		, wt.wait_duration_ms 
		, s.name AS schema_name
		, o.name AS object_name
		, i.name AS index_name
FROM sys.dm_os_buffer_descriptors bd 
  JOIN (
	  SELECT *
		--resource_description
	    , CHARINDEX(':', resource_description) AS file_index
	    , CHARINDEX(':', resource_description
                , CHARINDEX(':', resource_description)+1) AS page_index
	    , resource_description AS rd
	  FROM sys.dm_os_waiting_tasks wt 
		  WHERE (
		               (wait_type LIKE 'PAGELATCH%')
		            or (wait_type LIKE 'PAGEIOLATCH%')
		        )    
	  ) AS wt
		ON bd.database_id = SUBSTRING(wt.rd, 0, wt.file_index)
		AND bd.file_id = SUBSTRING(wt.rd, wt.file_index+1, 1) 
		AND bd.page_id = 
                   SUBSTRING(wt.rd, wt.page_index+1, LEN(wt.rd))

JOIN sys.allocation_units au 
		ON bd.allocation_unit_id = au.allocation_unit_id 
JOIN sys.partitions p 
		ON au.container_id = p.partition_id
JOIN sys.indexes i 
		ON p.index_id  = i.index_id 
		AND p.object_id = i.object_id 
JOIN sys.objects o 
		ON i.object_id = o.object_id
JOIN sys.schemas s 
		ON o.schema_id = s.schema_id 

order by wt.wait_duration_ms desc

BTW, if you do find a fix. Implement it, clear your wait and latch stats, wait a couple of hours and re-run your investigative scripts.

To clear your wait stats:
http://msdn.microsoft.com/en-us/library/ms189768.aspx

--clear wait stats
DBCC SQLPERF("sys.dm_os_wait_stats",CLEAR)
  with no_infomsgs;

--clear latch stats
DBCC SQLPERF("sys.dm_os_latch_stats",CLEAR)
  with no_infomsgs;

References:

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s