Microsoft – SQL Server – SQL Instance Crashes with “ex_terminator: Possible termination due to exception”

There is nothing like nice SQL Server Instance crashes to bring the charade of a know it all DBA to a timely end.

Everything going well at the office … That is until this morning when a co-worker emailed letting me know to check a few disks, as they were almost full.

So checked and confirmed that the disks were in-fact almost full.  Why full — well database mirroring session was suspended, as well.

Where do I go from here ..

Restarted Database mirroring, but it kept stopping.

What is next, checked sql server error log on the partner log.  And, saw that I was having I/O latency problems.  Will deal with the Storage bit later … Though slow, the SQL Instance should not terminate abruptly.


2013-02-21 22:53:37.39 spid5s      SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Y:\SQLServer\LogFiles\Sales_LogExt02.ldf] in database [Sales] (5).  The OS file handle is 0x0000000000000D4C.  The offset of the latest long I/O is: 0x0000214b712800
2013-02-21 23:00:33.11 spid5s      SQL Server has encountered 5 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Y:\SQLServer\LogFiles\Sales_LogExt02.ldf] in database [Sales] (5).  The OS file handle is 0x0000000000000D4C.  The offset of the latest long I/O is: 0x0000214b73d200
2013-02-21 23:26:34.70 spid5s      SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Y:\SQLServer\LogFiles\Sales_LogExt02.ldf] in database [Sales] (5).  The OS file handle is 0x0000000000000D4C.  The offset of the latest long I/O is: 0x0000214b71a400
2013-02-21 23:43:59.15 spid5s      SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [Y:\SQLServer\LogFiles\Sales_LogExt02.ldf] in database [Sales] (5).  The OS file handle is 0x0000000000000D4C.  The offset of the latest long I/O is: 0x0000214b71a400
2013-02-21 23:59:50.37 spid29s     Error: 823, Severity: 24, State: 6.
2013-02-21 23:59:50.37 spid29s     The operating system returned error 1117(The request could not be performed because of an I/O device error.) to SQL Server during a write at offset 0x0000214b71a400 in file 'Y:\SQLServer\LogFiles\Sales_LogExt02.ldf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2013-02-21 23:59:50.38 spid29s     Error: 1454, Severity: 16, State: 1.
2013-02-21 23:59:50.38 spid29s     Database mirroring will be suspended. Server instance 'HRDB' encountered error 823, state 6, severity 24 when it was acting as a mirroring partner for database 'Sales'. The database mirroring partners might try to recover automatically from the error and resume the mirroring session. For more information, view the error log for additional error messages.
2013-02-21 23:59:50.39 spid29s     Error: 823, Severity: 24, State: 6.
2013-02-21 23:59:50.39 spid29s     The operating system returned error 1117(The request could not be performed because of an I/O device error.) to SQL Server during a write at offset 0x0000214b71a400 in file 'Y:\SQLServer\LogFiles\Sales_LogExt02.ldf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2013-02-21 23:59:50.39 spid29s     Error: 823, Severity: 24, State: 6.
2013-02-21 23:59:50.39 spid29s     The operating system returned error 1117(The request could not be performed because of an I/O device error.) to SQL Server during a write at offset 0x0000214b71a400 in file 'Y:\SQLServer\LogFiles\Sales_LogExt02.ldf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
2013-02-21 23:59:50.39 spid29s     ex_terminator: Possible termination due to exception during stack unwinding.
2013-02-21 23:59:50.42 spid29s     Using 'dbghelp.dll' version '4.0.5'
2013-02-21 23:59:50.43 spid29s     **Dump thread - spid = 0, EC = 0x00000000811A6B20
2013-02-21 23:59:50.43 spid29s     ***Stack Dump being sent to d:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0001.txt
2013-02-21 23:59:50.43 spid29s     * *******************************************************************************
2013-02-21 23:59:50.43 spid29s     *
2013-02-21 23:59:50.43 spid29s     * BEGIN STACK DUMP:
2013-02-21 23:59:50.43 spid29s     *   02/21/13 23:59:50 spid 29
2013-02-21 23:59:50.43 spid29s     *
2013-02-21 23:59:50.43 spid29s     * ex_terminator - Last chance exception handling
2013-02-21 23:59:50.43 spid29s     *
2013-02-21 23:59:50.43 spid29s     *  
2013-02-21 23:59:50.43 spid29s     *
2013-02-21 23:59:50.43 spid29s     *  MODULE                          BASE      END       SIZE
2013-02-21 23:59:50.43 spid29s     * sqlservr                       00000000008D0000  00000000044E1FFF  03c12000

2013-02-21 23:59:50.43 spid29s     *
2013-02-21 23:59:50.43 spid29s     *     P1Home: 0044004C00510053:  
2013-02-21 23:59:50.43 spid29s     *     P2Home: 00000000005A02B4:  0022002500210023  0024002900230027  0026002D0025002B  002800310027002F  002A003500290033  002C0039002B0037  
2013-02-21 23:59:50.43 spid29s     *     P3Home: 00000000005A1D20:  000000000052AD80  000000000052AD80  0000000000000000  0000000000000000  0000000000000000  0000000000000000  
2013-02-21 23:59:50.43 spid29s     *     P4Home: 0000000000000000:  
2013-02-21 23:59:50.43 spid29s     *     P5Home: 0052004500AE00AC:  
2013-02-21 23:59:50.43 spid29s     *     P6Home: 0000000004EF0D38:  0050005C003A0064  00720067006F0072  00460020006D0061  00730065006C0069  00630069004D005C  006F0073006F0072  
2013-02-21 23:59:50.43 spid29s     * ContextFlags: 000000000010000F:  
2013-02-21 23:59:50.43 spid29s     *      MxCsr: 0000000000001FA0:  
2013-02-21 23:59:50.43 spid29s     *      SegCs: 0000000000000033:  
2013-02-21 23:59:50.43 spid29s     *      SegDs: 000000000000002B:  
2013-02-21 23:59:50.43 spid29s     *      SegEs: 000000000000002B:  
2013-02-21 23:59:50.43 spid29s     *      SegFs: 0000000000000053:  
2013-02-21 23:59:50.43 spid29s     *      SegGs: 000000000000002B:  
2013-02-21 23:59:50.43 spid29s     *      SegSs: 000000000000002B:  
2013-02-21 23:59:50.43 spid29s     *     EFlags: 0000000000000202:  
2013-02-21 23:59:50.43 spid29s     *        Rax: 00000000539324E5:  
2013-02-21 23:59:50.43 spid29s     *        Rcx: 00000000249E7520:  0044004C00510053  00000000005A02B4  00000000005A1D20  0000000000000000  0052004500AE00AC  0000000004EF0D38  
2013-02-21 23:59:50.43 spid29s     *        Rdx: 0000000000000000:  
2013-02-21 23:59:50.43 spid29s     *        Rbx: 0000000000000000:  
2013-02-21 23:59:50.43 spid29s     *        Rsp: 00000000249E7B30:  0000000000000000  00000000811A6B20  00000000249E7D60  000007FFFFF5A000  00000000000042AC  0000000000000000  
2013-02-21 23:59:50.43 spid29s     *        Rbp: 00000000249EBA40:  0000000002C8CCC0  00000000249EBB00  0000000000000103  00000000249EBF98  FFFFFFFF00000000  00000000008D0000  
2013-02-21 23:59:50.43 spid29s     *        Rsi: 00000000811A6B20:  00000000811A6B20  00000000811A6A20  0000000000000000  0000000000000000  00000000811A6B40  00000000811A6B40  
2013-02-21 23:59:50.43 spid29s     *        Rdi: 00000000249E7D60:  0000000004EF0D30  000007FEFE9B1130  00000000FFFFFFFE  00000000249E7E08  00000000249E7DF6  00000000008D0000  
2013-02-21 23:59:50.43 spid29s     *         R8: 0000000000000000:  
2013-02-21 23:59:50.43 spid29s     *         R9: 0000000000000000:  
2013-02-21 23:59:50.43 spid29s     *        R10: 0000000000000000:  
2013-02-21 23:59:50.43 spid29s     *        R11: 00000000249E8340:  0044004C00510053  00000000005A02B4  00000000005A1D20  0000000000000000  0052004500AE00AC  0000000004EF0D38  
2013-02-21 23:59:50.43 spid29s     *        R12: 0000000000000001:  
2013-02-21 23:59:50.43 spid29s     *        R13: 00000000811A6A20:  00000000811A6080  00000000811A6240  0000000000000000  00000000811A69D0  00000000811A6A90  00000000810A25C0  
2013-02-21 23:59:50.43 spid29s     *        R14: 0000000002DF2C80:  0074005F00780065  0069006D00720065  006F00740061006E  0020002D00200072  007400730061004C  0061006800630020  
2013-02-21 23:59:50.43 spid29s     *        R15: 000000000000002F:  
2013-02-21 23:59:50.43 spid29s     *        Rip: 000007FEFD08A49D:  C3000000C8C48148  003B830874F68548  0001BD1689660376  02D6840FFD3B0000  F3840F02FF830000  870F07FF83000002  
2013-02-21 23:59:50.43 spid29s     * *******************************************************************************
2013-02-21 23:59:50.43 spid29s     * -------------------------------------------------------------------------------
2013-02-21 23:59:50.43 spid29s     * Short Stack Dump
2013-02-21 23:59:50.47 spid29s     000007FEFD08A49D Module(KERNELBASE+000000000000A49D)
2013-02-21 23:59:50.47 spid29s     00000000029A974D Module(sqlservr+00000000020D974D)
2013-02-21 23:59:50.47 spid29s     00000000029ADC86 Module(sqlservr+00000000020DDC86)
2013-02-21 23:59:50.47 spid29s     00000000029AD7E9 Module(sqlservr+00000000020DD7E9)

.....
2013-02-21 23:59:50.51 spid29s     0000000076FCC521 Module(ntdll+000000000002C521)
2013-02-21 23:59:50.52 spid29s     Stack Signature for the dump is 0x000000016A13D61A
2013-02-21 23:59:51.59 spid29s     External dump process return code 0x20000001.
External dump process returned no errors.

Looked more at the errorlog and found a few google-able data:

  • ex_terminator – Last chance exception handling
  • spid29s     0000000076FCC521 Module(ntdll+000000000002C521)
  • spid29s     Stack Signature for the dump is 0x000000016A13D61A
  • spid29s     External dump process return code 0x20000001.

The spid stayed at 29, who is spid 29

spid 29

So I know that SPID belongs to “BRKR Task”.
Next, what is BRKR Task?

Seeing Remus Rasanu’s name on the first link, let me know to believe that Q/A and know I was not going to get any better answer on the Internet.

So BRKR is part of Service broker.  And, I know I am heavily using Database Mirroring, I knew the problem is probably somewhere there.
And, did I mention that the offending SQL Instance is a passive DB Mirror Node.

Googled a bit on 0x20000001 and found one of my own :

https://danieladeniji.wordpress.com/2012/10/06/microsoft-sqlserver-error-non-yielding-scheduler-external-dump-process-return-code-0x20000001/

I don’t believe in I as much as I believe in Remas Rasanu, but I have to go with applying the latest patch.

Before was on MS SQL Server v2008/R2 -> SP1   [10.50.4000]

Now on v2008/R2 -> SP1 ( Cumulative patch 5) [10.50.4276]

Per Googling for ex_terminator, Graham Kent has a good blog post, as well:

http://blog.basefarm.com/blog/2011/04/28/sql-server-2008-crash-ex_terminator-last-chance-exception-handling/

One of the other errors, we were experiencing says:

2013-04-12 01:15:27.79 spid28s Database HRDB has more than 1000 virtual log files which is excessive. Too many virtual log files can cause long startup and backup times. Consider shrinking the log and using a different growth increment to reduce the number of virtual log files. 

And, that error is true, as well.

So let us determine how many VLFS we have:


if object_id('tempdb..#LogInfo') is not null 
begin 
	drop table #LogInfo 
end 

CREATE TABLE #LogInfo
(
	  FileID	  BIGINT
	, FileSize	  BIGINT
	, StartOffset BIGINT
	, FSeqNo	  BIGINT
	, [Status]    BIGINT
	, Parity	  BIGINT
	, CreateLSN   VARCHAR(50) 
) 

insert into #LogInfo 
EXECUTE ('DBCC LOGINFO') 

select 
		  tblSysfile.name
		, tblLogInfo.fileid
		, count(*) as cnt 

from #LogInfo tblLogInfo 

inner join sys.database_files tblSysfile 
		on tblLogInfo.FileID = tblSysfile.[file_id]

group by 
		  tblSysfile.name
		, tblLogInfo.fileid 

order by COUNT(*) desc 

drop table #LogInfo 

Output of VLF Count:
getVLFCount

So we have a lot if VLFS especially on the originally created database log files.  How can I tell which log files were part of the first batch created, by looking at the file id (2, 7, 6, 8).

How big are they and how heavily used are they:


--use database

select 
      DB_NAME(DB_ID()) as databaseName
    , name 
    , physical_name
    , (size / 128) as sizeInMB
    , ((FILEPROPERTY(name, 'spaceUsed'))) / 128 as spaceUsed	
    , ( size - (FILEPROPERTY(name, 'spaceUsed'))) / 128 as avail

from    sys.database_files tblDatabaseFile		

where

	(
	   (tblDatabaseFile.type = 1)

	)

Database Log file usage Stats

 

Confirming that our database log files are actually full, I had no movement.

As we had available storage already allocated to our physical drives, but not in use by specifically pre-sized database log files, went ahead and created a new log file.


--use database

if FILE_ID(N'HRDB_LogExt04') is null
begin

    print 'Adding Log File (HRDB_LogExt04)' 					

  ALTER DATABASE [HRDB] 
   ADD LOG FILE ( 
		  NAME = N'HRDB_LogExt04'
		, FILENAME = N'Y:\LogFiles\HRDB_LogExt04.ldf' 
		, SIZE = 50GB
		, FILEGROWTH = 0
	       )

    print 'Added Log File (HRDB_LogExt04) '					

end

GO

Unfortunately, this is a database log file and not a database data file, and so “Instant File Initialization” (Perform Volume Maintenance tasks) will not help you this time.

Why will it not help, you ask ? Well I read somewhere that database log files have to be completely nulled-out; this prevents cyclic redundancy problems.

And, so take a break, hopefully not a smoke one, and wait for your database log file to be created.

Once the database log file has been created, try to resume mirroring.

The best way to determine whether Database Mirroring is to use Microsoft Windows \ Performance Monitor and track on the following metrics (on the passive node):

  • SQLServer:Database Mirroring / “Redo Bytes/sec”
  • SQLServer:Database Mirroring / “Redo Queue KB”

Database Mirroring - Performance Monitor

 

How can you prevent this type of problems in the future?  One wise man once smirked that through  a lot of prayers and fasting.

Something about being fastidious…

But, honestly:

  • Address your Virtual Log File issues by properly scoping your database log file growths
  • Monitor database mirroring
  • Monitor sql server in general
  • Stay on friendly terms with your Storage Engineers
  • Apply Microsoft SQL Server patches (yes cumulative patches, as well)
  • Apply OS Storage patches
  • Apply Storage Vendor (HBA) driver patches

References:

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s