Microsoft – SQLServer – Error 829, Status 1, Severity 21

Seeing a couple of errors on our DB:

'TCP://DBSales002:5022', the remote mirroring partner for database 'InStoreSales', encountered error 829, status 1, severity 21. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.

Error: 1453, Severity: 16, State: 1.

Upon Seeing these errors opted to research them a bit more. As for Error 829, here is MS Write-up:

How to troubleshoot Error 3414 and a failed database recovery with SQL Server http://support.microsoft.com/kb/2015741

In the case of a committed transaction that should be rolled forward, the page can be marked inaccessible (any future attempts to access the page result in Msg 829) and recovery can be allowed to complete. In this situation, the error must be corrected by restoring the page from a backup or deallocating the page by using DBCC CHECKDB with repair.

So from my reading, it appears that some entries were pre-written to the log files, but a System Crash prevented those entries from being propagated to the database files.  Upon system reboot, the database engine is attempting to roll forward ( read the Database Log and harden the changes on to the database datafiles), but it appears to be having problem doing so. So what to do?

  • Determine the extent of the corruption
  • Consider repair avenues

Determine the extent of Corruption:

There are a couple of paths that one can consider to determine extent of corruption.

  1. Read through SQL Server Error Log
  2. Perform a select against msdb.dbo.suspect_pages

Though the “suspect_pages” has been SQL Server since v2005, I only started reading about it. Yet, it offers quite a granular and very useful insight into SQL Server’s own understanding and discovery of internal database files and page corruption.

select 
	  DB_NAME(tblSuspect.database_id) as dbName
	, tblFile.name as fileSymbolicName			  
	, tblFile.physical_name as fileName
	, tblSuspect.database_id
	, tblSuspect.file_id
	, tblSuspect.page_id
	, tblSuspect.event_type		
	, case
		when (event_type = 1) then '824 errors'
		when (event_type = 2) then 'Bad Checksum'
		when (event_type = 3) then 'Torn Page'	
		when (event_type = 4) then 'Restored'
		when (event_type = 5) then 'Repaired'
		when (event_type = 6) then 'Deallocated by DBCC'
	  end as EventTypeLiteral		
	, tblSuspect.error_count
	, tblSuspect.last_update_date

from   [msdb]..suspect_pages tblSuspect

	  inner join sys.master_files tblFile

	on tblSuspect.database_id = tblFile.database_id
	and tblSuspect.file_id = tblFile.file_id

order by tblSuspect.last_update_date desc

The table is concise and has the following columns:

  1. database_id
  2. file_id
  3. page_id
  4. event_type
  5. error_count
  6. last_update_date

Using “Understanding & Managing suspect pages” web page, http://msdn.microsoft.com/en-us/library/ms191301(v=sql.90).aspx, was able to determine that:

  1. We have at least 5 pages that are corrupted
  2. The corruption is limited to a single database – database id –> 5.  To determine the database name, issue db_name(<database-id>)
  3. Also able to determine when the corruption was detected; fairly recently
  4. One of the errors is already corrected via “DBCC”
  5. The are at least 4 pages that continue to have unresolved errors — The error type reads “Bad Checksum”

What is “Bad Checksum”? CheckSum and parity bits are used extensively in electronics transmission to detect & validate errors during data transmission. They are also used during data storage & retrieval to detect tampering.

Database Properties
http://msdn.microsoft.com/en-us/library/ms188124.aspx

Page Verify Specify the option used to discover and report incomplete I/O transactions caused by disk I/O errors. Possible values are None, TornPageDetection, and Checksum.

So in our case, it appears that SQL Server can not vouch for the accuracy of some of its pages. Next to do is to determine which database objects are corrupted and decide whether they can can be salvaged. To determine which objects are corrupted, googled for it and settled on the “never miss” assist by Elizabeth Redei. Personally, one of my favorite Blog postings of all time is Elizabeth’s web posting:

Lies, Damned Lies And Statistics http://sqlblog.com/blogs/elisabeth_redei/archive/2009/08/10/lies-damned-lies-and-statistics-part-i.aspx

Elisabeth RedeiJan 9, 2009
Hi, You can use the un-everything command (well documented on the web though) DBCC PAGE (dbid, fileid, pageid) as in:dbcc page (15, 1,2236) with tableresults HTH /Elisabeth

So listening to Elizabeth tried out:


declare @databaseID int
declare @fileID int
declare @pageID int

set @databaseID = 5
set @fileID = 4
set @pageID = 39283416

dbcc page(@databaseID, @fileID, @pageID) 
     with tableresults

 

based on running “DBCC Page”, was able to trace things back to Object ID 99

 

 

 

In our case, the database is not accessible and so one has to use a different database.  From the Object 99, one is able to deduce that that the object is a system table. And, from the Index ID 0, we are targeting the data pages themselves, and not Index Pages. So trouble… Yes, a bit.  But Googled a bit and knew what our options are. What did we find: Well Paul Randal’s online response.  Paul wrote most of SQL Server’s DBCC Internals while at MSFT.  And, so just as Moses is the law giver and brought the Messianic prophecy, once Paul says it, it is so:

 

DBCC Check DB

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=104090

There are a variety of documents and blog posts that detail the on-disk structures (e.g. Kalen’s books, my blog series on Inside The Storage Engine, etc) but you will not be able to fix the system tables. The error you’re getting suggests that there is more corruption in the system tables –

CHECKDB/CHECKTABLE will stop after the first gross corruption is found in a system table. Your options at this point are to manually use DBCC IND and DBCC PAGE to figure out the list of pages comprising your important tables in logical order and get as much info out of the pages one-at-a-time as you can.

Unless this is critical data and the database is small, this will take a *long* time to do. Unfortunately this isn’t the kind of situation that can be worked over a forum such as this, and Microsoft Product Support won’t help with data recovery (I know this as I co-wrote the policy document that prevents them doing it, while I was at MS).

Hope this helps.
Paul Randal SQL Server MVP, Managing Director, SQLskills.com

 

So again, what to do:

  1. If a passive end of DB Mirror relationship : backup active, disengage mirroring, restore backup (on passive), and re-engage mirroring
  2. If no mirror or active node of Mirror, Restore from a recent backup

 

References:

  1. How to troubleshoot error 3414
    http://support.microsoft.com/kb/2015741
  2. Data Corruption
    http://www.sqlservercentral.com/Forums/Topic345437-266-1.aspx#bm346617
  3. Understanding & Managing the Suspect pages table http://msdn.microsoft.com/en-us/library/ms191301(v=sql.90).aspx
  4. How to find the Object Name or ObjectID using the Page ID
    http://www.sql-server-performance.com/forum/threads/how-do-find-the-object-name-or-object-id-by-page-id-in-sql-2005.29201/

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s