SQL Server – Data Sampling – TABLESAMPLE

Background

A quick follow-up to a couple of blog posts on pruning.  We have a requirement to prune about a dozen log tables.  The lone decisive factor is the number of months elapsed.

After experiencing deadlocks and the like took to the .Net and re-discovered that it might be better to use the Clustering ID rather that than the date.

 

Previous Blogs

  1. Transact SQL – Prune Data based on Date
  2. Transact SQL – Prune Data based on Date – Material Effect of Indexes

 

Why might it be better to use the ID?

There are a couple of percepts behind why it might be better to use the ID…

  1. Halloween Effect
  2. Deadlocks
    • When different column combinations are use to search for data than are used per the actual deletion, the latter locks may already have been granted for another session

SQL Statements

Determine ID High Bar Value

Back to our current problem, we need to identify the high bar


select max([id])
from   [table]
where  [date-column] < dateadd('month', [number-of-months] * -1, getdate())

Actual Deletion

And, here is a snippet of the batched delete command.


declare @NumberofRecordsInEachBatch bigint
declare @iNumberofRecordsAffected   bgint

set @NumberofRecordsInEachBatch = 1000
set @iNumberofRecordsAffected = -1


while (@iNumberofRecordsAffected != 0)
begin


    delete top (@NumberofRecordsInEachBatch) tblL

    from   dbo.[MailingListRequestsLog] tblL

    where  tblL.[entryID] < @idPKCutoff 

    set @iNumberofRecordsAffected = @@ROWCOUNT

end

Index Requirement

We can see quite quickly that we will need an Index on the date column.

Index Costs

As anything else, if a corresponding index does not exist, we have to measure the cost of creating one.

Index Size

Using sp_helpindexinfo, here are our current indexes, their size and number of rows

sp_helpindexInfo

Size is 280 GB and 204 million records.

Index Script

INDX_DBA_SentDt

Tried creating an index on date column…


IF NOT EXISTS 
(
	SELECT * 
	FROM   sys.indexes 
	WHERE  object_id = OBJECT_ID(N'[dbo].[MailingListRequestsLog]') 
	AND    name = N'INDX_DBA_SentDt'
)
begin

	CREATE NONCLUSTERED INDEX [INDX_DBA_SentDt] 
	ON [dbo].[MailingListRequestsLog]
	(
		[SentDt] ASC
	)
	INCLUDE 
	(
		[entryID]
	) 
	WITH 
	(
		FILLFACTOR = 100
	) 
	ON [Indexes2]

end


It is always nice to be able to confidently state that FILL_FACTOR is 100.

Not so sure whether the Primary Key( entryID ) should be part of our index key columns, an add on using INCLUDE, or not needed at all.

 

Create Index

Tried creating the index, but aborted due to glaring session blocks, etc.

Remediation Choices

Create Index ONLINE

If we had an Enterprise Edition of SQL Server, I can try


ONLINE = { ON | OFF }

But, our edition is standard.

 

Sample Table

One hopefully, creative choice is to create a table that sample’s our existing data.

Requirements

  1. It has to be big enough to give us good correlation between the date and the ID column.
  2. Its creation should minimally affect ongoing access on the primary table
  3. It’s size should be manageable

 

Script

Create Tracking Table – [dbo].[MailingListRequestsLogTracking]


SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO

set noexec off;
go

if object_id('[dbo].[MailingListRequestsLogTracking]') is not null
begin

	set noexec on;

end
go

CREATE TABLE [dbo].[MailingListRequestsLogTracking]
(

	  --[entryID] [int] IDENTITY(1,1) 
	  [entryID] [bigint]

	, [sentDt] [datetime] NOT NULL 
			CONSTRAINT [DF_MailingListRequestsLogTracking_sentDt]  DEFAULT (getdate())

	, CONSTRAINT [PK_MailingListRequestsLogTracking] PRIMARY KEY CLUSTERED 
	(
		[entryID] ASC
	)

) ON [PRIMARY]

set noexec off;
go

if not exists
	(
		select tblSI.*
		from   sys.indexes tblSI
		where  tblSI.object_id = object_id('[dbo].[MailingListRequestsLogTracking]') 
		and    tblSI.name = 'INDX_SentDt_EntryID'
	)
begin

	create index [INDX_SentDt_EntryID]
	on [dbo].[MailingListRequestsLogTracking]
	(
		  [sentDt]
		, [entryID]
	)

end
go

Diagram

Here is what our existing table, dbo.MailingListRequestLog, looks like.

And, on the right side is our tracking table  dbo.MailingListRequestLogTracking.

MailingListRequestsLog

 

Population Script




SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO

IF NOT EXISTS 
(
	SELECT * 
	FROM   sys.schemas
	WHERE  schema_id = schema_ID('dbmgmt') 
)
BEGIN

	EXEC dbo.sp_executesql @statement = N'CREATE SCHEMA [dbmgmt] authorization [dbo]' 

END
GO


IF NOT EXISTS 
(
	SELECT * 
	FROM   sys.objects 
	WHERE  object_id = OBJECT_ID(N'[dbmgmt].[usp_MailingListRequestsLogTracking_Populate]') 
	AND    type in (N'P', N'PC')
)
BEGIN

	EXEC dbo.sp_executesql @statement = N'CREATE PROCEDURE [dbmgmt].[usp_MailingListRequestsLogTracking_Populate] AS' 

END
GO


ALTER procedure [dbmgmt].[usp_MailingListRequestsLogTracking_Populate]
as

begin

	set transaction isolation level read uncommitted;
	set nocount on;
	set XACT_ABORT on;

	declare @CHAR_TAB varchar(30)

	declare @iNumberofRecordsToSample bigint
	declare @iNumberofRecordsAffected bigint
	declare @strLog	varchar(600)

	set @CHAR_TAB = char(9)

	set @iNumberofRecordsToSample = 1000

	truncate table [dbo].[MailingListRequestsLogTracking];

	; with cteTop
	(
		  [entryID]
		, [sentDt]

	)
	as
	(
		select top 1

				  tblMLRLT.[entryID]

				, tblMLRLT.[sentDt]

		from   [dbo].[MailingListRequestsLog] tblMLRLT 

		order by [entryID] asc
	)
	, cteBottom
	(
		  [entryID]
		, [sentDt]

	)
	as
	(
		select top 1

				  tblMLRLT.[entryID]

				, tblMLRLT.[sentDt]

		from   [dbo].[MailingListRequestsLog] tblMLRLT 

		order by [entryID] desc
	)

	, cteSample
	(
		  [entryID]
		, [sentDt]

	)
	as
	(
		select 

				  tblMLRLT.[entryID]

				, tblMLRLT.[sentDt]

		from   [dbo].[MailingListRequestsLog] tblMLRLT

					TABLESAMPLE (  1000000 ROWS)

					with ( NOLOCK )

	)

	insert into [dbo].[MailingListRequestsLogTracking]
	(
		  [entryID]

		, [sentDt]
	)

	select 

		  [entryID]
		, [sentDt]

	from   cteTop

	union

	select 

			 [entryID]

			, [sentDt]

	from   cteSample

	UNION

	select 
		  [entryID]
		, [sentDt]

	from   cteBottom


	set @iNumberofRecordsAffected = @@rowcount

	set @strLog=
					@CHAR_TAB
					+ cast(@iNumberofRecordsAffected as varchar(10))
					+ ' record(s) affected'

	print @strLog

end
go


 

Quick Explanation
  1. Common Table Expression
    • cteTop
      • In cteTop, we get the earliest record
    • cteBottom
      • In cteBottom, we get the latest record
    • cteSample
      • In cteSample, we invoke SQL Server’s TABLESAMPLE clause to sample a million record

 

Pros & Cons

Cons

  1. TableSample is expensive
  2. Correlative with your sample size, you still get to trash your data cache

Statistics

Session

  1. On our DR Server which effectively does not have any data cached
    • It took 5.46 minutes to get 1 million records
Statistics IO

StatisticsIO

Explanation
  1. Though we asked for a million records, we only got back 999,330 records
    • Please plan ahead and pass in more records than you actually need

 

Index Size

Let us use sp_helpindexinfo to get index stats.


exec [dbo].[sp_helpindexInfo]
		  'dbo'
		, 'MailingListRequestsLog'

exec [dbo].[sp_helpindexInfo]
		  'dbo'
		, 'MailingListRequestsLogTracking'

Output
Graphical

IndexSize

Tabulated

 

Table Index Number of Records Size (MB)
dbo.MailingListRequestsLog
PK_MailingListRequestsLog  204,848,859  280,135
 MailingListRequestsLogTracking
PK_MailingListRequestsLogTracking 999330  24
INDX_SentDt_EntryID 999330 18

 

Summary

With a bit of reasoned thoughts, vexing problems are a bit more approachable in cases where sampled data is sufficient.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s