Microsoft – SQL Server – Compression?

One of the biggest drags in Very Large Databases (VLDBs) is I/O throughput.  I/O throughput is constrained by the underlying storage sub-system's throughput, the network pipe between the hosts and the SAN (in our case), and the host's memory headroom.

If you have already done a lot of tuning and your wait statistics still lead down into the storage wilderness, then it might be time to look at data compression.

What are the requirements?

  • MS SQL Server 2008 and later (v2008, v2008/R2, v2012)
  • Enterprise Edition
  • Think hard about the requirement for “Enterprise Edition”, as once you employ compression that database can no longer be restored on a down-level edition (Standard, Desktop, etc.); a quick check is sketched below
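
A quick, hedged way to check both points: SERVERPROPERTY reports the edition, and the sys.dm_db_persisted_sku_features DMV lists Enterprise-only features (such as Compression or Partitioning) already in use in the current database that would block a restore onto a lower edition.  A minimal sketch:

-- Which edition is this instance running?
SELECT SERVERPROPERTY('Edition') AS [Edition];

-- Which Enterprise-only features are persisted in the current database
-- and would prevent restoring it on a down-level edition?
SELECT feature_name
     , feature_id
FROM sys.dm_db_persisted_sku_features;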

When should you consider data compression?

  • If you have data that repeats and that you think might benefit from coalescing
  • If your tables are not very well normalized and you do have cross-referenced key columns

What are the benefits of data compression?

  • In the storage-subsystem the data is compressed, and thus you will be moving less data from your storage sub-system to the hosts
  • You will also be writing less data, as data is compressed before being written to the storage sub-system
  • In memory, the data is kept as compressed data, and so your memory footprint per record should be a bit less

In summary, you will have less data at rest on the storage sub-system, less data in memory, and less data in transit.

To me, if there is a selling point to high-end database systems, and in this case I am including MS SQL Server, it is their inherent flexibility in configuration.  They are very general-purpose workhorses.

So when you think of Data Compression in SQL Server, do not feel boxed in.  You have a wealth of choices.  Those choices include:

  1. You are able to apply various compression choices at a very detailed level – indexes.  That is, you do not make the compression decision at the database or table level, but at individual index granularity
  2. You are able to choose between page-level and row-level compression
  3. You are able to apply your compression choice and later un-apply it, or switch to another compression choice (a sketch follows this list)
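
As a minimal sketch of choice 3, and using the same hypothetical index and table names that appear in the examples later in this post (idx_Customer_Name on dbo.customer), compression is applied and later removed via an index rebuild:

-- Apply row compression to a single index
ALTER INDEX idx_Customer_Name
ON [dbo].[customer]
REBUILD WITH (DATA_COMPRESSION = ROW);

-- Later: back it out (or switch to PAGE) with another rebuild
ALTER INDEX idx_Customer_Name
ON [dbo].[customer]
REBUILD WITH (DATA_COMPRESSION = NONE);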

Knowing that we have a choice as to how we want to dip into the proverbial water, we can review which objects and indexes matter the most.

Identify what is churning in your memory:

Glenn Berry has a really good lead-in.  As you think about whether compression is worth it, first determine what is in your buffer cache.

The truth is you might be surprised… User queries change all the time; an index that you added for John L. Smith in Accounting a while ago might now be heavily used, and indexes that are needed might not be present, so you are instead bringing in the entire table…

So look again at your memory space and try to account for each table/index combination.

Here is Glenn Berry’s query:

Some Data Compression Commands for SQL Server 2008 and 2008 R2
http://sqlserverperformance.wordpress.com/2010/05/17/some-data-compression-commands-for-sql-server-2008-and-2008-r2/


/*
	-- Breaks down buffers used by the current database by
	-- object (table, index) in the buffer cache
*/
SELECT
       [object]          = OBJECT_SCHEMA_NAME(p.[object_id])
                           + '.'
                           + OBJECT_NAME(p.[object_id])
     , tblIndex.name
     , p.index_id
     , IndexType         = MAX(tblIndex.type_desc)
     , COUNT(*) / 128    AS [Buffer size(MB)]
     , COUNT(*)          AS [BufferCount]
     , [CompressionType] = p.data_compression_desc
FROM sys.allocation_units AS a
INNER JOIN sys.dm_os_buffer_descriptors AS b
     ON a.allocation_unit_id = b.allocation_unit_id
INNER JOIN sys.partitions AS p
     ON a.container_id = p.hobt_id
INNER JOIN sys.indexes AS tblIndex
     ON  p.object_id = tblIndex.object_id
     AND p.index_id  = tblIndex.index_id
WHERE b.database_id = DB_ID()
  AND p.[object_id] > 100
GROUP BY
       p.[object_id]
     , tblIndex.name
     , p.index_id
     , p.data_compression_desc
ORDER BY [BufferCount] DESC

 

Review your Application workload

Microsoft has done a commendable job in its published white-paper:

Data Compression: Strategy, Capacity Planning and Best Practices
http://msdn.microsoft.com/en-us/library/dd894051.aspx

Among other things, MSFT listed out when compression and decompression actually occurs and did a good comparative analysis on various workloads;  thus one can actually better determine which queries and access patterns will likely benefit from compression.

As an aside, anytime I see Sunil Agarwal’s name on a document, I know it is top quality.

Here are some published papers by Sunil Agarwal:
http://blogs.msdn.com/b/sqlserverstorageengine/

Back to MSFT White-paper – Here is my read:

In general:

  • When compressed, data at rest is obviously compressed
  • The data is kept compressed in memory
  • It rests on the SQL Engine to determine which data to decompress; generally that occurs when filtering is first applied, or toward a query’s completion when data is being grouped and sorted
  • To me, one has to be pretty good at reading query plans and at understanding where your cost is; that leads to identifying which tables and indexes are involved in the costly operations
  • In essence, look at your access patterns (query plans) and try to see which operations are Seeks and which ones are Scans.  Scans, as they typically involve churning through quite a bit of data, will probably benefit from compression.  Compression indirectly offers you a bigger I/O: you can bring in more compressed data than un-compressed data per read
  • Because MSFT’s design and implementation of compression is so very granular and is targeted at individual indexes and partitions, one has a lot of choices.  You can compress indexes that are used during scan operations and leave indexes employed during seeks un-compressed
  • Again, understand your queries and try to use persisted computed columns and aggregated tables rather than pay a price during each retrieval.  It is often better to pay that price once, when the record is initially persisted
  • Along the same lines, your Database Engine is not where you want to perform complex query analysis.  It is better to do so on your Application or Analysis Servers, or whatever you call them.  For one thing, you can distribute that load across many machines, and you actually have better targeted tools to profile and tune them
  • Look at your CPU utilization.  If it is over 40%, then you probably do not want to introduce compression, as compression is slanted towards CPU.  And, when CPU is high, make sure that you do not have “Power Savings” enabled.  If there are boxes where you do not want Power Savings, it is your DB and VM hosts.  (A query that samples recent CPU utilization is sketched after this list.)
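
Here is a minimal sketch of that CPU sampling from inside SQL Server, using the commonly published scheduler-monitor ring buffer pattern; treat it as a starting point rather than a definitive monitor:

-- Recent CPU utilization history (roughly one sample per minute).
-- sql_cpu_pct is the SQL Server process; the rest belongs to other processes or idle time.
SELECT TOP (30)
       record.value('(./Record/@id)[1]', 'int')                                                   AS record_id
     , record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct
     , record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')         AS system_idle_pct
     , [timestamp]
FROM (
        SELECT [timestamp], CONVERT(xml, record) AS record
        FROM sys.dm_os_ring_buffers
        WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
          AND record LIKE N'%<SystemHealth>%'
     ) AS tblRingBuffer
ORDER BY record_id DESC;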

And, more precisely, this is where Microsoft nailed it.  They talk about your U and S.

Here is the U:

  • The percentage of U, that is, the percentage of Updates.  The lower this number, the more likely compression should be considered
  • If you have a lot of updates, then you are going back and changing data, and thus re-introducing the need for reads (seeks) and for re-compression of the changed pages
  • These days, I rarely see an application that actually performs a high ratio of updates.  Mostly I see a lot of Appends

Here is the S:

  • The percentage of S.  If Scans make up a large share of all access to the object, then you will probably benefit from compression

Here are queries nicely prepared in MSFT’s white paper:

U: Percent of Update Operations on the Object


SELECT
           o.name AS [Table_Name]
         , x.name AS [Index_Name]
         , i.partition_number AS [Partition]
         , i.index_id AS [Index_ID]
         , x.type_desc AS [Index_Type]
         ,
             i.leaf_update_count * 100.0 /
                (i.range_scan_count + i.leaf_insert_count
                  + i.leaf_delete_count + i.leaf_update_count
                  + i.leaf_page_merge_count + i.singleton_lookup_count
                ) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
       JOIN sys.objects o
              ON o.object_id = i.object_id
       JOIN sys.indexes x
              ON x.object_id = i.object_id
              AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + i.leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
AND o.name = @objectName
AND x.name = @indexName
ORDER BY [Percent_Update] ASC

S: Percent of Scan and Seek Operations on the Object


SELECT
          o.name AS [Table_Name]
         , x.name AS [Index_Name]
         , i.partition_number AS [Partition]
         , i.index_id AS [Index_ID]
         , x.type_desc AS [Index_Type]
         , (i.range_scan_count * 100.0) /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count
           ) AS [Percent_Scan]

         , (i.singleton_lookup_count * 100.0) /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count
           ) AS [Percent_Seek]

FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
       JOIN sys.objects o
              ON o.object_id = i.object_id
       JOIN sys.indexes x
             ON x.object_id = i.object_id
             AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + i.leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
AND o.name = @objectName
AND x.name = @indexName
ORDER BY [Percent_Scan] DESC

What are the different types of compression?

There are two different types of compression, page and row.  Here is how they differ:

Compression -> Row

  • Row compression works at the individual record level.  It looks at each record independently and goes against some of what we assumed was hard-wired into MS SQL Server data storage
  • Ross Mistry has a good short article at http://www.informit.com/articles/article.aspx?p=1946159&seqNum=6 that made row-level compression easy for me to understand.  The example given is a column defined as StreetAddress char(2000): doing so allocates 2000 bytes for each StreetAddress irrespective of the actual data contents, but row-level compression “saves you” in that only the bytes actually needed are used (a small sketch follows this list)
  • From the same article, I was able to determine that the data is not actually compressed; un-needed bytes are simply not taken up
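
Here is a minimal sketch of that char(2000) example; the table and column names are hypothetical, and the point is only that row compression is declared on the object while the fixed-width storage becomes effectively variable-width:

-- Without compression, every row reserves the full 2000 bytes for StreetAddress;
-- with ROW compression only the bytes actually used are stored.
CREATE TABLE dbo.AddressDemo
(
      AddressID       int        NOT NULL
    , StreetAddress   char(2000) NOT NULL
)
WITH (DATA_COMPRESSION = ROW);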

Compression -> Page

  • Actual compression based on the data page’s contents (page compression layers column-prefix and dictionary compression on top of row compression)
  • You get row-level compression as well

 

How much cost savings from compression?


-- Estimate data compression difference on clustered index
-- if compression type is NONE
EXEC sp_estimate_data_compression_savings
         'SchemaName'
        , 'TableName'
        , 1
        , NULL
        , 'NONE'
     ;

-- Estimate data compression difference on clustered index
-- if compression type is ROW
EXEC sp_estimate_data_compression_savings
           'SchemaName'
         , 'TableName'
         , 1
         , NULL
         , 'ROW'
    ;

-- Estimate data compression difference on clustered index
-- if compression type is PAGE
EXEC sp_estimate_data_compression_savings
            'SchemaName'
          , 'TableName'
          , 1
          , NULL
          , 'PAGE'
        ;
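
To turn those result sets into a single savings percentage, one approach (a sketch; the temp-table column names are shortened, but the column count and order follow the procedure’s documented result set) is to capture the output and compute the delta:

-- Capture the estimate so the savings percentage can be computed
CREATE TABLE #estimate
(
      object_name           sysname
    , schema_name           sysname
    , index_id              int
    , partition_number      int
    , size_current_kb       bigint
    , size_requested_kb     bigint
    , sample_current_kb     bigint
    , sample_requested_kb   bigint
);

INSERT INTO #estimate
EXEC sp_estimate_data_compression_savings 'SchemaName', 'TableName', 1, NULL, 'PAGE';

SELECT object_name
     , partition_number
     , size_current_kb
     , size_requested_kb
     , CAST(100.0 * (size_current_kb - size_requested_kb)
            / NULLIF(size_current_kb, 0) AS decimal(5, 1)) AS est_savings_pct
FROM #estimate;

DROP TABLE #estimate;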

Pasted below are the estimated savings, depending on whether we employ Row or Page compression:

Row Compression Savings?

CompressionSavings - Row - 20130316

Page Compression Savings?

CompressionSavings - Page - 20130316

Key points:

  • Because our table is already partitioned we have a lot more sample data, as savings are calculated for each partition in a table
  • On average, it appears that we will reap about 50% savings from row-level compression
  • On the other hand, for page-level compression, based on representative data, the compressed size will be about 15% of the un-compressed size; thus we will reap savings upwards of 85%

Actual Compression Steps:

Compress – Index (Clustered and Non-Clustered)

In the example below, we are employing page-level compression on a partitioned table. Here are the other characteristics of this particular processing:

  • Sorting in tempdb, as this is an Enterprise Edition instance; sorting in tempdb is faster for us since we have tempdb on fast storage
  • Make sure to repeat this step for all partitions in the table (a single-statement alternative is sketched after the example below)

--Syntax
Alter Index indexName
On [schema-name].[object_name]
Rebuild Partition = [partitionNumber]
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
) 

--Actual
Alter Index idx_Customer_Name
on [dbo].[customer]
Rebuild Partition = 1
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
)
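
If you would rather not loop over the partitions by hand, the rebuild can also be issued once for all partitions, with the compression setting scoped to specific partition numbers.  A sketch, using the same hypothetical index as above (the partition range is illustrative):

--Alternative: one statement for all partitions
Alter Index idx_Customer_Name
on [dbo].[customer]
Rebuild Partition = ALL
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE ON PARTITIONS (1 TO 4)
)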

Compress – Index (Clustered and Non-Clustered)

Page-level compression applied to an un-partitioned index.


--Syntax
Alter Index indexName
On [schema-name].[object_name]
Rebuild
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
) 

--Actual
Alter Index idx_Customer_Name
on [dbo].[customer]
Rebuild
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
)

Compress – Heap

In cases where we do not have a clustered index but a heap, you can apply compression to the table directly.


--Syntax
Alter Table [schema-name].[object_name]
Rebuild Partition = [partitionNumber]
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
) 

--Actual
Alter Table [dbo].[customer]
Rebuild Partition = 1
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
)

Compress – Heap

In cases where your heap is not partitioned, leave out the partition option.


--Syntax
Alter Table [schema-name].[object_name]
Rebuild
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
) 

--Actual
Alter Table [dbo].[customer]
Rebuild
with
(
     SORT_IN_TEMPDB = ON
   , DATA_COMPRESSION = PAGE
)

Preparation:

  • Ensure that Database Mirroring is set to high performance (asynchronous); you really do not want to pay the price of high safety during the rebuilds (a quick check is sketched after this list)
  • Dedicate fast IO and abundant storage for Log Space
  • Dedicate fast IO for TempDB
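
A minimal sketch of that mirroring check; it assumes the database is mirrored, and the ALTER DATABASE line (which flips the [dbHR] sample database used later in this post to asynchronous mode) is left commented out on purpose:

-- Current mirroring safety level per mirrored database
SELECT d.name
     , m.mirroring_safety_level_desc
FROM sys.database_mirroring AS m
INNER JOIN sys.databases AS d
     ON d.database_id = m.database_id
WHERE m.mirroring_guid IS NOT NULL;

-- Switch to high performance (asynchronous) for the duration of the rebuilds
-- ALTER DATABASE [dbHR] SET PARTNER SAFETY OFF;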

What should you expect to see:

Statistics Gathering 


SELECT StatMan([SC0], [SB0000])
FROM (
	SELECT TOP 100 PERCENT
                   [SC0]
   	         , step_direction([SC0]) over (order by NULL) AS [SB0000]
	FROM (
                 SELECT [column1] AS [SC0]
      	         FROM [dbo].[object-name]
	         TABLESAMPLE SYSTEM ([percentile] PERCENT) WITH (INDEX(1),READUNCOMMITTED)
 	          WHERE $PARTITION.partitionFunction([partitionFunction]) = 21
             )  AS _MS_UPDSTATS_TBL_HELPER
        ORDER BY [SC0], [SB0000]
     ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 1)

  • The SQL Engine gathers statistics
  • As always, one learns a lot when one is able to look under the hood a bit
  • In our case, I am able to confirm that we have chosen the wrong first column for our partitioned index; more on that in a later post
  • The table hint INDEX(1), READUNCOMMITTED is also new to me; I am used to writing NOLOCK rather than the (presumably more ANSI-flavored) READUNCOMMITTED
  • Usage of MAXDOP 1: asking for no parallelism and thus reducing the likelihood that we will starve the system of resources

“select *, %%bmk%% from tablename” while rebuilding the index

  • This is one of the signature statements that occur when an index is being rebuilt, as is the case when turning on compression

SQL Server DBA Diaries
http://www.sqldbadiaries.com/2010/09/06/select-bmk-from-tablename-while-rebuilding-the-index/
 

Quite a bit of session blocking

  • Index rebuilding needs a bit of locking, such as schema locks (seen as LCK_M_SCH_S waits)

 

Performance Profiling


Performance Profiling – Dynamic Management Views

To track compression at individual index level you can use the sys.dm_db_index_operational_stats DMV.

sys.dm_db_index_operational_stats

Once you rebuild your indexes, the existing data is compressed.  To track ongoing compression attempts and successes, you can query the sys.dm_db_index_operational_stats view.

The column names are:

  • page_compression_attempt_count
  • page_compression_success_count

use [dbHR];

declare @dbid     int
declare @objectid int

set @dbid     = DB_ID()
set @objectid = OBJECT_ID('dbo.employee')

select
      OBJECT_NAME(tblOperationalStats.object_id) as objectName
    , tblIndex.name
    , tblOperationalStats.partition_number
    , tblOperationalStats.page_compression_attempt_count
    , tblOperationalStats.page_compression_success_count
from sys.dm_db_index_operational_stats(@dbid, @objectid, null, null)
         tblOperationalStats
    inner join sys.indexes tblIndex
        on  tblOperationalStats.object_id = tblIndex.object_id
        and tblOperationalStats.index_id  = tblIndex.index_id
order by
      tblOperationalStats.page_compression_success_count desc
    , tblOperationalStats.page_compression_attempt_count desc

Here is the result:

CompressionStats - Dynamic Management View (v2)

I was surprised to see that our page_compression_success_count is very low relative to our page_compression_attempt_count.

But, luckily Jonathan Kehayias blogged about this:

An xevent a day 28 of 31 – Tracking Page compression operations
http://www.sqlskills.com/blogs/jonathan/an-xevent-a-day-28-of-31-tracking-page-compression-operations/

The sqlserver.page_compression_attempt_failed is really poorly named, and doesn’t provide information about failures in the sense that something broke, but provides information for why a page compression attempt did not actually change the compression of the data in the page.

The PageModCountBelowThreshold failure isn’t really a failed attempt at compression.  This reason shows that the page was evaluated for recalculation, and the modified counter for the page hadn’t passed the internal threshold for recalculation so the actual compression operation wasn’t performed.

So based on Jonathan’s public commentary, it appears that MSFT rightfully considers whether compression is actually needed and beneficial, and only if so notches up the page_compression_success_count.

And, if not, the page_compression_success_count is left as is.

This makes more sense once one considers that overall compression has already occurred during the index rebuild operation and that this current tally is just ongoing compression consideration.

 

Performance Profiling – OS (MS Windows) – Performance Monitor

At the MS SQL Server instance level, one can get a high-level view of how frequently page compression is attempted and actually fulfilled.

  • SQLServer:Access Methods –> Page compression attempts/sec
  • SQLServer:Access Methods –> Pages compressed/sec

 

Perfmon - SQL Server - Access Methods
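
If you prefer to pull those counters from inside SQL Server rather than PerfMon, the same values are exposed through sys.dm_os_performance_counters; note that these “per second” counters are cumulative, so sample twice and diff to get a rate.  A sketch:

-- Instance-wide page compression counters (cumulative since instance start)
SELECT RTRIM([object_name]) AS [object_name]
     , RTRIM(counter_name)  AS counter_name
     , cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Access Methods%'
  AND counter_name LIKE 'Page%compress%';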

What to keep in mind

What to keep in mind – Table

  • We have concentrated on index compression so far, and really have not said much about actual table compression.  If you have a clustered index, then you can concentrate solely on the clustered index, as the clustered index is representative of the data (the clustered index is index_id = 1)
  • On the other hand, if you have a heap, then apply compression to the table itself and look for index_id = 0

Processing Order

  • If you have chosen to compress more than one index on a table and your candidate indexes include the clustered index, then apply compression to the clustered index first, as all indexes are rebuilt whenever the clustered index is rebuilt (a verification query is sketched below)
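
Once the rebuilds are done, a quick way to confirm what actually got compressed is to read the per-partition setting straight out of sys.partitions; a sketch against the same hypothetical dbo.customer table used earlier:

-- Verify the compression setting per index and partition
SELECT OBJECT_NAME(p.object_id)   AS table_name
     , tblIndex.name              AS index_name
     , p.index_id
     , p.partition_number
     , p.data_compression_desc
FROM sys.partitions AS p
INNER JOIN sys.indexes AS tblIndex
     ON  tblIndex.object_id = p.object_id
     AND tblIndex.index_id  = p.index_id
WHERE p.object_id = OBJECT_ID('dbo.customer')
ORDER BY p.index_id, p.partition_number;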


