SQL Server – Query Plan – When Sorts precede “Nested Loop” & “Merge” Joins

Background

Reviewing  a Query Plan and saw a Sort Operator standing between an Index Seek and the Inner Join that joins that table with another table.

Let us find out why.

 

Query Plan

Here is what the Query Plan looks like:

Nested Loops

QueryPlan-20160621-0703PM

Merge Join

MergeJoin-BigPicture

The Query

Here is what the Query looks like:


        declare @dateBegin datetime
        declare @dateEnd datetime

        set @dateBegin = '1/1/2015' 
        set @dateEnd   = '1/15/2015'

	select 
			  sp.student_int_id
			, sp.ref_int_id
			, pd.productCD
			, pd.productID
			, pa.payment_date
			, pa.amount revenue

	from [dbo].StudentProduct sp

	join [dbo].ProductDetails pd 
		on  sp.schoolID   = pd.schoolID 
		and sp.productID  = pd.productID

	join [dbo].PaymentAccount pa 
		on  sp.student_int_id = pa.student_int_id 
		and sp.ref_int_id    = pa.ref_int_id


	and pa.payment_code not in
		(
			  'GIFTCARD'
		)

	where pa.payment_date between @dateBegin and @dateEnd
	


 

Query Plan – Indexed Used

Table – dbo.PaymentAccount

As we are filtering on dates…

Tabular

IndexSeek

Image

IndexSeek-Image-Cropped

 

Explanation

Here are the main points

  1. Seek Predicates
    • Begin and End Dates (@dateBegin and @dateEnd)
  2. ( Additional ) predicates
    • payment_code
      • GIFTCARD

 

Table – dbo.StudentProduct

Tabular

IndexSeek-Tabular

Image

IndexSeek-Image (cropped)


Indexed Definition

dbo.PaymentAccount

dbo-PaymentAccount

 

Table – dbo.StudentProduct

dbo-StudentProduct

 

Query Plan – Join – Merge Join

Tabulated

MergeJoin-Tabulated

 

Image

MergeJoin-Image (Cropped)

 

Explanation

We are joining on two columns ( student_int_id and ref_int_id).

 

Facts Check

Nested Loop and Merge Joins require that the source tables be sorted along the lines of the columns being joined on.

In our sample, the index used for accessing the dbo.StudentProduct is PK_StudentProduct.  PK_StudentProduct is the primary key and it is a composite combination of two columns (student_int_id and ref_int_id).

On the other hand, the index used for accessing dbo.PaymentAccount is INDX_DBA_PaymentDate_PaymentCode and it is anchored  on PaymentDate and PaymentCode.

That index was chosen due to us filtering on payment date and paymentCode.

We are joining on student_int_id and ref_int_id, but unfortunately the dbo.PaymentAccount table returns us data based on a different column combination, the data that we will be judged unsorted and unsuitable for our subsequent Nested Loops \ Merge Join operations.

Prior to usage ( by the Join operator), that data will have to be sorted.

 

What does Microsoft Have to Say?

Here is a worthy write-up:

Join Type Level
Nested Loop Join Link
The nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in the graphical execution plan) and one as the inner (bottom) input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.
In the simplest case, the search scans an entire table or index; this is called a naive nested loops join. If the search exploits an index, it is called an index nested loops join. If the index is built as part of the query plan (and destroyed upon completion of the query), it is called a temporary index nested loops join. All these variants are considered by the query optimizer.
A nested loops join is particularly effective if the outer input is small and the inner input is preindexed and large. In many small transactions, such as those affecting only a small set of rows, index nested loops joins are superior to both merge joins and hash joins. In large queries, however, nested loops joins are often not the optimal choice.
Merge Join Link
The merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. The query optimizer typically scans an index, if one exists on the proper set of columns, or it places a sort operator below the merge join.
Merge join itself is very fast, but it can be an expensive choice if sort operations are required. However, if the data volume is large and the desired data can be obtained presorted from existing B-tree indexes, merge join is often the fastest available join algorithm.
 Hash Join Link
Hash joins can efficiently process large, unsorted, nonindexed inputs. They are useful for intermediate results in complex queries because:
    Intermediate results are not indexed (unless explicitly saved to disk and then indexed) and often are not suitably sorted for the next operation in the query
plan.
    Query optimizers estimate only intermediate result sizes. Because estimates can be very inaccurate for complex queries, algorithms to process intermediate
results not only must be efficient, but also must degrade gracefully if an intermediate result turns out to be much larger than anticipated.

 

 

 

 

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s