Question:

How does the query optimizer choose a plan for my SQL query?

Answer:

First, a disclaimer: don’t run these commands on any production or unit test server – EVER.  These scripts are meant for training, so DO NOT package them up and run them in production!

The optimizer in SQL Server is a cost-based optimizer. That is, the optimizer assigns a cost to each candidate plan generated within a period of time, and chooses the lowest-cost plan for execution.  The optimizer uses many steps, and many pieces of information, to do this.

Thinking like the optimizer:

    • Does the query qualify for a Trivial plan or is Full optimization necessary?
    • Does the underlying table have a Clustered index? Or is it a Heap?
    • Are the columns contained in the Where predicate indexed?
    • Is the predicate sargable?
    • Do the underlying Index[es] have current statistics?
    • Are there joins involved in the query?
    • Is there an Order By clause in the query?
    • Are aggregates being used in the WHERE predicate?
    • …and others

The query optimizer’s job is to find the best overall plan in a given amount of time.  Since queries with multiple joins can have thousands of potential candidate plans, generating a “perfect” plan would be prohibitively expensive in both time and memory.  Many times the optimizer will stop the optimization process early with a “good enough plan found.”
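
If you are curious how often each of these outcomes happens on your instance, the sys.dm_exec_query_optimizer_info DMV exposes cumulative optimizer counters, including ones for trivial plans, the exploration stages, and the “good enough plan found” timeout. Here is a quick sketch (it requires the VIEW SERVER STATE permission):

-- Cumulative optimizer counters since the last instance restart.
-- 'trivial plan' counts trivial-plan shortcuts; 'timeout' counts
-- optimizations that ended early with "good enough plan found."
SELECT counter, occurrence, value
FROM sys.dm_exec_query_optimizer_info
WHERE counter IN ('optimizations', 'trivial plan',
                  'search 0', 'search 1', 'search 2', 'timeout');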

Your SQL statement goes through four stages on its way to being executed:

Parse -> Bind -> Optimize -> Execute

Parsing: the SQL statement is checked for correct syntax, spelling, and keyword usage.  A parse tree is returned as a result of this operation.

Binding: the parsed tree is then checked against database metadata to see if columns exist, and if the query is semantically correct.

Optimize: once the query tree is validated, it is passed to the optimizer, which will either produce a trivial plan or perform full optimization, where several alternate plans are generated and costed. Full optimization moves through distinct phases: simplification, trivial plan check, auto-stats loading, exploration (in up to three stages), and conversion to an executable plan.

Execution: the plan the optimizer chose is handed to the execution engine and run.

Let’s take a look at each bullet point above, and explain a few basic actions of the optimizer with examples.

Trivial plan or Full optimization?

One of the first operations of the optimization process is to determine whether the query qualifies for a Trivial plan.  A trivial plan is one where there really isn’t any other reasonable way to execute the query – like:

SELECT * FROM SOME_TABLE;

The optimizer will go out and scan the table and return the results to the client. Even a query like this, against a Heap (table without a clustered index), results in a trivial plan:

SELECT CT.CustomerID
     , CT.AccountNumber
FROM DBO.CUSTOMER_TEST CT
WHERE CT.CustomerID = 10;

As the query becomes more complex, with table joins or complex Where predicates, the optimizer may move on to full optimization.  Defining each step of full optimization is beyond the scope of this blog.  I will, however, touch on a couple of the points involved.
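
One way to see which path a given query took is the StatementOptmLevel attribute in the plan XML, which records either TRIVIAL or FULL. Here is a sketch against the plan cache (again, this requires VIEW SERVER STATE – and remember the disclaimer above):

-- Pull recent cached plans; inspect the XML for StatementOptmLevel="TRIVIAL" or "FULL".
SELECT TOP (20)
       st.text AS QueryText,
       qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp;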

Does the underlying table have a Clustered Index or is it a Heap?

When a table is created, it is created as a Heap structure by default. In a nutshell, a heap is an unorganized structure where records are not stored in any particular order.  By comparison, creating a table and assigning a clustered index stores the data in a B+ tree structure.  Heaps are great if you’re doing massive inserts, or using them as log tables.  Heaps are not so good for running OLTP queries, though.  When you query a heap, SQL Server has to scan the whole table to find a particular value identified in the Where predicate.  Imagine that on a table with more than a million rows!  Here is something to remember:  Scan = BAD, Seek = Good!  At least 99% of the time this is true.

Heaps are also fragmentation magnets, which has to do with their page allocation and deallocation mechanism. So it is best not to use a heap unless you have a special need, and have tested its use extensively (that is, before you go to production!).
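
If you suspect a heap is fragmented, the sys.dm_db_index_physical_stats DMF can confirm it – including the forwarded-record count, a problem unique to heaps. A sketch for the current database (DETAILED mode reads every page, so this can be expensive on large tables):

-- Fragmentation and forwarded records for every heap (index_id = 0) in the current database.
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       ips.avg_fragmentation_in_percent,
       ips.forwarded_record_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED') AS ips
WHERE ips.index_id = 0;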

Creating a table and assigning a carefully chosen clustered index is usually the best alternative.  The clustered index stores the data in the order of the clustering key.  Additionally, in a clustered index, the leaf pages are the data.  Conversely, a non-clustered index stores a clustering key or a row ID (RID) that points to the data in a clustered index or a heap, respectively.  Following that pointer to fetch the remaining columns is known as a “bookmark lookup” (shown as a Key Lookup or RID Lookup in modern plans), and it is an expensive operation.

A final note on clustered indexes: a primary key constraint is created as a clustered index by default. You can also explicitly create the primary key as non-clustered; like any primary key, it still enforces uniqueness.  You may see this scenario when a surrogate key is chosen as the clustered index and the primary key is created as a non-clustered, unique constraint on the column, or set of columns, that identifies the business rule being applied.
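
Here is a minimal sketch of that scenario – the table and column names are made up for illustration:

-- The surrogate key carries the clustered index; the business key is a
-- non-clustered primary key (which, like any primary key, is unique).
CREATE TABLE dbo.Customer_Demo
(
    CustomerSK   INT IDENTITY(1,1) NOT NULL,
    AccountCode  VARCHAR(20)       NOT NULL,
    CustomerName VARCHAR(100)      NOT NULL,
    CONSTRAINT PK_Customer_Demo PRIMARY KEY NONCLUSTERED (AccountCode)
);

CREATE UNIQUE CLUSTERED INDEX CIX_Customer_Demo_SK
    ON dbo.Customer_Demo (CustomerSK);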

T-SQL code for finding heaps:

USE AdventureWorks2014   -- enter your database
GO
SELECT T.name AS [TableName],
       I.name AS [IndexName],
       I.type_desc AS [IndexType]
FROM sys.indexes I
INNER JOIN sys.tables T
        ON I.object_id = T.object_id
-- WHERE I.type_desc = 'HEAP'   -- uncomment to view HEAPs only
ORDER BY TableName, IndexType;
GO

Are the columns contained in the Where predicate indexed?

If not, they probably should be. When building the index on the predicate columns, place the equality predicates first. That is, given Where Column_a = b and Column_b > c, Column_a would be placed as the first, or left-most, column in the index.  If you have a predicate with several columns, SQL Server can also use a mechanism called index intersection, where multiple existing indexes are joined together to satisfy the query.
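
Continuing the Column_a/Column_b example, the index would be built like this (the table and column names are placeholders):

-- Equality column first, inequality column second.
CREATE NONCLUSTERED INDEX IDX_MyTable_ColA_ColB
    ON dbo.MyTable (Column_a, Column_b);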

A “covering index” can also be used. A covering index is built from the columns of the predicate, with the columns in the Select list added as included columns.

SQL:

SELECT COL1, COL2, COL3
FROM MyTable
WHERE COL4 = value AND COL5 = value;

CREATE INDEX IDX_Covering ON MyTable (COL4, COL5)  -- Where predicate columns
INCLUDE (COL1, COL2, COL3);                        -- Select statement columns

Since the covering index contains all the columns needed to satisfy the query, SQL Server will only access the index. There is no need to touch the [base] table, thereby reducing I/O.

Is the predicate Sargable?

When a predicate is [S]earch [ARG]ument-able, it can use an existing index to satisfy the query.  Some predicates are non-sargable because of data type conversions, or because T-SQL functions are wrapped around the predicate column.  Either one renders any existing index unusable by the optimizer.  No index = no seek operations.  Remember: seek = good, scan = bad!

Generally, these operators are sargable: =, >, <, >=, <=, BETWEEN, and LIKE 'xyz%'

These aren't: !=, !<, !>, NOT IN, NOT EXISTS, NOT LIKE, and LIKE '%xyz%'
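
A common fix is to rewrite the predicate so the indexed column stands alone, with no function wrapped around it. For example, assuming a hypothetical dbo.People table with an index on LastName:

-- Non-sargable: the function on the column hides it from the index, forcing a scan.
SELECT PersonID FROM dbo.People WHERE LEFT(LastName, 1) = 'S';

-- Sargable rewrite: the bare column allows an index seek.
SELECT PersonID FROM dbo.People WHERE LastName LIKE 'S%';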

Do the underlying indexes have current statistics?

Since SQL Server uses a cost-based optimizer, it relies heavily on current statistics when analyzing a SQL statement.  Statistics are created automatically, and SQL Server can auto-generate new statistics when existing ones qualify as out of date.  One of the steps the optimizer performs during full optimization is to load statistics.  The optimizer then has access to values such as table size, record distribution, and cardinality, which it uses to derive alternative plans during the Exploration phase of optimization.

Statistics are created automatically when the AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS options are set.  These are both “on” by default, and in most cases this is fine.  The DBA may have a particular reason for turning one, or both, of these options off.  Automatic recomputation can also be suppressed for an individual index at creation time – using a WITH clause:

CREATE NONCLUSTERED INDEX IDX_Person_ModifiedDate
ON Person.Person (ModifiedDate)
WITH (STATISTICS_NORECOMPUTE = ON);

It is best to check with your DBA concerning statistics questions for your database.
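
If you do want to check statistics age yourself, here is a sketch using sys.dm_db_stats_properties (available starting with SQL Server 2008 R2 SP2 and 2012 SP1), shown against the Person.Person table from the example above:

-- Last update time and rows modified since, for each statistic on Person.Person.
SELECT s.name AS StatName,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('Person.Person');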

Are there joins present in the query?

If so, are there indexes on the columns used in the join predicate? If not, you may see significant performance issues, specifically with I/O.  Indexes built on the foreign key columns enhance performance in the same way that creating an index on a predicate column of a regular table does.
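
As a sketch, indexing a foreign key column looks just like indexing any other predicate column. This example assumes the AdventureWorks Sales.SalesOrderDetail table, whose ProductID column references Production.Product:

-- Support joins to Production.Product by indexing the foreign key column.
CREATE NONCLUSTERED INDEX IDX_SalesOrderDetail_ProductID
    ON Sales.SalesOrderDetail (ProductID);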

Is there an Order By used in the query?

Sort operations can be expensive. One way to avoid a sort operation is to use an index with the prescribed sort order already in place.

Consider the query:

SELECT [OrderDate]
     , [PaidDate]
     , [CustomerID]
FROM [JProCo].[dbo].[SalesInvoice]
WHERE CustomerID = 20
ORDER BY PaidDate DESC;

Note that there is a primary key and a non-clustered index present on the table – SalesInvoice. No index exists on CustomerID, nor on the Order By column – PaidDate.  The plan below shows a sort (60% of the cost) and a key lookup into the clustered index.  If we build an index on the Where predicate column and the Order By column, in the specified sort order (descending), maybe we can eliminate both the sort and the key lookup.

[Execution plan screenshot: sort (60% of cost) and key lookup – no supporting index]

Here is the create index statement:

CREATE INDEX IDX_TEST_PAIDDATE
   ON DBO.SALESINVOICE (CUSTOMERID, PAIDDATE DESC);

Here is the new execution plan:

[Execution plan screenshot: non-covering index – sort eliminated, key lookup remains]

Well, the sort operator is gone, but we still have the key lookup. Let’s see if we can improve this by creating a covering index – the same as the previous index definition, but including the column OrderDate.  First, we will drop the index IDX_TEST_PAIDDATE, and recreate it as a covering index.

Here is the create index statement:

CREATE INDEX IDX_TEST_PAIDDATE
   ON DBO.SALESINVOICE (CUSTOMERID, PAIDDATE DESC)
INCLUDE (ORDERDATE);

Here is the new execution plan using the covering index:

[Execution plan screenshot: covering index – index seek only, no sort or key lookup]

That’s better! Now we have an index-only seek, with no key lookup into the base table and no sort operation.  Eureka!

Please note that the optimizer still chooses what it considers the best plan. You still have to do the work of testing, and re-testing scenarios to see if an index might help your query.

Are aggregates being used in the WHERE predicate?

Aggregate and date functions – SUM(), AVG(), YEAR() – are all great tools, but may cause lackluster performance when used under certain circumstances. If you are writing a report, or pulling data for a chart, these functions can be very helpful.  However, if you are running an OLTP application with very tight SLAs on your transactions, wrapping them around predicate columns can decrease performance.

Consider the following queries:

-- #1
SELECT SOH.SalesOrderID
     , SOH.OrderDate
FROM Sales.SalesOrderHeader SOH
WHERE OrderDate >= '2011-01-01 00:00:00'
  AND OrderDate <  '2012-01-01 00:00:00';

-- #2
SELECT SOH.SalesOrderID
     , SOH.OrderDate
FROM Sales.SalesOrderHeader SOH
WHERE YEAR(OrderDate) = 2011;

Both return the same rows, but query #2 uses the YEAR() function to retrieve records for the year 2011, while query #1 uses a range predicate to do the same thing.  As written, the execution plans are nearly identical – both use clustered index scan operations.

The optimizer will recommend a non-clustered index for query #1, using OrderDate and including the SalesOrderID column. This new index results in an execution plan with an index seek operation instead of a scan.  Query #2 receives no such recommendation, because wrapping OrderDate in the YEAR() function makes the predicate non-sargable – no index on OrderDate can help it.
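
The recommended index for query #1 would look something like this:

-- Seek on the OrderDate range; SalesOrderID is covered, so no lookup is needed.
CREATE NONCLUSTERED INDEX IDX_SOH_OrderDate
    ON Sales.SalesOrderHeader (OrderDate)
    INCLUDE (SalesOrderID);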

Summary

I’ve presented some of the processing that the query optimizer uses to select an execution plan. Query tuning is a very broad topic, and I have merely scratched the surface here.  Hopefully this gives you some insight into your own queries, and helps you eliminate some low-hanging fruit you may have right now.

Many things can affect performance: network issues, disk latency, waits (locking and latching), memory pressure, etc. Knowing how to diagnose the problem quickly is key, so you must know what to rule out during the troubleshooting process.

I’ll be writing more blogs on optimization, wait stats, and locking and blocking, with corresponding T-SQL to assist you in finding these performance bottlenecks.

Knowledge is Power!