Python – SQL Server & PyODBC Connectivity

Python and SQL Server using PyODBC

Connecting to SQL Server with Python is easier than you might expect.  With the PyODBC module, you can create
an ODBC connection to your SQL Server instance.

First you need access to the PyODBC module, which ships with certain Python distributions.  If you use the
Anaconda core package, you should have it by default: https://docs.continuum.io/anaconda/pkg-docs.html
You can also get it from its creator, Michael Kleehammer, on GitHub:  https://github.com/mkleehammer/pyodbc

For this example code, I’m using an Intel-based computer running Windows 10 and SQL Server 2016 Developer Edition.

The steps are straightforward enough:

1. import the pyodbc module
2. create a connection object – with a connection string
3. create a cursor object based on the connection
4. create a string to form a SQL DML command:  (Select, Insert, Update, Delete)
5. use the cursor object execute method to execute the command string
6. use one of the cursor fetch statements to pull the data:  fetchone( ), fetchall( ), or fetchmany( )
7. iterate through the data using a loop – for or while
8. close the cursor and the connection

*Noteworthy:  This is a simple example highlighting the connection and querying of a SQL Server database.  You would normally enclose this code in try…except logic with explicit error handling in a “real” work environment.

I’ve coded a basic example of the steps outlined above.  Again, as in other Python examples, I’ve copied
screen output from my Jupyter notebook.  I’ve created a TestDB database on SQL Server, with a test table
to use with this demo.  I’ll be using Windows Authentication to access SQL Server.

SQL table creation:
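
The original screenshot isn’t reproduced here, so the sketch below is a hypothetical stand-in for the demo objects – the database name (TestDB) comes from the post, but the table and column names (dbo.TestTable, FirstName, LastName) are made up for illustration.

CREATE DATABASE TestDB;
GO
USE TestDB;
GO
CREATE TABLE dbo.TestTable
(
    TestID    INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key for the demo rows
    FirstName VARCHAR(50),
    LastName  VARCHAR(50)
);
GO
INSERT INTO dbo.TestTable (FirstName, LastName)
VALUES ('John', 'Smith'), ('Jane', 'Doe');
GO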

PyODBC import, connection build, query execution, and displaying results:
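
In place of the notebook screenshot, here is a minimal Python sketch of steps 1–8, assuming Windows Authentication and the hypothetical TestDB/dbo.TestTable objects above; the ODBC driver name and server name will vary with your installation.

# 1. import the pyodbc module
import pyodbc

# 2. create a connection object with a connection string (Windows Authentication)
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 13 for SQL Server};"
    "SERVER=localhost;"
    "DATABASE=TestDB;"
    "Trusted_Connection=yes;"
)

# 3. create a cursor object based on the connection
cursor = conn.cursor()

# 4-5. form a DML command string and execute it
sql = "SELECT FirstName, LastName FROM dbo.TestTable;"
cursor.execute(sql)

# 6-7. fetch the rows and iterate through them
for row in cursor.fetchall():
    print(row.FirstName, row.LastName)

# 8. close the cursor and the connection
cursor.close()
conn.close()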

SQL Statistics (keys, tables, etc.):
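
The metadata screenshots can be approximated with pyodbc’s cursor catalog methods (thin wrappers around the ODBC catalog functions); a rough sketch, reusing an open connection and cursor like the ones in the previous example (before they are closed), against the hypothetical dbo.TestTable:

# list the user tables in the current database
for t in cursor.tables(schema="dbo", tableType="TABLE"):
    print(t.table_name)

# index/statistics information for one table (ODBC SQLStatistics)
for s in cursor.statistics("TestTable", schema="dbo"):
    print(s.index_name, s.column_name)

# primary key columns for the same table
for pk in cursor.primaryKeys("TestTable", schema="dbo"):
    print(pk.column_name)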

SQL Stats II:

Summary

In this blog I’ve demonstrated how you can import the PyODBC module to create a connection, execute a SQL command,
and retrieve data from your SQL Server instance.  I’ve also included examples for retrieving metadata from your
SQL Server instance.  PyODBC is a feature-rich module for accessing a SQL Server database.

See the references below for more information on the topic.


References

Websites:
PyODBC on GitHub:  https://github.com/mkleehammer/pyodbc
PyODBC Documentation:  https://github.com/mkleehammer/pyodbc
MS Channel 9 on PyODBC:  https://channel9.msdn.com/Events/Connect/2016/189

SQL Server: TempDB

Question:

What is TempDB?

Answer:

This is a short blog post on TempDB.  TempDB is essentially a scratch database for SQL Server.  It’s also one of the busiest databases on an instance, if not the busiest.  Learning what it does, and does not do, is important for both the DBA and the developer.  Setting up TempDB is also an important task – sizing it, creating multiple TempDB files, and configuring autogrowth correctly will help make your DBA life simpler.

I’ll start off with a list of facts on TempDB, followed by some DMVs for looking at TempDB usage and performance, and finally a word on troubleshooting.

Some TempDB facts:

  1. Only one TempDB database can exist per instance
  2. Essentially the scratch database for SQL Server
  3. TempDB contains a data file and a log file
  4. You cannot backup TempDB
  5. TempDB uses Simple recovery model
  6. TempDB is cleared upon instance restart
  7. TempDB uses the Primary file group, you cannot add other file groups
  8. You can have multiple TempDB files within the filegroup
  9. Multiple TempDB datafiles should be sized the same, and their Autogrowth size should be the same. Don’t use percentages for autogrowth
  10. Don’t accept the default AutoGrowth configuration for TempDB
  11. TempDB holds three(3) things: user data, internal data, and the Version Store
  12. TempDB doesn’t log redo information, but it does still log undo information. This allows better performance when performing large changes vs. a user database
  13. Table Variables: defined with the “at” sign – @
    • Declare @TempVar Table …
    • Not created “in memory”; they live in TempDB, just like temporary tables
    • Cannot be assigned indexes explicitly, but you can assign constraints, including a primary key or a unique constraint
    • Table Vars do not use statistics, even if a key or constraint is assigned. They use a set value (1) for cardinality
  14. Temporary Tables: defined with a hashtag – “#”, global Temp Tables are defined using a double hashtag – “##”
    • Create Table #TempTable (or ##TempTable) …
    • Unlike Table Variables, Temp Tables contain statistics
    • They can be indexed (clustered, non-clustered)
    • They can be defined as global access, where any session connected to the database can access them
    • Temporary tables are fully transactional aware, table variables are not
    • Temp Tables are limited in scope to the session that created them; once the session is gone, the temp table is dropped (a short sketch contrasting the two follows this list)
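
A minimal sketch contrasting the two declarations from items 13 and 14 (the table and column names are made up for illustration):

DECLARE @TempVar TABLE                        -- table variable: lives in TempDB
(
    OrderID   INT PRIMARY KEY,                -- constraints are allowed; no explicit CREATE INDEX
    OrderDate DATETIME
);

CREATE TABLE #TempTable                       -- local temporary table: also lives in TempDB
(
    OrderID   INT,
    OrderDate DATETIME
);
CREATE CLUSTERED INDEX IX_TempTable_OrderID   -- temp tables can be indexed and carry statistics
    ON #TempTable (OrderID);

DROP TABLE #TempTable;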

# 3 & # 8.  TempDB contains a data file and a log file:

TempDB_Files_SSMS

# 4.  You cannot backup TempDB:

Backup_TempDB_SSMS

# 5.  TempDB uses Simple recovery model:

TempDB_SimpleRecoveryMode

# 6.  TempDB clears upon SQL Server restart:

Before:

TempDB_TemporaryTables_Before

After:

TempDB_TemporaryTables_After

# 7.  TempDB uses the Primary filegroup, you cannot add additional filegroups:

Secodary_FG

# 13 (d).  Table Variables do not contain statistics:

TableVar_Cardinality

# 14 (b).  Temporary Tables contain statistics:

TempTable_Cardinality

# 14 (c).  Temporary Tables can be indexed (Clustered and Non-clustered):

TempTable_Index

One additional item worth mentioning is the Version Store.  TempDB is used for Read Committed Snapshot Isolation (RCSI), and Snapshot Isolation where the row version is stored.  You can read more about TempDB and the version store here:

https://blogs.msdn.microsoft.com/sqlserverstorageengine/2008/12/30/managing-tempdb-in-sql-server-tempdb-basics-version-store-simple-example/

Impacts on TempDB

Each TempDB data file has associated allocation metadata pages, and whenever requests come in to TempDB those pages have to be updated.  This update process is effectively single threaded and can cause contention when many requests are pending.  To enforce data integrity, SQL Server uses a lightweight lock called a latch, which makes sure only one thread is modifying a page at a time.  Because of this, multiple requests can be queued, waiting for their turn to update the page.

Using more than one TempDB data file helps reduce this contention because another file, with its own allocation metadata pages, is available.  While each individual metadata update is still single threaded, the presence of two or more data files spreads out the requests.

You can view wait contention against TempDB using the DMV sys.dm_os_waiting_tasks; the wait types we are concerned with are PAGELATCH_EX and PAGELATCH_UP.  You can dive further into cumulative wait information using the DMV sys.dm_os_wait_stats.
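
As a sketch of that check (TempDB is database_id 2, which is the first number in resource_description for page latches):

SELECT  WT.session_id,
        WT.wait_type,
        WT.wait_duration_ms,
        WT.resource_description               -- format: database_id:file_id:page_id
FROM    sys.dm_os_waiting_tasks WT
WHERE   WT.wait_type IN ('PAGELATCH_EX', 'PAGELATCH_UP')
  AND   WT.resource_description LIKE '2:%';   -- TempDB pages only

SELECT  WS.wait_type,
        WS.waiting_tasks_count,
        WS.wait_time_ms                        -- cumulative since the last restart
FROM    sys.dm_os_wait_stats WS
WHERE   WS.wait_type IN ('PAGELATCH_EX', 'PAGELATCH_UP');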

Size the TempDB data files the same.  TempDB uses a proportional fill algorithm to populate the data files equally, and sizing them the same allows this algorithm to distribute data properly.

Large data files may slow down SQL Server startup after a shutdown has occurred.  Using IFI (Instant File Initialization) can help this process.  Instant File Initialization is set up at the server level.
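
As a sketch of the multiple-file, equal-sizing advice (tempdev is the default logical name of the primary TempDB data file, but the second file’s name, path, and the sizes shown here are assumptions to adjust for your system):

ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, SIZE = 1024MB, FILEGROWTH = 256MB);   -- fixed-size growth, no percentages

ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2,
              FILENAME = 'T:\TempDB\tempdev2.ndf',
              SIZE = 1024MB,
              FILEGROWTH = 256MB);                                     -- sized the same as the first file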

Memory Spills to TempDB

There are a few ways that memory-consuming operations can “spill” over into TempDB.  One is a cardinality mismatch between the actual data and what the optimizer sees.  Memory grants are created by the optimizer based upon the cardinality (the number of rows expected to satisfy the query), which is directly related to the presence of up-to-date statistics.  If the optimizer (cardinality estimator) uses the wrong value for cardinality, it may not grant enough memory to perform the work, and the operation spills to TempDB.  Since disk is far slower than RAM, this can become an issue depending on the query and its execution frequency.
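
One way to catch these spills as they happen is an Extended Events session on the sort and hash warning events; a minimal sketch (the session name is arbitrary, and the ring_buffer target is simply a convenient place to view the output):

CREATE EVENT SESSION TempDB_Spills ON SERVER
    ADD EVENT sqlserver.sort_warning,        -- a sort ran out of granted memory
    ADD EVENT sqlserver.hash_warning         -- a hash join/aggregate spilled
    ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION TempDB_Spills ON SERVER STATE = START;
GO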

Best Practices

  1. Place TempDB on either a fast disk, RAID, or SSD. This should be an autonomous device, containing only TempDB
  2. Use multiple data files within TempDB
  3. Size the data files the same, along with their associated autogrowth values
  4. Don’t use percentages in data file autogrowth
  5. Configure Instant File Initialization (IFI) – for data files, log files cannot use IFI
  6. Watch for latch contention with DMV’s:
    • dm_os_waiting_tasks
    • dm_os_wait_stats
  7. Latch types:
    • PAGELATCH_EX
    • PAGELATCH_UP
  8. Use Table Variables and Temporary Tables sparingly
  9. Table Variables do not use statistics, whereas Temporary Tables do. This can have a direct impact on performance
  10. Keep statistics current
  11. Eliminate unnecessary sorting operations – large sorts might spill into TempDB

Summary:

In this post I’ve explained what you can, and cannot, do when setting up and maintaining TempDB.  I’ve also shown some best practices and described table variables and temporary tables.  TempDB is used more and more by SQL Server to perform many internal functions.  Therefore, setting up TempDB correctly, on the correct drive type, can dramatically increase its performance.

Resources:

Books:

Microsoft SQL Server 2014:  Query Tuning & Optimization, Benjamin Nevarez, McGraw-Hill, 2015.

SQL Server DMVs In Action, Ian W. Stirk, Manning Publications Company, 2011.

MSDN:

Best Practices:

https://blogs.msdn.microsoft.com/cindygross/2009/11/20/compilation-of-sql-server-tempdb-io-best-practices/

General:

https://msdn.microsoft.com/en-us/library/ms190768(v=sql.120).aspx

SQL Server 2014 Cardinality Estimator

Question:

Two-part:  What is different in the new Cardinality Estimator, and how can I check to see which one I’m using?

Answer:

The cardinality estimator (“CE”) was changed in SQL Server 2014.  New assumptions and logic were added based upon modern workloads (OLTP and data warehousing).  According to Microsoft, with these changes most customers see query performance improve, or at least stay the same, while a small number of customers have seen regressions.

Microsoft doesn’t document much about the query optimizer, but their online documentation does explain some of what’s going on.  Their examples compare what happens now (“new”) with what happened under the old estimator (“old”).

Microsoft has provided a couple of ways to change the configuration to enable or disable the new cardinality estimator.  Trace flags 9481 and 2312 disable and enable the new (2014) cardinality estimator, respectively.  You can apply these flags to an individual query using the OPTION (QUERYTRACEON 9481) hint.

SELECT * FROM Sales.SalesOrderDetail SOD
WHERE SOD.SalesOrderID = 3124
OPTION (QUERYTRACEON 9481); -- use legacy CE

Alternatively, to set the entire database to use the new CE, you use the ALTER DATABASE statement.

ALTER DATABASE <DATABASE_NAME> SET COMPATIBILITY_LEVEL = 120;

You could also use DBCC TRACEON with the -1 option to make the trace flag global, or use the -T option in the startup configuration.  This allows you to use the features of SQL Server 2014 while keeping the legacy CE.

So getting back to what changes have actually been made, Microsoft examples show us the following:

  1. Ascending Key Issue: statistics may not contain a specific value because of when they were last updated.  For instance, data is inserted on a daily basis, but the statistics may not have been updated to reflect the latest data, so a WHERE predicate on a date can ask for a date later than anything reflected in the current statistics.
  2. Correlation between filtered predicates on the same table: in a query where you “AND” columns in the WHERE predicate that contain a correlation – such as City and ZipCode – the old CE would not draw a correlation between the two.  The new CE assumes that there is a correlation between the two columns and uses a different algorithm to estimate the data distribution (see the sketch after this list).
  3. Filtered predicates on different tables: rather than making the assumption that filtered columns from different tables “AND’ed” together are correlated, the new CE assumes they are not correlated.  This is based on the modern workloads tested.
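
As a rough illustration of item 2 (not from the original post), the pair of queries below uses two correlated columns from the AdventureWorks Person.Address table; the specific city and postal code values are only examples.  Comparing the estimated row counts in the two plans shows the effect of the correlation assumption.

SELECT *
FROM Person.Address A
WHERE A.City = 'Bothell'
  AND A.PostalCode = '98011'
OPTION (QUERYTRACEON 2312);   -- new CE: assumes some correlation between the predicates
GO
SELECT *
FROM Person.Address A
WHERE A.City = 'Bothell'
  AND A.PostalCode = '98011'
OPTION (QUERYTRACEON 9481);   -- legacy CE: treats the predicates as independent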

Prior to migrating older databases (pre-2014) to the new CE in a production system, Microsoft suggests testing the database against the new CE.  This can be done in a testing environment using the two methods I mentioned earlier in this blog – Alter Database and Trace Flags.  This allows you to identify any potential regressions on a test box, and fix them prior to a production migration.  Any databases migrated from an older version to a 2014 instance will retain the legacy CE by default.

To view the compatibility level of the databases on an instance, query the sys.databases catalog view and check the compatibility_level column:

SELECT DB.compatibility_level AS [Compatibility Level]
             , DB.name AS [Database]
FROM sys.databases DB;

To validate the CE being used within a query, you have a couple options.  One is to include the Actual Query Execution plan.  Consider the query:

SELECT *
FROM SALES.SalesOrderDetail SOD;

SSMS QEP capture 1

Right-click the “Select” operator and choose “Properties”:

Select Properties 1

Under “CardinalityEstimationModelVersion”, you’ll see either 70, for the old CE, or 120 for the new CE.

*NOTE:  70 does not mean that the database is using the CE from SQL Server 7.0; 70 is used for all the legacy versions.  70 = old CE, 120 = new CE.

The other option is to view the XML from the Query Execution Plan:

SSMS QEP capture 1

 

 

Select Properties XML

Right-click in the result window and select “Show Execution Plan XML”.

Once you have the XML screen, find the tag:  StmtSimple.

You should see:

<StmtSimple … CardinalityEstimationModelVersion="70"

To turn the new CE on or off for an individual query, you can use the QUERYTRACEON option.  Here is an example of turning “off” the new CE:

SELECT *
FROM HR.Employees
OPTION (QUERYTRACEON 9481); -- 9481 = old CE

Here is an interesting scenario using the DBCC TRACEON command to set the CE to the old CE:

DBCC TRACEON (9481, -1); -- old CE, make it global
GO

Indeed, if I query any of my user databases on the instance, I’ll see that the legacy CE is
being used. However, if I use the QUERYTRACEON option shown above, with the 2312 value, that query
execution plan will override the global setting and use the new (120) CE. Here is the code:

DBCC TRACEON (9481, -1); -- set to old CE globally
GO
USE TSQLFundamentals2008
GO
SELECT *
FROM HR.Employees
OPTION (QUERYTRACEON 2312); -- option to use the new CE for this query

Oh, and while all this is happening, the Compatibility Level stays the same. Here is the query to prove it:

SELECT DB.compatibility_level AS [Compatibility Level]
            , DB.name AS [Database]
FROM SYS.databases DB
WHERE DB.name IN ('AdventureWorks2014', 'TSQLFundamentals2008')
ORDER BY [Compatibility Level] DESC;

Compatibility Level query output

 

*NOTE:  AdventureWorks2014 is showing a 110 Level because I changed it manually earlier.

So, the TSQLFundamentals2008 database will use the legacy CE, but keep its Compatibility Level of 120 (2014), unless the QUERYTRACEON option is used.

Isn’t science fun!

Summary

In this blog I’ve shown you how to enable and disable the new cardinality estimator.  I’ve discussed how to set this at the instance level, and at the query level using querytraceon.  Finally, I’ve shown how to validate the version of the cardinality estimator being used by including the QEP and looking at the properties, or the XML version of the plan.

There is a lot more to discuss about the new cardinality estimator:  how it acts with filtered predicates; associations derived between related columns; and how joins are computed using the new estimator.  I’ll post another blog with some examples of these.

You can review a white paper written by Joseph Sack that goes into detail on the items I’ve mentioned above.  Check out the references for the link.

Knowledge is Power!

References

Microsoft SQL Server – Optimize your Query Plans with the SQL Server 2014 Cardinality Estimator,  written by Joseph Sack:

http://download.microsoft.com/download/D/2/0/D20E1C5F-72EA-4505-9F26-FEF9550EFD44/Optimizing%20Your%20Query%20Plans%20with%20the%20SQL%20Server%202014%20Cardinality%20Estimator.docx

MSDN:

MSDN blog on the new Cardinality Estimator (Two parts):

https://blogs.msdn.microsoft.com/psssql/2014/04/01/sql-server-2014s-new-cardinality-estimator-part-1/

https://blogs.msdn.microsoft.com/saponsqlserver/2014/01/16/new-functionality-in-sql-server-2014-part-2-new-cardinality-estimation/

SQL Server: Parameter Sniffing

Question:           

What is parameter sniffing, and is it a bad thing?

Answer:

Parameter sniffing is actually a very good thing, most of the time.  When it does go bad, it’s usually extremely bad.  I’ll define, demonstrate, and provide some ways to avoid the bad examples of parameter sniffing.

So what is parameter sniffing?  The optimizer uses statistics histograms to estimate the cardinality of a given query.  One of the things the optimizer does is evaluate the values of the parameters being passed in.  This is “parameter sniffing.”

When the query or stored procedure containing a parameter is executed, the value initially supplied for the parameter is used when creating the execution plan.  Subsequent calls to this query/SP will use this execution plan stored in cache.  So, if the parameter value is one with high selectivity and represents the data as a whole, this is good.  However, if a value is passed that is not representative of the data, the query may suffer in performance.

The best way to understand this is by creating an example.   Using the AdventureWorks20XX database, a good example table is the Sales.SalesOrderDetail.  I’ll perform a query to list the ProductID’s distribution, create a stored procedure with one input parameter, and then execute the procedure to demonstrate good, and bad, parameter sniffing.  Finally, I’ll discuss some of the most popular ways to avoid bad parameter sniffing scenarios.

Query the Sales.SalesOrderDetail table to get a count distribution for the ProductID values:

SELECT SOD.ProductID
            , COUNT(*) as [Count]
FROM   SALES.SalesOrderDetail SOD
GROUP BY SOD.ProductID
ORDER BY COUNT(*) DESC;
GO

This returns:

ProductID           Count

870                       4688
712                       3382
873                       3354
many more rows
943                       6
942                       5
897                       2

So, from this query, we can see that ProductID 897 has only two entries, while ProductID 870 has 4,688 entries!  That’s quite a difference; let’s see if it matters.

Let’s create a stored procedure to query the Sales.SalesOrderDetail table on ProductID:

CREATE PROCEDURE DBO.TEST (@prod_id int)
AS
SELECT *
FROM [Sales].[SalesOrderDetail] SOD
WHERE SOD.ProductID = @prod_id;

Now, let’s execute it and pass the parameter value of 897:

EXEC DBO.TEST @prod_id = 897;  -- 2 records

key-lookup plan

The execution plan reveals a key lookup.  For this query, the plan is good.  Remember though, key lookups require a lot of IO, so this may not be the best plan in all cases.

Looking at the Select operator properties, we see that the parameter – @prod_id was set to 897 when compiled as we would expect.

Select operator properties

Now, let’s take a look at an execution with a different parameter value – 870 – and for grins, let’s also turn Statistics IO on:

-- CLEAR THE CACHE
DBCC FREEPROCCACHE
GO
SET STATISTICS IO ON;
GO
EXEC DBO.TEST @prod_id = 870;  -- 4688 records
GO
SET STATISTICS IO OFF;
GO

Whoa!  While the optimizer used the existing execution plan in cache (key lookup), take a look at the logical reads it performed – 14,380!

(4688 row(s) affected)
Table ‘SalesOrderDetail’. Scan count 1, logical reads 14380, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The Select operator properties shows us that the compiled value for the plan generation was 897, but the runtime value was, of course, 870.

Select operator properties - high logical reads

This would be an example of parameter sniffing gone bad.  Executing the stored procedure with a value that has a high selectivity (2 records in our case) will cause the optimizer to choose an execution plan that may not perform well when other parameter values are entered.

In this example the compiled value of 897 produced the key lookup plan.  When the parameter value of 870 was entered, the optimizer said, “Oh, here’s an existing plan that uses a key lookup, cool, I’ll use it!”  And boom!  Instead of doing a key lookup of 2 records, we are now doing a key lookup for 4688 records!  The Statistics IO values returned shows the high IO due to the use of the same plan.

So, before we discuss what can be done to avoid this type of issue, let’s ask one more question:  What if I recompile the plan?  Okay, let’s do that and see what happens.

To cause a manual recompile of the stored procedure, we alter the stored procedure:

ALTER PROCEDURE [dbo].[TEST] (@prod_id int)
AS
SELECT *
FROM [Sales].[SalesOrderDetail] SOD
WHERE SOD.ProductID = @prod_id
OPTION (RECOMPILE);

Okay, so now when dbo.TEST is executed, it will recompile and use the newly passed parameter value to generate an execution plan.  Let’s execute it and check this functionality.

EXEC DBO.TEST @PROD_ID = 870;  -- 4688 records

Clustered Index Scan - option recompile

Now we see the newly generated execution plan, based upon the parameter value of 870.  This plan uses a Clustered Index Scan operation.  Let’s take another look at the stored procedure execution using Statistics IO:

(4688 row(s) affected)

Table ‘SalesOrderDetail’. Scan count 1, logical reads 1242, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you might expect, the number of logical reads has dropped dramatically, lowering the IO and thus improving performance for this parameter value.

The downfall of this type of operation, where we recompile each time the stored procedure is called, is that recompiles have overhead, CPU cycles, etc.  So is this the only way to avoid the case of bad parameter sniffing?  Nope.

Solutions:

Options are available to assist with the parameter sniffing issue.  Of course, we could use the example above where we recompile the statement for each execution, but there are other alternatives.

  • OPTION (RECOMPILE) – already presented
  • OPTION (OPTIMIZE FOR (@VARIABLE = VALUE))
  • OPTION (OPTIMIZE FOR (@VARIABLE = UNKNOWN))
  • Use local variables

The OPTION (OPTIMIZE FOR (@VARIABLE = VALUE)) tells the optimizer to compile the statement using the “VALUE” vs. using whatever is passed as a parameter.  This option may be useful when calling the stored procedure over-and-over with a particular value.

Here is an example of Optimize for VALUE:

ALTER PROCEDURE DBO.TEST (@prod_id int)
       AS
       SELECT *
       FROM [Sales].[SalesOrderDetail] SOD
       WHERE SOD.ProductID = @prod_id
OPTION (OPTIMIZE FOR (@prod_id = 897));

If we execute the SP and pass value 870:

EXEC DBO.TEST @prod_id = 870;  -- 4688 records
GO

We see the following plan:

Key Lookup Plan Option Value

The execution of the SP essentially ignores the passed value of 870 and uses the plan optimized for 897 instead.  This option only makes sense if values like 870 aren’t passed often.

The OPTION (OPTIMIZE FOR (@VARIABLE = UNKNOWN)) allows the optimizer to use statistical densities rather than statistical histograms.  All parameters use the same estimated record count.  Some query plans may not be optimal because densities are not as accurate as histograms.  When this option is used, parameter sniffing is turned off.

Here is an example of Optimize for Unknown:

ALTER PROCEDURE DBO.TEST (@prod_id int)
       AS
       SELECT *
       FROM [Sales].[SalesOrderDetail] SOD
       WHERE SOD.ProductID = @prod_id
OPTION (OPTIMIZE FOR (@prod_id = UNKNOWN));

If we execute this SP, passing both the 870 and 897 values, we see something interesting:

Clustered Index Scan - option recompile

Both values use the same plan – a Clustered Index Scan.  Looking at the XML output of the plan, we see StatementEstRows under the StmtSimple tag:

<StmtSimple StatementCompId="3" StatementEstRows="456.079"

This estimated row count is the same for both parameter values passed.  As the definition states, “the same estimated row count” is used for all parameters.

Use Local Variables – when you assign the parameter passed to the stored procedure to a local variable, the optimizer uses the same number of estimated rows for each parameter value.

When we use this option, we are effectively “turning off” parameter sniffing.

Here is an example of local variable usage:

CREATE PROCEDURE DBO.TEST2 (@prod_id int)
       AS
       DECLARE @PRODUCT_ID INT = @prod_id;  -- assign parameter to a local variable
       SELECT *
       FROM [Sales].[SalesOrderDetail] SOD
       WHERE SOD.ProductID = @PRODUCT_ID;

Summary

In this post I’ve described parameter sniffing, and shown that it is a normal operation performed by the optimizer.  Using the knowledge gained in these examples, and one or more of the solutions presented, you should be able to identify and remedy any suspect stored procedures.  The takeaway from this is to have alternatives when troubleshooting stored procedures where certain parameter values give less-than-optimal results.

Resources

Microsoft MSDN – SQL Server:
https://msdn.microsoft.com/library/mt590198.aspx

Microsoft MSDN – Stored Procedures:
https://msdn.microsoft.com/en-us/library/ms187926.aspx

Microsoft MSDN – Statistics and Plan Cache:
https://msdn.microsoft.com/en-us/library/cc293624.aspx
https://msdn.microsoft.com/en-us/library/ms190397.aspx

Book:
Microsoft SQL Server 2014: Query Tuning & Optimization, Benjamin Nevarez, McGraw-Hill, 2015.

SQL Server: Query Optimizer

Question:

How does the query optimizer choose a plan for my SQL query?

Answer:

First, a disclaimer: Don’t perform these commands on any production or unit test server – EVER. These scripts are used for training; so DO NOT package them up and run them in production!

The optimizer in SQL Server is a cost-based optimizer. That is, the optimizer assigns a cost to all the candidate plans generated within a period of time, and chooses the lowest-cost plan for execution.  There are many steps, and many bits of information, that the optimizer uses to do this.

Thinking like the optimizer:

    • Does the query qualify for a Trivial plan or is Full optimization necessary?
    • Does the underlying table have a Clustered index? Or is it a Heap?
    • Are the columns contained in the Where predicate indexed?
    • Is the predicate sargable?
    • Do the underlying Index[es] have current statistics?
    • Are there joins involved in the query?
    • Is there an Order By clause in the query?
    • Are aggregates being used in the WHERE predicate?
    • others

The query optimizer’s job is to find the best overall plan in a given amount of time.  Since queries with multiple joins could have thousands of potential candidate plans, generating a “perfect” plan would be prohibitive in both time and memory.  Many times the optimizer will stop the optimization process with a “good enough plan found.”

Your SQL statement goes through four stages on its way to being executed:

Parse -> Bind -> Optimize -> Execute

Parsing: the SQL statement is checked for correct syntax, spelling, and keyword usage.  A parse tree is returned as  a result of this operation.

Binding: the parsed tree is then checked against database metadata to see if columns exist, and if the query is semantically correct.

Optimize: once the query tree is validated, it is passed to the optimizer, which will create either a trivial plan or perform full optimization, where several alternate plans are evaluated. This involves several phases: Simplification, Trivial Plan, Auto-Stats, Exploration (three stages), and conversion to an executable plan.

Execution: the plan chosen by the optimizer is executed.

Let’s take a look at each bullet point above, and explain a few basic actions of the optimizer with examples.

Trivial plan or Full optimization?

One of the first operations of the optimization process, is to determine if the query qualifies for a Trivial plan.  A trivial plan is one in which there really isn’t any other way to execute the query – like:

SELECT * FROM SOME_TABLE;

The optimizer will go out and scan the table and return the results to the client. Even a query like this, against a Heap (table without a clustered index), results in a trivial plan:

SELECT CT.CustomerID
              ,CT.AccountNumber
FROM DBO.CUSTOMER_TEST CT
WHERE CustomerID = 10

As the query becomes more complex, with table joins, or complex Where predicates, the optimizer may choose a Full optimization plan. Defining each step of full optimization is beyond the scope of this blog.  I will, however, touch on a couple points involved.

Does the underlying table have a Clustered Index or is it a Heap?

When a table is created, it is created as a heap structure by default. In a nutshell, a heap is an unorganized structure where records are not stored in any particular order.  By comparison, creating a table and assigning a clustered index implements a B+ tree structure for the data.  Heaps are great if you’re doing massive inserts, or using them as log tables.  Heaps are not so good for running OLTP queries, though.  When you query a heap, SQL Server has to scan the table to find a particular value identified in the Where predicate.  Imagine that on a table that contains more than a million rows!  Here is something to remember:  Scan = BAD, Seek = Good!  At least 99% of the time this is true.

Heaps are also fragmentation magnets; it has to do with the page allocation and deallocation mechanism. So it is best not to use a heap unless you have a special need, and you have tested its use extensively (that is, before you go to production!).

Creating a table and assigning a carefully chosen clustered index is usually the best alternative.  The clustered index stores data in the order of the clustering key.  Additionally, in a clustered index, the leaf page is the data.  Conversely, a non-clustered index has a Key or Row ID that points to the data in a clustered index or a heap, respectively.  This is known as a “bookmark lookup” and it is an expensive operation.

A final note on clustered indexes: a primary key constraint is created as a clustered index by default. You can also explicitly create the primary key as non-clustered.  You may see this scenario when a surrogate key is created as the clustered index, and the primary key is created as a non-clustered, unique constraint on the column, or set of columns, that identifies the business rule being applied.

T-SQL code for finding heaps:

USE AdventureWorks2014        -- enter your database
GO
SELECT T.name AS [TableName],
                I.name AS [IndexName],
                I.type_desc AS [IndexType]
FROM SYS.indexes I INNER JOIN SYS.tables T ON
                I.object_id = T.object_id
--  WHERE I.type_desc = 'HEAP'  -- uncomment to view HEAPS only
ORDER BY TableName, IndexType
GO

Are the columns contained in the Where predicate indexed?

If not, they probably should be. When building the index on the predicate columns, place the equality predicates first; that is, for Where Column_a = b and Column_b > c … Column_a would be placed as the first, or left-most, column in the index.  If you have a predicate with several columns, SQL Server can use a mechanism called index intersection.  This works when there are other indexes present and they are joined together to satisfy the query.

A “covering index” can also be used. The covering index is created using the columns of the predicate and including those columns also present in the Select statement.

SQL:

SELECT COL1, COL2, COL3
FROM MyTable
WHERE COL4 = value and COL5 = value;

CREATE INDEX IDX_Covering ON MyTable (col4, col5)   -- Where predicate columns
INCLUDE (col1, col2, col3);                         -- Select statement columns

Since the covering index contains all the data columns needed to satisfy the query, it will only access the index. There is no need to access the base table, thereby reducing I/O.

Is the predicate Sargable?

When a predicate is [S]earch [ARG]ument able, it is able to use an existing index to satisfy the query.  Some predicates are non-sargable due to data type conversion, or other T-SQL functions wrapped around a column within the predicate.  This renders any existing index unusable by the optimizer.  No index = no seek operations.  Remember: seek = good, scan = bad!  (A short sketch follows the operator lists below.)

Generally, these operators are Sargable: =, >, <, >=, <=, BETWEEN, and LIKE 'xyz%'

These aren’t: !=, !<, !>, NOT IN, NOT EXISTS, NOT LIKE, and LIKE '%xyz%'
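
A short sketch of the idea against a hypothetical dbo.Customers table with an index on LastName: wrapping the column in a function hides it from the index, while the leading-literal LIKE form can seek.

-- Non-sargable: the function applied to the column prevents an index seek
SELECT CustomerID, LastName
FROM dbo.Customers
WHERE LEFT(LastName, 3) = 'Smi';

-- Sargable rewrite: the bare column with a leading literal can use an index seek
SELECT CustomerID, LastName
FROM dbo.Customers
WHERE LastName LIKE 'Smi%';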

Do the underlying indexes have current statistics?

Since SQL Server uses a cost-based optimizer, it relies heavily on current statistics when analyzing a SQL statement.  Statistics are created automatically, and SQL Server can auto-generate new statistics if they qualify as being out of date.  One of the steps the optimizer performs during full optimization is to load statistics.  The optimizer will then have access to values such as:  the table size, record distribution, cardinality, etc. which it uses to derive alternative plans during the Exploration phase of optimization.

Statistics are created automatically when the AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS options are set.  These are both “on” by default, and in most cases this is fine.  The DBA may have a particular reason for turning one, or both, of these options off.  Automatic recomputation can also be turned off during index creation – using a WITH clause.

CREATE NONCLUSTERED INDEX IDX_Person_ModifiedDate
ON Person.Person (ModifiedDate)
WITH (STATISTICS_NORECOMPUTE = ON);

It is best to check with your DBA concerning statistics questions for your database.

Are there joins present in the query?

If so, are there indexes placed on the columns used in the join predicate? If not, you may see significant performance issues, specifically with I/O.  Indexes built on the foreign key columns will enhance performance similarly to creating an index on a predicate column on a regular table.
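
For example, here is a sketch against the AdventureWorks Sales.SalesOrderDetail table (the index name is made up); indexing the foreign-key column used in the join looks just like indexing any other predicate column:

CREATE NONCLUSTERED INDEX IX_SalesOrderDetail_ProductID
    ON Sales.SalesOrderDetail (ProductID);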

Is there an Order By used in the query?

Sort operations can be expensive. One way to avoid a sort operation is to use an index with the prescribed sort order already in place.

Consider the query:

SELECT [OrderDate]
              ,[PaidDate]
              ,[CustomerID]
FROM [JProCo].[dbo].[SalesInvoice]
WHERE CustomerID = 20
ORDER BY PaidDate DESC;

Note that there is a primary key and a non-clustered index present on the table – SalesInvoice. No index exists on CustomerID, nor on the ORDER BY column – PaidDate.  The plan below shows a sort (60% of the cost) and a key lookup into the clustered index.  If we build an index on the Where predicate column and the ORDER BY column, in the sort order (descending) specified, maybe we can eliminate the sort and the key lookup.

sql without index

Here is the create index statement:

CREATE INDEX IDX_TEST_PAIDDATE
   ON DBO.SALESINVOICE (CUSTOMERID, PAIDDATE DESC);

Here is the new execution plan:

sql with index noncovering

Well, the sort operator is gone, but we still have the key lookup. Let’s see if we can improve this by creating a covering index, which will be the same as the previous index definition but will include the column OrderDate.  First, we will drop the index IDX_TEST_PAIDDATE and recreate it as a covering index.

Here is the create index statement:

CREATE INDEX IDX_TEST_PAIDDATE
   ON DBO.SALESINVOICE (CUSTOMERID, PAIDDATE DESC)
INCLUDE (ORDERDATE);

Here is the new execution plan using the covering index:

sql with covering index

That’s better! Now we have an index [only] seek with no key lookup into the base table, and no sort operation.  Eureka!

Please note that the optimizer still chooses what it considers the best plan. You still have to do the work of testing, and re-testing scenarios to see if an index might help your query.

Are aggregates being used in the WHERE predicate?

Aggregate and date functions – SUM(), AVG(), YEAR() – are all great tools, but they may also cause lackluster performance when used under certain circumstances. If you are writing a report, or pulling data for a chart, these functions can be very helpful.  However, if you are running an OLTP application with very tight SLAs on your transactions, using these functions in a predicate can decrease performance.

Consider the following queries:

-- #1
SELECT SOH.SalesOrderID
             , SOH.OrderDate
FROM Sales.SalesOrderHeader SOH
WHERE  OrderDate >= '2011-01-01 00:00:00'
    AND
               OrderDate <  '2012-01-01 00:00:00';

-- #2
SELECT SOH.SalesOrderID
             , SOH.OrderDate
FROM Sales.SalesOrderHeader SOH
WHERE YEAR(OrderDate) = 2011;

Both return the same values, but query #2 uses the YEAR() function to retrieve records for the year 2011. In contrast, query #1 uses a range predicate to do the same thing.  The execution plans are nearly identical – they both use clustered index scan operations.

The optimizer will recommend a non-clustered index for query #1, using OrderDate and including the SalesOrderID column. This new index results in a new execution plan that uses an index seek operation vs. an index scan.  Query #2 will receive no such recommendation, as an index won’t help it.

Summary

I’ve presented some of the processing that the query optimizer uses to select an execution plan. Query tuning is a very broad topic, and I have merely scratched the tip of the iceberg here.  Hopefully, this will give you some insight into your own queries, and help eliminate some low-hanging fruit you may have right now.

Many things can affect performance: network issues; disk latency; waits – locking and latching; memory pressure, etc. Knowing how to diagnose the problem quickly is the key, so you must know what to rule out during the troubleshooting process.

I’ll be writing more blogs on the optimization, wait stats, and locks and blocking, with corresponding TSQL to assist you in finding these performance bottlenecks.

Knowledge is Power!

PowerShell: Create a Database Inventory

Question:

How can I use Powershell to create an inventory of my databases?

Answer:

You can use SQL Server’s SMO (SQL Server Management Objects) classes to create an inventory of database objects. Once generated, you can output the result to the terminal in grid format, or to other formats such as CSV or text files.

Specifically, we will use the namespace Microsoft.SqlServer.Management.Smo.  The Server class within this namespace provides a wealth of information for the user to query.

I prefer building functions vs. simple scripts, since you can import them into a module (function library) as you build your repertoire.  Start by declaring the function:

function Get-DBInventory
{
[CmdletBinding()]
param (
    [Parameter(Mandatory = $True,
               ValueFromPipeline = $True)]
    [string] $computerName,
    [switch] $ToScreen
)

I’ve defined our first parameter as mandatory, as well as being able to receive its value from the pipeline (note that the [Parameter] attribute sits inside the param block, attached to $computerName). The second parameter is a switch type, used as a Boolean.  The $ToScreen variable will be set to $true or $false depending on where the user prefers the output to be generated.  By including the -ToScreen argument at the command line, the variable is set to $true.

Next, we need to import the SQLPS module:

BEGIN
{
Import-Module sqlps -DisableNameChecking | Out-Null
}

Next, we enter the Process block to do actual work. I’ll explain the snippets as we go.

PROCESS
{
$server = New-Object -TypeName Microsoft.SqlServer.Management.Smo.Server -ArgumentList $computername -ErrorAction Stop
$folder = "c:\temp"
$filename = "DB Inventory.csv"
$fullpath = Join-Path $folder $filename
$result = @()

In the example above, I’ve created a new object of type Smo.Server and assigned it to the variable $server. The -ArgumentList $computername tells PowerShell which server to use, and -ErrorAction Stop is what I want to happen if there is an error creating the object.  This is necessary when using a try/catch block.

I’ve also set up a hard-coded output file name and path (c:\temp) for the case where the user does not choose screen output. $result is defined as an empty array to hold the output of the next section of code.

foreach ($db in $server.Databases)
{
    Write-Verbose "Working DB: $db"
    $item = $null
    $hash = @{ "DBName"             = $db.name
               "CreateDate"         = $db.CreateDate
               "Owner"              = $db.owner
               "RecoveryModel"      = $db.recoverymodel
               "Size/MB"            = $db.size
               "DataSpaceUsage/KB"  = $db.dataspaceusage
               "IndexSpaceUsage/KB" = $db.indexspaceusage
               "Collation"          = $db.collation
               "UserCount"          = $db.users.count
               "TableCount"         = $db.tables.count
               "SPCount"            = $db.storedprocedures.count
               "UDFCount"           = $db.userdefinedfunctions.count
               "ViewCount"          = $db.Views.Count
               "LastBUPDate"        = $db.lastbackupdate
               "LastDiffBUPDate"    = $db.lastdifferentialbackupdate
               "LastLogBUPDate"     = $db.lastlogbackupdate
             }
    $item = New-Object PSObject -Property $hash
    $result += $item
}

The foreach loop above goes through all the databases found in the $server object. I’ve defined a hashtable (key-value pairs) to assign the output heading (key) and the information (value) found in the database variable – $db, using the properties found in the Database class.  *NOTE:  you can go to MSDN, look up these classes, and specify your own properties to query.

An $item variable is created as a PSObject and assigned the values of the $hash. The $item variable is then appended to the array – $result.  The array will contain all the defined values for the specific database being queried.

Next is the output generation code.

if ($ToScreen)
{
    $result |
        Select-Object DBName, CreateDate, Owner,
            RecoveryModel, 'Size/MB', 'DataSpaceUsage/KB',
            'IndexSpaceUsage/KB', Collation, UserCount,
            TableCount, SPCount, UDFCount, ViewCount, LastBUPDate,
            LastDiffBUPDate, LastLogBUPDate |
        Out-GridView -Title "SQL Server Inventory: $computername"
}
else
{
    $result |
        Select-Object DBName, CreateDate, Owner,
            RecoveryModel, 'Size/MB', 'DataSpaceUsage/KB',
            'IndexSpaceUsage/KB', Collation, UserCount,
            TableCount, SPCount, UDFCount, ViewCount, LastBUPDate,
            LastDiffBUPDate, LastLogBUPDate |
        Export-Csv -Path $fullpath -NoTypeInformation
}

The Boolean $ToScreen is checked; if $true, the output is sent to the screen, and if $false, the output is sent to a CSV-formatted file.  In either case, the $result array is piped to the Select-Object cmdlet, where the output columns (the keys) are selected.  These columns are then piped to the appropriate output cmdlet – Out-GridView or Export-Csv – for viewing on the screen or saving to a .CSV file, respectively.

Finally, we close out our function definition:

} # End PROCESS block
END {}
} # End Function
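
Assuming the function has been dot-sourced or imported from a module, calling it might look like this (the instance name is a placeholder):

# send the inventory to the interactive grid view, with verbose progress
Get-DBInventory -computerName "localhost" -ToScreen -Verbose

# or write it to c:\temp\DB Inventory.csv instead
Get-DBInventory -computerName "localhost"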

Here is the grid output of the DB inventory on my laptop. *Note:  I’ve removed the owner and createdate columns.  You can use the Add Criteria to filter the returned data, or the filter text box at the very top left corner.  You can also select which columns to display by right-clicking any of the column headers.

DBinventory