Monday, April 25, 2011

Wait Stats and The Rest of the Story

In my interactions with customers and evaluators of SQL Sentry, I occasionally come across DBA's who have a disproportionate focus on SQL Server wait stats as their primary source for performance analysis.

I understand the attractiveness of wait stats as a way to monitor and improve query performance.  However, I think it is very important to note that while wait stats are a great place to start when looking for performance bottlenecks, very seldom do they provide a complete picture.  This is why our intelligent wait stats analysis is part of, but not the foundation of, our product.  There are many wait stats that have no relevance to performance at all and can tend to clutter the picture.  Others aren’t obvious as to what specific resource or issue they are related, or can even be related to multiple issues at the same time.

“What am I waiting on?” vs. “Why am I waiting?”

Wait stats tend to suggest what a query or queries are waiting for, but not why they are waiting.  In order to get the complete picture, wait stats need to be correlated with other metrics.  This means that ultimately, methods or tools that focus almost exclusively on wait stats routinely require the use of additional tools and manual effort to get to the root cause of an issue and resolve it.  This tends to become a rather inefficient means of performance tuning.

A common example we’ve seen in several environments involves a server showing high disk IO waits.  If you were to only look at wait stats, seeing frequent high disk IO waits would most likely lead you to suspect a bottleneck with the disk system on the server.  Unfortunately, it is quite possible that simply looking at these waits could lead you down the wrong path.  This is because, commonly, the root cause of the performance bottleneck on the server was not the disk system at all, but memory pressure.

Taking a look at the bigger picture and examining other metrics such as Page Life Expectancy quickly revealed that memory pressure was resulting in excessive paging and thrashing of the disk system, which in turn resulted in high Disk IO waits.  Using SQL Sentry, we could easily correlate a spike in Disk IO waits with a drop in page life expectancy, cache hit ratios, and increased read or write paging activity.  In a case like that, the real bottleneck was occurring in memory and the disk performance issues seen in the wait stats were merely a symptom of the issue rather than the cause.

High parallelism waits are another common issue that often is not as simple to diagnose and resolve as it may first seem.  In fact, CXPACKET waits are often the highest wait type on many systems.  Paul Randal (blog|twitter) conducted a survey to this effect recently on his blog at http://www.sqlskills.com/BLOGS/PAUL/post/Survey-what-is-the-highest-wait-on-your-system.aspx

First you need to know that CXPACKET waits refer to parallelism.  This is only one of many wait stats that aren’t readily obvious what they refer to by name alone.  Next, parallelism waits are not always so directly attributed to CPU bottlenecks as you might think.  Simply setting your MAXDOP may or may not improve the performance of the query running at the time of the wait stats collection, but that may still only be treating a symptom instead of the root cause.  I’ve seen that other resource pressure like disk or memory can actually lead to increased parallelism waits too.  Again, without a complete picture of your server’s performance and activity like SQL Sentry provides, you have no choice but to resort to additional tools and manual effort to track down and resolve the issue effectively.

Query Level Optimization

Common root causes of parallelism waits can be missing indexes, outdated statistics and generally inefficient queries.  Examining wait stats at the query level may tell you what the query was waiting on, but it still doesn’t tell you why.  I have found that execution plan analysis (one of the new features in SQL Sentry Version 6) is a much more effective means of investigating query performance than query-level wait statistics.  Our Plan Analysis features not only collect execution plans for long-running and poorly-performing queries, but also provide greatly expanded visibility into the plans themselves.  Since complex query plans can become extremely difficult to diagnose, having an effective tool to analyze them can be as important as having the plan in the first place! 

With SQL Sentry, I can use the combination of performance counter data, wait stats, and top SQL collection to identify my worst performing queries and resource bottlenecks, then dig deeper using Plan Analysis to resolve those issues as discussed in Product Manager Greg Gonzalez’s (blog|twitter) blog post at http://greg.blogs.sqlsentry.net/2011/02/sql-sentry-v6-plan-analysis-features.html.  All of this from one tool!

Hopefully this has served to shed some light on the pros and cons of wait stats analysis, and the thinking behind many of the features SQL Sentry provides.  Ultimately, everything we do is designed to make all our lives easier as SQL Server DBA's and developers.