Wednesday, February 1, 2012

Part 5: An Introduction to SSAS Performance and SQL Sentry Performance Advisor for Analysis Services

Up to this point in the series we’ve been focusing on common activity in Analysis Services, how to identify the bottleneck for that activity, and how to resolve the bottleneck. In the last post we specifically discussed how to identify if the Storage Engine (SE) or Formula Engine (FE) was the larger bottleneck. Now I want to talk more about identifying bottlenecks with the server’s physical resources themselves.

The four basic areas to investigate when it comes to server resources are:

· CPU
· Memory
· Disk
· Network

A lot of these issues, and methods for identifying them, are common to Windows server performance in general, but I’ll include pertinent SSAS specific performance details and metrics too.

CPU
With regards to any Windows server, there are a handful of counters that provide good indication that the bottleneck with your server may be related to the CPU(s).

· Processor: % Processor Time
· System: Context Switches/sec
· System: Processor Queue Length

There is already tons of information available online with regards to general processor monitoring and optimization so I won’t say much more here. While there is no magic number that indicates what “good” CPU utilization is, you generally want to get the most utilization without overburdening the system, so look for sustained periods of 100% utilization on one or all cores to suggest a bottleneck here.

The other two counters, when abnormally high for sustained periods, usually suggest an issue with too many parallel operations on the server.
So let’s take a look at some Analysis Services performance metrics that shed some light on processor utilization. There are two sets of SSAS performance counters that apply to the formula and storage engines.

· Threads: Query pool
· Threads: Processing pool

Don’t let the names confuse you. The query pool refers to FE activity. The processing pool not only refers to processing, but any SE activity. Each of these categories contains counters for Busy, Idle, Job Rate, and Queue Length. They allow you to see the thread activity for each engine. If you are seeing consistently high queue lengths, but not high CPU utilization you many want to adjust your MaxThreads and/or CoordinatorExecutionMode properties for your SSAS instance. More details on these settings are covered in section 6.11 of the SSAS 2008 Performance Guide, as well as tip 8 of the SQLCAT Analysis Services Query Performance Top 10 Best Practices.

Remember that the FE is single threaded, so increasing the query pool setting may not improve performance of any one query, but may improve the performance in handling multiple simultaneous requests.

Memory
There are three groups of metrics when it comes to monitoring Analysis Services memory.

· Overall usage – how much total memory is SSAS using on the server.
· Cache hit ratios – how efficient is the use of that memory.
· Cache activity – what is happening to the memory.

SSAS Memory Usage

· Memory: Memory Usage KB

This is the total memory usage for the server process, and should be the same as the Process: Private Bytes counter for msmdsrv.exe.

NOTE: Do NOT rely on Task Manager for an accurate picture of memory usage.

· Memory: Cleaner Memory KB
· Memory: Cleaner Memory shrinkable KB
· Memory: Cleaner Memory nonshrinkable KB

These counters refer to the background cleaner for SSAS. The first counter refers to the amount of memory known to the background cleaner. That memory is then divided into shrinkable and nonshrinkable memory. This describes what portion of that known memory is subject to purging by the cleaner based on memory limits. The cleaner value is likely to be a bit lower than the total usage value, but it’s important to know because this lets you know how much room you have to actually work with when it comes to memory management. The limits the cleaner works with are defined by properties indicated by the following two counters.

· Memory: Memory Limit Low KB
· Memory: Memory Limit High KB

A great explanation of these properties and counters, along with real world examples of their use is covered in Greg Gonzalez’ (b|t) blog post, Analysis Services Memory Limits.

SSAS Cache Hit Ratios
Remember from previous posts that the FE and SE each have caches. There is the Calculation and Flat caches for the FE, and Dimension and Measure Group caches for the SE.

The counters that allow you to determine cache efficiency are all in the Storage Engine Query category.

· Calculation cache lookups/sec, hits/sec
· Flat cache lookups/sec, hits/sec
· Dimension cache lookups/sec, hits/sec
· Measure group cache lookups/sec, hits/sec

While there is no persistent cache hit ratio counter itself for these caches as there is for SQL Server, these metrics will allow you to calculate the ratio for each cache for a given point in time.

SSAS Cache Activity
One last group of counters to consider relate to overall cache activity.

· Cache: Inserts/sec
· Cache: Evictions/sec
· Cache: KB added/sec
· Memory: Cleaner: Memory shrunk KB/sec

These metrics give a pretty direct indicator of memory pressure on the server. If the Evictions/sec and/or Cleaner Memory shrunk KB/sec are consistently non-zero, you likely have memory pressure on the server. The Cleaner: Memory shrunk counter in particular indicates that you are exceeding your defined memory limits described earlier.

How do I improve cache usage?
In addition to improperly configured memory limits as discussed in Greg Gonzalez’ blog referenced earlier, another common Analysis Services memory related performance issue is a cold cache. You’ll remember in Part 2 of this series we discussed cube processing and the fact that data in cache becomes invalidated and flushed after processing occurs. This means that the next time a query is executed, no data will be in cache, resulting in queries to the file system, and a dramatic hit to that query’s performance. A common scenario involves nightly processing that leaves the cache cold in the morning. The first person in the office, which often times is someone high up the management chain, runs an important report that takes forever to run. The point here is that some of the people you least want to experience performance issues may be the ones most likely to see them under this scenario. So what do you do?

The answer is cache warming. There are different ways to warm the cache. The simplest is to run a couple of the most commonly used queries after processing to pull the most likely needed data into cache. A more in depth description of the process is covered in the Identifying and Resolving MDX Query Performance Bottlenecks whitepaper I mentioned in part 1, as well as an outstanding blog post by Chris Webb (b|t) on Building a Better Cache Warmer.

Disk
In order to determine if the disk system is a bottleneck for your SSAS instance, you need to first verify that Analysis Services is indeed accessing the disk system. I mentioned in the first post that the SE accesses the file system. So check the previously mentioned counters to verify the SE is active. A more specific counter is:

· MSAS: Storage Engine Query: Queries from file/sec

Now remember, unlike the relational engine, SSAS data may be in the Windows file cache. This means that even when the above counter is non-zero, it alone does not guarantee physical disk IO. You’ll want to examine the following three counters to get an idea of how much of the SE activity is actually reading from disk as opposed to the Windows file cache.

· MSAS: Storage Engine Query: Data bytes/sec
· Physical Disk: Disk Read Bytes/sec
· Cache: Copy Reads/sec

Be sure to account for activity outside of SSAS when using these metrics. For a much more detailed explanation with great screen shots of PA for SSAS, see Greg Gonzalez’ blog post on the subject.

So once you’ve determined Analysis Services is incurring physical disk IO, what should you check to ensure the disks are performing optimally? There are many different counters available for disk performance and some are more useful than others. For a long time, disk queue length was considered an important metric, but as server storage grew to include more and more spindles, was moved to SAN’s, or included SSD’s, this metric has become less meaningful. A more universal indicator for disk performance is latency.

· Physical Disk: Avg. Disk sec/Read
· Physical Disk: Avg. Disk sec/Write

Optimally these should remain below 10 ms. As you approach 20 to 30 ms or more, you’re going to notice performance issues related to the disk system. This isn’t specific to SSAS, but more of a general server guideline. There is tons of material available online that focus on disk performance.

How do I improve Disk Performance?
Partitions are essential to optimizing disk performance. Partitions based on the way the data is most likely queried, such as by timeframes, will help reduce the amount of data that must be retrieved from disk for a given query. It can also help you take advantage of parallelism when multiple queries are submitted, as well as when processing. Remember to distribute your partitions properly to spread the load across multiple disks.

You may also want to disable Flight Recorder. Without going into the pros and cons of why you do or don’t want Flight Recorder, many articles suggest a performance improvement by disabling it. It is basically a file based trace on your system and will increase IO. It can be disabled in the properties for the SSAS instance.

There are a couple other things worth mentioning not specific to SSAS. First note the location of your cube’s files. If they are sharing spindles with system files, or other busy databases, you’re likely to run into contention. The same thing applies to SAN allocation. This is often harder to investigate without the help of your SAN administrator, but be sure you’re not sharing busy spindles on the SAN either.

Finally, watch for partition misalignment issues that can have a significant impact. In Performance Advisor’s Disk Activity view, partitions highlighted in red are misaligned. Review the Disk Partition Alignment Best Practices for SQL Server whitepaper by the SQLCAT team for more details.

Network
Just as with the relational engine, network is the component that probably offers the least visibility. The problem is just as likely to be outside your server. The network itself can be slow, or the bottleneck may be on the client end. That said, there are some metrics to identify if the problem is a local network issue.

With any windows server take a look at:

· Network Interface: Bytes Received/sec
· Network Interface: Bytes Sent/sec
· Network Interface: Output Queue Length

This will give you an idea of the traffic on the NIC(s), and let you know if there is a backup in output.
For SSAS we can at least identify what kind of traffic we are sending through the pipe:

· Processing: Rows read/sec – tells the rate of rows read from all relational DB’s.
· Storage Engine Query: Rows sent/sec – tell the rate of rows sent from the server to clients.

This should give you a better visibility into how your network cards are performing and how much of that is related to Analysis Services activity.

So what now?
Hopefully by now, if you’ve read the entire series, you have a much better understanding of how SSAS works under the covers, and how to identify an SSAS performance bottleneck. At this point you should be able to make good use of the various whitepapers and blog postings that I’ve referenced throughout. If you’re already a SQL Sentry Performance Advisor for Analysis Services user, you should find this has served to jump start your use of the product to quickly interpret what is being provided and optimize your SSAS performance. If you haven’t yet tried the product, what are you waiting for? Download an evaluation license of the entire BI Suite and let me know what you think.

Monday, January 23, 2012

Part 4: An Introduction to SSAS Performance and SQL Sentry Performance Advisor for Analysis Services

In my last post I talked about how to detect performance issues related to unoptimized MDX and different strategies based on the type of issue. A big part of ensuring that you are as efficient as possible with your troubleshooting efforts is to first identify if the bottleneck is in the Formula Engine (FE) or Storage Engine (SE). Technically unoptimized MDX itself is likely to manifest as an FE bottleneck where an SE bottleneck really resides more with the underlying cube architecture. That said, all of this really ties back to the first post and the anatomy of an MDX query.

image_thumb2

In part 1, I walked through the anatomy of an MDX query and described what role the FE and SE play in the handling of a request. When troubleshooting slow MDX performance, it’s not likely that the bottleneck is entirely with the FE or entirely with the SE. So how do you tell where the biggest bottleneck is, so you can ensure you are getting the most out of your tuning efforts?

Where is SSAS spending most of its time?

To put it simply, if SSAS is spending most of its time in the FE while handling a request, that’s where you should start your troubleshooting; likewise for the SE. But how can you accurately measure something like that? The most accurate way using native tools is a trace. There are certain trace events in SSAS that map closely to the activity we covered in the “anatomy of MDX” discussion. I’ll break them down here:

SSAS Trace Events

Command Begin/End – This signifies the beginning/end of an XMLA command. This is often associated with processing activity.

Query Begin/End - This encapsulates MDX, or other queries like DMX, SQL, etc. from request submission to results returned.

Query Subcube/Verbose – The FE requests data from the SE

Get Data From Cache – There was a cache hit to either the SE or FE caches. The event subclass will tell you which.

Progress Report Begin/End – This indicates SE file system activity either for processing or querying. The event subclass will tell you which.

Get Data From Aggregation – This indicates the SE made use of an aggregation to return data from the file system.

There are other events and additional detail that is provided within the trace, but this covers the primary events we’ve discussed. Note there can be many of these events, excluding the command or query begin and end, for a single request and it can get a bit complicated fairly quickly, especially if there are multiple requests hitting the server at the same time. However, if you are troubleshooting a particular query and run a trace, you do have the information to add up the duration for each event and come up with an idea of where most of the time is being spent.

Fortunately, SQL Sentry PA for SSAS does all of this for you! You may have noticed a rather unique chart on the PA for SSAS dashboard.

image

This chart monitors all of this activity for you continuously at a fraction of the overhead a Profiler trace tends to impart. It summarizes this information in real time and retains history so you can quickly and easily see where most of the time is being spent. It even breaks it down into more detail for you. Serialization is handled by the FE and is related to Non Empty behavior as discussed in an earlier post. Processing and SQL queries are handled by the SE. By the way, SQL queries are likely due to data reads for processing, but can also be seen if you are using ROLAP partitions.

For a further breakdown of this activity on a query by query basis in PA for SSAS, refer to the Top Queries tab. Not only will you see if your query experienced more FE or SE time, but you can expand the view for a complete breakdown by Measure Group, Partition, Aggregation, and Dimension. More information on this functionality is available on the website.

image

In the next post I’ll dig into troubleshooting server resource bottlenecks specific to SSAS. We’ll take a closer look at the hardware side of things and discuss some configuration options to ensure your server is properly tuned.

Tuesday, January 17, 2012

Part 3: An Introduction to SSAS Performance and SQL Sentry Performance Advisor for Analysis Services

In my last post I discussed SSAS processing. I showed you some counters that can help identify when processing is taking place, and at what rate. I then gave some suggestions as to how you might be able to optimize your processing strategies in order to improve overall performance on your server. In this post I’m going to talk about unoptimized MDX code as the source of your performance issues.

Unoptimized Code

There are quite a few resources available on how to write MDX, so I won’t be going into that here. The point I want to make is that there are ways you can identify queries running on your SSAS instance that may benefit from optimization, even without actually knowing MDX. In this section I’ll mention a handful of key performance counters that will allow you to identify issues related to query optimization. You can then take this knowledge to the developers of the queries with useful feedback as to how to improve performance.

Monitoring MDX

image

Specific to MDX, there are a few performance counters that can provide a lot of insight. I’ll actually group them into three general issues. The first group are:

· MDX: Total cells calculated

· MDX: Number of calculation covers

· MDX: Total Sonar subcubes

A high number for any of these metrics while a query is executing suggests the query is using cell by cell calculations instead of a block-oriented, also known as “subspace”, evaluation. So what does that mean? To stick with our SQL example from before, you can relate cell by cell calculations to using cursors in SQL. This is often an inefficient and resource intensive way of processing data. Maybe you’ve heard of the term ”RBAR” in SQL development, which stands for “Row By Agonizing Row”. Well you could call this CBAC, or “Cell By Agonizing Cell”.

The MDX performance whitepaper I mentioned in the first post in this series goes into great detail about this. There have been many enhancements to block-oriented evaluation mode since then and updated details can be found in Performance Improvements for MDX in SQL Server 2008 Analysis Services.

Another counter that can indicate serious performance issues with MDX:

· MDX: Total recomputes

This indicates errors in calculations during query execution. A non-zero value here can indicate issues where unnecessary recalculations are taking place and can even lead to a fall back to cell by cell mode in an effort to eliminate the error.

Jeffrey Wang has a great blog post that walks though some examples of where recomputes can have a dramatic impact on performance.

One more set of counters that I think gives us a good look into MDX performance:

· MDX: Total NON EMPTY unoptimized

· MDX: Total NON EMPTY for calculated members

In a nutshell, SSAS data tends to be sparsely populated data. Take data on sales orders for an example. A typical sales order is not likely to include at least one of every item available for purchase. There are likely to be a lot of empty cells for items not purchased for any given order.

You’ll remember earlier I mentioned the performance impact of cell by cell calculations. When you have sparse data, specifying a NON EMPTY algorithm can also speed up performance, but there is actually more than one code path for this algorithm. A slower code path can actually cause performance degradation. This behavior has been improved in SSAS 2008, but can still occur. The above counters can help you identify when that’s the case. More details can be found in the NON_EMPTY_BEHAVIOR section of the Analysis Services 2008 R2 Performance Guide.

Improving MDX Performance

So to wrap up this edition of the series, what are some primary strategies for optimizing MDX performance? First, proper aggregations and partitions can make a big difference, especially if the bottleneck is in the storage engine. There is tons of information on aggregation strategy that I won’t repeat here, but to reuse our analogy, aggregation strategy is similar to SQL index strategy. There is a sweet spot. Too many aggregations can be just as bad as too few. Aggregations provide pre-calculated data that can improve query performance. However, aggregations must be maintained and too many will increase the resource requirements during processing as well as storage requirements for all this extra data.

If the bottleneck appears to be in the formula engine, watch for cell by cell calculation, and eliminate empty cells and tuples using the recommended strategies in the linked whitepapers mentioned above.

So at this point you might be wondering, aside from the counters we just mentioned, how do we know where the main bottle neck might be, formula engine or storage engine? Where should we focus our performance troubleshooting? I’ll help you further pinpoint those efforts in the next post.

Friday, January 13, 2012

Part 2: An Introduction to SSAS Performance and SQL Sentry Performance Advisor for Analysis Services

In the last post, I spoke about the fact that SQL Sentry Performance Advisor for Analysis Services (PA for SSAS) “lifts the veil” from the black box known as SQL Server Analysis Services. But, there is little information available that helps someone new to the technology easily interpret all this new information in order to optimize the performance of their servers. In an effort to provide some insight into this new source of information, I wrote about the SSAS engine and how it handles incoming requests. In part 2 of this series, I want to dig deeper into the common sources of bottlenecks and what information is available to troubleshoot these issues.

What can impact MDX performance?

For the purposes of this post, I’ll break down into six categories what can impact MDX performance on your SSAS server.

Those categories are:

· Processing

· Unoptimized code

· CPU

· Memory

· Disk

· Network

I’ll tackle each one by giving an explanation and refer to specific metrics, primarily performance counters, that can be used to analyze the given issue.

Processing

I mentioned in the last post that processing involves the creation and updating of cubes with new data and/or recalculating and reorganizing existing data. I related it to maintenance on SQL Server and many of the same factors you consider when you implement maintenance in SQL Server apply to processing in SSAS.

Processing primarily involves the Storage Engine (SE), which you’ll remember is multi-threaded, and can obviously be a resource intensive operation. Since processing may involve new data or recalculation of existing data, cached data will become outdated and will be flushed. This then means that the next time a query runs that requires that data, the SE will have to query the file system to retrieve needed data from the Windows cache or the disk system to replace the flushed data.  This will result in a performance hit to the query on its first run after processing. Querying while processing is occurring on the same cube can lead to blocks. Querying any cube while processing is occurring can lead to general resource contention on the server.

Monitoring Processing

image

So how can we tell when processing is occurring on the server? While there are many performance metrics that have varying levels of usefulness, I’ll mention a handful for each topic that will give you better insight for that topic. Don’t infer that if I don’t mention it, it’s not useful or vice versa. The point of this series is to give you a better understanding of what’s going on in SSAS and how to make better use of the information tools like SQL Sentry’s PA for SSAS provide.

· Processing: Rows written/sec tells you the rate of rows written during processing.

· Proc Aggregations: Rows created/sec more specifically tells you the rate aggregation rows are created during processing.

· Proc Indexes: Rows/sec tells the rate of rows from MOLAP files used to create indexes.

So, a non-zero value for these counters confirms that processing is taking place. These metrics can also help determine the effectiveness of any processing tuning efforts.  If your tuning efforts are effective, you should see an increase in these values accordingly.

Improving Processing Related Performance

So how do you improve performance related to processing? I mentioned earlier that you can relate SSAS processing with maintenance in SQL Server. Many of the same techniques applied in SQL Server maintenance apply here as well.

· Scheduling – If possible, schedule processing activities during time of least querying activity on the server to avoid blocks and resource contention.

· Processing Type – Use the most appropriate processing commands. Just as there are different types of backup strategies in SQL Server such as Full, Log, Differential, etc. the same applies in SSAS processing. For example, use ProcessData and ProcessIndex where appropriate instead of ProcessFull.

· Partitioning – If you’re working with Enterprise Edition, be sure to take advantage of the ability to break Measure Group data into partitions. Not every partition may need to be processed every time. Partitioning can serve to reduce the overall amount of work that is done. Remember that the SE is multi-threaded, so architect your data to take advantage of parallel processing capabilities.

For more information see the Analysis Services Processing Best Practices article in TechNet.

We’ve really just scratched the surface with the series.  In the next installment I’ll discuss identifying and understanding issues related to unoptimized MDX, and from there we’ll finish with server resource bottlenecks.  Stay tuned!

Tuesday, January 10, 2012

Part 1: An Introduction to SSAS Performance and SQL Sentry Performance Advisor for Analysis Services

Over the past couple years, since SQL Sentry introduced Performance Advisor for Analysis Services, I’ve heard many people say they see SSAS as a “black box” on their servers with no real visibility into troubleshooting performance issues. Performance Advisor (PA) lifts the shroud from Analysis Services, but for some it may not be obvious how to interpret all the information gained from this new visibility.

There are many resources out there that cover OLAP architecture and MDX, but much less in the way of an explanation to all the various metrics that are available and how they correlate with each other. In response, I developed a presentation I’ve given several times now at SQL Saturdays that I believe serves to “bridge the gap” for DBA’s that may be new to SSAS, or possibly even developers who lack a dedicated administrator and need to be able to tell where the bottleneck might be on their SSAS server.  I was also privileged to be invited to an interview on RunAs Radio where I spent some time discussing this with Richard Campbell and Greg Hughes.

Finally I’ve had the time to put much of this same advice down here. This is the first part of a couple posts I’ll dedicate to the topic. Anyone working with SSAS should find this useful, but users of PA for SSAS should find themselves even more proficient in troubleshooting SSAS performance issues. We’ll start with a very basic introduction to the service itself and how a query is handled, and move into specific metrics and identifying bottlenecks shortly.

Basic Architecture

It’s important to understand that SSAS is its own engine. The executable is msmdsrv.exe. It’s licensed with SQL Server, but in large production environments you aren’t likely to be running it on the same machine as a SQL Server. There are two primary types of activity that occur here; querying and processing.

Types of Activity

Querying can be done by various methods, but the most common type of queries are Multidimensional Expressions (MDX). This is just a standard language, similar to SQL, optimized for calculations. I won’t go into the how’s and why’s of MDX itself as there are many resources already available on its mastery. While this series won’t tell you how to write MDX, by the time we’re done you should be able to tell, with a few basic pieces of information described later, whether there is room for query optimization or if the bottleneck lies elsewhere.

Processing is essentially the maintenance activity of your cubes. It’s how cubes are created and updated with new data and/or how existing data is recalculated and reorganized. Just as with SQL Server, what level of processing you perform and when is essential to ensuring proper performance without interfering with peak querying times. We’ll dive into those options later.

The Formula Engine

Under the covers there are two primary “engines” within SSAS, the Formula Engine (FE) and the Storage Engine(SE). The FE accepts MDX requests and parses the query to process. It will send requests to the SE when data is needed. It then performs calculations on the retrieved data to provide a final result set. An interesting thing about the FE is that it is single threaded. If you are watching the CPU’s of your multi-core server with SSAS while a request is being handled and you only see one core spike, it’s probably because the FE was doing the work at that time.

The Storage Engine

The SE on the other hand is multi-threaded. The SE reads and writes data to and from the file system for the server. This can include working against the Windows file cache. This is an important distinction between Analysis Services and the relational engine in SQL Server. SQL Server’s relational engine has an internal cache, and then goes to disk for any other needed IO. It does not utilize the Windows file cache for database files. SSAS, on the other hand, does. This means that when SSAS is interacting with the file system on your server, it does not necessarily mean this will result in physical disk IO. SSAS utilizes its own caches as well as the file system cache to get what it needs.

SSAS Caches

So how does SSAS cache data? Each engine within SSAS has its own set of caches. The FE caches store flat values (the Flat Cache) as well as calculated data (the Calculation Cache). It’s important to understand that these caches are often scoped, which means the data might only be available for a limited time or only for a particular user, even in cache. The flat cache in particular is in fact restricted to a max of 10% of the TotalMemoryLimit property in SSAS, which we will discuss later.
So, that said, when we talk about SSAS cache optimization, warming, etc., we are more often dealing with the SE caches. I know that’s a bit of a simplification, but this is an introduction after all. ;)
So the SE consists of a Dimension Cache and a Measure Group Cache to store the data retrieved from the corresponding sources in the file system itself.

Anatomy of an MDX Query

image
So now that we understand the basic components under the covers in SSAS, how are they used to handle a request? First, the query is accepted and the FE parses it. If the appropriate data is already available in the proper scope in the FE cache, some or all may come from there. The FE then requests any other needed data from the SE. The SE then retrieves whatever it can from its caches. If still more data is needed, the SE queries the file system on the server. The data from the file system may, at least in part, come from the Windows file cache. Remember we mentioned this distinction earlier. Anything else results in physical disk IO. These steps may be iterated as needed while data is moved from SE caches to FE caches, etc., but you get the picture. Another graphical representation of this process can be found in Appendix A of the Microsoft Best Practices Article “Identifying and Resolving MDX Query Performance Bottlenecks in SQL server 2005 Analysis Services”, authored by Carl Rabeler and Eric Jacobsen. I highly recommend reviewing this article for a very in-depth look into this topic. Even though the title mentions SQL Server 2005, it’s still very relevant and useful.

What can impact MDX performance?

So now that we have a good understanding of how SSAS handles MDX requests, the question we need to answer is, “What can impact MDX performance, and how can we tell”? That is the question we will answer in the next post.

Monday, April 25, 2011

Wait Stats and The Rest of the Story

In my interactions with customers and evaluators of SQL Sentry, I occasionally come across DBA's who have a disproportionate focus on SQL Server wait stats as their primary source for performance analysis.

I understand the attractiveness of wait stats as a way to monitor and improve query performance.  However, I think it is very important to note that while wait stats are a great place to start when looking for performance bottlenecks, very seldom do they provide a complete picture.  This is why our intelligent wait stats analysis is part of, but not the foundation of, our product.  There are many wait stats that have no relevance to performance at all and can tend to clutter the picture.  Others aren’t obvious as to what specific resource or issue they are related, or can even be related to multiple issues at the same time.

“What am I waiting on?” vs. “Why am I waiting?”

Wait stats tend to suggest what a query or queries are waiting for, but not why they are waiting.  In order to get the complete picture, wait stats need to be correlated with other metrics.  This means that ultimately, methods or tools that focus almost exclusively on wait stats routinely require the use of additional tools and manual effort to get to the root cause of an issue and resolve it.  This tends to become a rather inefficient means of performance tuning.

A common example we’ve seen in several environments involves a server showing high disk IO waits.  If you were to only look at wait stats, seeing frequent high disk IO waits would most likely lead you to suspect a bottleneck with the disk system on the server.  Unfortunately, it is quite possible that simply looking at these waits could lead you down the wrong path.  This is because, commonly, the root cause of the performance bottleneck on the server was not the disk system at all, but memory pressure.

Taking a look at the bigger picture and examining other metrics such as Page Life Expectancy quickly revealed that memory pressure was resulting in excessive paging and thrashing of the disk system, which in turn resulted in high Disk IO waits.  Using SQL Sentry, we could easily correlate a spike in Disk IO waits with a drop in page life expectancy, cache hit ratios, and increased read or write paging activity.  In a case like that, the real bottleneck was occurring in memory and the disk performance issues seen in the wait stats were merely a symptom of the issue rather than the cause.

High parallelism waits are another common issue that often is not as simple to diagnose and resolve as it may first seem.  In fact, CXPACKET waits are often the highest wait type on many systems.  Paul Randal (blog|twitter) conducted a survey to this effect recently on his blog at http://www.sqlskills.com/BLOGS/PAUL/post/Survey-what-is-the-highest-wait-on-your-system.aspx

First you need to know that CXPACKET waits refer to parallelism.  This is only one of many wait stats that aren’t readily obvious what they refer to by name alone.  Next, parallelism waits are not always so directly attributed to CPU bottlenecks as you might think.  Simply setting your MAXDOP may or may not improve the performance of the query running at the time of the wait stats collection, but that may still only be treating a symptom instead of the root cause.  I’ve seen that other resource pressure like disk or memory can actually lead to increased parallelism waits too.  Again, without a complete picture of your server’s performance and activity like SQL Sentry provides, you have no choice but to resort to additional tools and manual effort to track down and resolve the issue effectively.

Query Level Optimization

Common root causes of parallelism waits can be missing indexes, outdated statistics and generally inefficient queries.  Examining wait stats at the query level may tell you what the query was waiting on, but it still doesn’t tell you why.  I have found that execution plan analysis (one of the new features in SQL Sentry Version 6) is a much more effective means of investigating query performance than query-level wait statistics.  Our Plan Analysis features not only collect execution plans for long-running and poorly-performing queries, but also provide greatly expanded visibility into the plans themselves.  Since complex query plans can become extremely difficult to diagnose, having an effective tool to analyze them can be as important as having the plan in the first place! 

With SQL Sentry, I can use the combination of performance counter data, wait stats, and top SQL collection to identify my worst performing queries and resource bottlenecks, then dig deeper using Plan Analysis to resolve those issues as discussed in Product Manager Greg Gonzalez’s (blog|twitter) blog post at http://greg.blogs.sqlsentry.net/2011/02/sql-sentry-v6-plan-analysis-features.html.  All of this from one tool!

Hopefully this has served to shed some light on the pros and cons of wait stats analysis, and the thinking behind many of the features SQL Sentry provides.  Ultimately, everything we do is designed to make all our lives easier as SQL Server DBA's and developers.