Wednesday, February 1, 2012

Part 5: An Introduction to SSAS Performance and SQL Sentry Performance Advisor for Analysis Services

Up to this point in the series we’ve been focusing on common activity in Analysis Services, how to identify the bottleneck for that activity, and how to resolve the bottleneck. In the last post we specifically discussed how to identify if the Storage Engine (SE) or Formula Engine (FE) was the larger bottleneck. Now I want to talk more about identifying bottlenecks with the server’s physical resources themselves.
The four basic areas to investigate when it comes to server resources are:
· CPU
· Memory
· Disk
· Network
A lot of these issues, and methods for identifying them, are common to Windows server performance in general, but I’ll include pertinent SSAS specific performance details and metrics too.

CPU
With regards to any Windows server, there are a handful of counters that provide good indication that the bottleneck with your server may be related to the CPU(s).
· Processor: % Processor Time
· System: Context Switches/sec
· System: Processor Queue Length
There is already tons of information available online with regards to general processor monitoring and optimization so I won’t say much more here. While there is no magic number that indicates what “good” CPU utilization is, you generally want to get the most utilization without overburdening the system, so look for sustained periods of 100% utilization on one or all cores to suggest a bottleneck here.
The other two counters, when abnormally high for sustained periods, usually suggest an issue with too many parallel operations on the server.
So let’s take a look at some Analysis Services performance metrics that shed some light on processor utilization. There are two sets of SSAS performance counters that apply to the formula and storage engines.
· Threads: Query pool
· Threads: Processing pool
Don’t let the names confuse you. The query pool refers to FE activity. The processing pool not only refers to processing, but any SE activity. Each of these categories contains counters for Busy, Idle, Job Rate, and Queue Length. They allow you to see the thread activity for each engine. If you are seeing consistently high queue lengths, but not high CPU utilization you many want to adjust your MaxThreads and/or CoordinatorExecutionMode properties for your SSAS instance. More details on these settings are covered in section 6.11 of the SSAS 2008 Performance Guide, as well as tip 8 of the SQLCAT Analysis Services Query Performance Top 10 Best Practices.
Remember that the FE is single threaded, so increasing the query pool setting may not improve performance of any one query, but may improve the performance in handling multiple simultaneous requests.

Memory
There are three groups of metrics when it comes to monitoring Analysis Services memory.

· Overall usage – how much total memory is SSAS using on the server.
· Cache hit ratios – how efficient is the use of that memory.
· Cache activity – what is happening to the memory.

SSAS Memory Usage

· Memory: Memory Usage KB
This is the total memory usage for the server process, and should be the same as the Process: Private Bytes counter for msmdsrv.exe.

NOTE: Do NOT rely on Task Manager for an accurate picture of memory usage.

· Memory: Cleaner Memory KB
· Memory: Cleaner Memory shrinkable KB
· Memory: Cleaner Memory nonshrinkable KB
These counters refer to the background cleaner for SSAS. The first counter refers to the amount of memory known to the background cleaner. That memory is then divided into shrinkable and nonshrinkable memory. This describes what portion of that known memory is subject to purging by the cleaner based on memory limits. The cleaner value is likely to be a bit lower than the total usage value, but it’s important to know because this lets you know how much room you have to actually work with when it comes to memory management. The limits the cleaner works with are defined by properties indicated by the following two counters.

· Memory: Memory Limit Low KB
· Memory: Memory Limit High KB
A great explanation of these properties and counters, along with real world examples of their use is covered in Greg Gonzalez’ (b|t) blog post, Analysis Services Memory Limits.

SSAS Cache Hit Ratios
Remember from previous posts that the FE and SE each have caches. There is the Calculation and Flat caches for the FE, and Dimension and Measure Group caches for the SE.
The counters that allow you to determine cache efficiency are all in the Storage Engine Query category.

· Calculation cache lookups/sec, hits/sec
· Flat cache lookups/sec, hits/sec
· Dimension cache lookups/sec, hits/sec
· Measure group cache lookups/sec, hits/sec
While there is no persistent cache hit ratio counter itself for these caches as there is for SQL Server, these metrics will allow you to calculate the ratio for each cache for a given point in time.

SSAS Cache Activity
One last group of counters to consider relate to overall cache activity.

· Cache: Inserts/sec
· Cache: Evictions/sec
· Cache: KB added/sec
· Memory: Cleaner: Memory shrunk KB/sec
These metrics give a pretty direct indicator of memory pressure on the server. If the Evictions/sec and/or Cleaner Memory shrunk KB/sec are consistently non-zero, you likely have memory pressure on the server. The Cleaner: Memory shrunk counter in particular indicates that you are exceeding your defined memory limits described earlier.

How do I improve cache usage?
In addition to improperly configured memory limits as discussed in Greg Gonzalez’ blog referenced earlier, another common Analysis Services memory related performance issue is a cold cache. You’ll remember in Part 2 of this series we discussed cube processing and the fact that data in cache becomes invalidated and flushed after processing occurs. This means that the next time a query is executed, no data will be in cache, resulting in queries to the file system, and a dramatic hit to that query’s performance. A common scenario involves nightly processing that leaves the cache cold in the morning. The first person in the office, which often times is someone high up the management chain, runs an important report that takes forever to run. The point here is that some of the people you least want to experience performance issues may be the ones most likely to see them under this scenario. So what do you do?
The answer is cache warming. There are different ways to warm the cache. The simplest is to run a couple of the most commonly used queries after processing to pull the most likely needed data into cache. A more in depth description of the process is covered in the Identifying and Resolving MDX Query Performance Bottlenecks whitepaper I mentioned in part 1, as well as an outstanding blog post by Chris Webb (b|t) on Building a Better Cache Warmer.

Disk
In order to determine if the disk system is a bottleneck for your SSAS instance, you need to first verify that Analysis Services is indeed accessing the disk system. I mentioned in the first post that the SE accesses the file system. So check the previously mentioned counters to verify the SE is active. A more specific counter is:

· MSAS: Storage Engine Query: Queries from file/sec
Now remember, unlike the relational engine, SSAS data may be in the Windows file cache. This means that even when the above counter is non-zero, it alone does not guarantee physical disk IO. You’ll want to examine the following three counters to get an idea of how much of the SE activity is actually reading from disk as opposed to the Windows file cache.

· MSAS: Storage Engine Query: Data bytes/sec
· Physical Disk: Disk Read Bytes/sec
· Cache: Copy Reads/sec
Be sure to account for activity outside of SSAS when using these metrics. For a much more detailed explanation with great screen shots of PA for SSAS, see Greg Gonzalez’ blog post on the subject.
So once you’ve determined Analysis Services is incurring physical disk IO, what should you check to ensure the disks are performing optimally? There are many different counters available for disk performance and some are more useful than others. For a long time, disk queue length was considered an important metric, but as server storage grew to include more and more spindles, was moved to SAN’s, or included SSD’s, this metric has become less meaningful. A more universal indicator for disk performance is latency.

· Physical Disk: Avg. Disk sec/Read
· Physical Disk: Avg. Disk sec/Write
Optimally these should remain below 10 ms. As you approach 20 to 30 ms or more, you’re going to notice performance issues related to the disk system. This isn’t specific to SSAS, but more of a general server guideline. There is tons of material available online that focus on disk performance.

How do I improve Disk Performance?
Partitions are essential to optimizing disk performance. Partitions based on the way the data is most likely queried, such as by timeframes, will help reduce the amount of data that must be retrieved from disk for a given query. It can also help you take advantage of parallelism when multiple queries are submitted, as well as when processing. Remember to distribute your partitions properly to spread the load across multiple disks.
You may also want to disable Flight Recorder. Without going into the pros and cons of why you do or don’t want Flight Recorder, many articles suggest a performance improvement by disabling it. It is basically a file based trace on your system and will increase IO. It can be disabled in the properties for the SSAS instance.
There are a couple other things worth mentioning not specific to SSAS. First note the location of your cube’s files. If they are sharing spindles with system files, or other busy databases, you’re likely to run into contention. The same thing applies to SAN allocation. This is often harder to investigate without the help of your SAN administrator, but be sure you’re not sharing busy spindles on the SAN either.
Finally, watch for partition misalignment issues that can have a significant impact. In Performance Advisor’s Disk Activity view, partitions highlighted in red are misaligned. Review the Disk Partition Alignment Best Practices for SQL Server whitepaper by the SQLCAT team for more details.

Network
Just as with the relational engine, network is the component that probably offers the least visibility. The problem is just as likely to be outside your server. The network itself can be slow, or the bottleneck may be on the client end. That said, there are some metrics to identify if the problem is a local network issue.
With any windows server take a look at:
· Network Interface: Bytes Received/sec
· Network Interface: Bytes Sent/sec
· Network Interface: Output Queue Length
This will give you an idea of the traffic on the NIC(s), and let you know if there is a backup in output.
For SSAS we can at least identify what kind of traffic we are sending through the pipe:

· Processing: Rows read/sec – tells the rate of rows read from all relational DB’s.
· Storage Engine Query: Rows sent/sec – tell the rate of rows sent from the server to clients.
This should give you a better visibility into how your network cards are performing and how much of that is related to Analysis Services activity.

So what now?
Hopefully by now, if you’ve read the entire series, you have a much better understanding of how SSAS works under the covers, and how to identify an SSAS performance bottleneck. At this point you should be able to make good use of the various whitepapers and blog postings that I’ve referenced throughout. If you’re already a SQL Sentry Performance Advisor for Analysis Services user, you should find this has served to jump start your use of the product to quickly interpret what is being provided and optimize your SSAS performance. If you haven’t yet tried the product, what are you waiting for? Download an evaluation license of the entire BI Suite and let me know what you think.