One of the hot topics in the performance analysis and monitoring space lately has been behavioral analytics, also called learning algorithms or baselining. The concept itself is quite simple: look for patterns in data sets, and as the data set grows, the algorithms get closer and closer to predicting behavior over time or to correlating behavior across multiple metrics and data points.
Take internet usage at your company, for example. The heaviest internet traffic tends to occur when people arrive and log into the network in the morning, around lunch time, and at the end of the day. This sort of repeatable pattern can be analyzed and, over time, become a baseline for the expected behavior of the internet usage metric(s). That way you don't have to understand the behavior yourself and set the more traditional threshold of "Internet usage over X is abnormal, so alert on it." This is an oversimplified version of what happens, but it gets the point across.
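To make that concrete, here is a minimal sketch of what per-hour baselining might look like. The three-sigma rule and the traffic numbers are placeholder assumptions; real tools use far more sophisticated models.

```python
from collections import defaultdict
import statistics

def build_hourly_baseline(samples):
    """Group (hour_of_day, value) samples and learn a mean/stdev per hour."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {
        hour: (statistics.mean(vals), statistics.pstdev(vals))
        for hour, vals in by_hour.items()
    }

def is_anomalous(baseline, hour, value, sigmas=3.0):
    """Flag a reading that deviates more than `sigmas` standard deviations
    from the learned behavior for that hour of the day."""
    mean, stdev = baseline[hour]
    return abs(value - mean) > sigmas * max(stdev, 1e-9)

# Example: internet usage in Mbps, keyed by hour of day (made-up numbers)
history = [(9, 480), (9, 510), (9, 495), (12, 350), (12, 340), (3, 40), (3, 35)]
baseline = build_hourly_baseline(history)
print(is_anomalous(baseline, 3, 400))  # True: heavy traffic at 3 AM is unexpected
print(is_anomalous(baseline, 9, 500))  # False: the morning spike matches the learned pattern
```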
No single person, or team of people, can be expected to truly understand the expected behavior of all of the workloads running, let alone how that behavior changes over time. When you scale up to larger environments, this sort of behavioral analytics becomes crucial to the enterprise. For that reason, large enterprises are increasingly looking for tools that can help them understand when they really have pain versus the noise generated by traditional thresholding methods.
But… there is a downside to learning algorithms that is hard to program around. Suppose that in your current environment you have 100 ms of latency to your storage devices. That is obviously not good, but if you feed it into a learning algorithm, the algorithm will learn that this behavior is normal. You know it is not good, yet the algorithm comes to expect it, and you get the inverse problem: when the metric deviates from that expected behavior, you can get false positives. In that way, a metric like latency is not a good fit for a learning algorithm by itself. The algorithm needs to look at multiple metrics to correlate behavior, or fall back to a traditional threshold-based value where "Latency over Y" is bad.
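One way to hedge against that, sketched below with a made-up 30 ms ceiling, is to pair the learned baseline with the traditional "Latency over Y is bad" guard so the algorithm can never learn that a broken environment is normal.

```python
LATENCY_CEILING_MS = 30.0  # assumed domain knowledge: anything above this is bad, period

def evaluate_latency(baseline_mean_ms, baseline_stdev_ms, current_ms, sigmas=3.0):
    """Combine a hard threshold with the learned baseline.

    The hard ceiling catches environments that were already unhealthy when the
    algorithm started learning; the baseline catches deviations from whatever
    behavior is genuinely normal for this workload."""
    if current_ms > LATENCY_CEILING_MS:
        return "alert: latency exceeds the absolute ceiling"
    if abs(current_ms - baseline_mean_ms) > sigmas * max(baseline_stdev_ms, 1e-9):
        return "warn: latency deviates from learned behavior"
    return "ok"

# A datastore that has always sat at ~100 ms: the learned baseline would call
# 98 ms normal, but the absolute ceiling still flags it.
print(evaluate_latency(baseline_mean_ms=100.0, baseline_stdev_ms=5.0, current_ms=98.0))
```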
Just some food for thought.
Certainly, we've come a long way with performance monitoring in the last few years. New technologies like intelligent baselining and predictive analytics are getting more accurate, but that is not even 50% of the battle. As you read through blogs and look at all of the tools out there, most continue to tell you that "Metric X" or "Property Y" is unhealthy for reason Z. What people want, and need, is something that surfaces what the problem really is, the context of why it is a problem, and what they should do to fix it.
Take CPU %Ready. By itself it can tell us only a little about what is actually happening to a VM at a given point. If %Ready is over X% or Y ms, then alert! Why is it high? You have to go hunt and peck for the reasons, or at best the alert points you in a direction. If you are monitoring the systems, you HAVE the data. Use that data and make it more valuable with additional context. Below is my basic logic for where performance monitoring needs to start going for %Ready.
- Problem – CPU %Ready is high
  - Does the VM have a CPU limit? Does it need to have one?
    - Recommend removing the limit.
  - Does the VM use all of its resources? Is it running at 10% or at 100%, both over a long period of time and right now?
    - Based on utilization over time versus allocation, recommend reducing the vCPU count allocated to the VM.
  - Is it even this VM?
    - A lot of the time it is another VM, or a set of VMs on the same host, with too many vCPUs or high utilization, and that in turn affects this VM.
    - Look for other misbehaving VMs and show those.
  - What is the vCPU-to-core ratio?
    - If the ratio is high, look for another host with a lower ratio and recommend moving the VM there.
  - What's waiting?
    - Look inside the guest, see what is requesting those resources, and show the top N processes. They are the ones that want the resource, so it would be nice to see them.
With all of those things, it is an AND discussion. If a VM has a limit, AND it is on a host with a high vCPU-to-core ratio, AND there are other misbehaving VMs, we need to surface the importance of those added data points together.
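Here is a rough sketch of how those checks could be gathered up and combined into one recommendation. The field names, the 20% utilization cutoff, and the 4:1 ratio rule of thumb are assumptions for illustration, not any product's actual logic.

```python
from dataclasses import dataclass, field

@dataclass
class ReadyTimeFindings:
    """Context gathered when a VM's CPU %Ready crosses a threshold."""
    vm: str
    has_cpu_limit: bool
    avg_cpu_util_pct: float
    vcpu_core_ratio: float
    noisy_neighbors: list = field(default_factory=list)  # (name, vcpus, util%)

def recommend(f: ReadyTimeFindings) -> list:
    """Turn the individual checks into a combined, actionable set of recommendations."""
    actions = []
    if f.has_cpu_limit:
        actions.append(f"Remove the CPU limit on {f.vm}.")
    if f.avg_cpu_util_pct < 20:  # assumed cutoff for an oversized VM
        actions.append(f"{f.vm} rarely uses its vCPUs; consider reducing its vCPU count.")
    if f.noisy_neighbors:
        names = ", ".join(f"{n} ({v} vCPUs @ {u}%)" for n, v, u in f.noisy_neighbors)
        actions.append(f"Neighboring VMs are driving contention: {names}.")
    if f.vcpu_core_ratio > 4.0:  # assumed rule of thumb, tune for your cluster
        actions.append(f"vCPU-to-core ratio is {f.vcpu_core_ratio}:1; migrate {f.vm} to a host with a lower ratio.")
    return actions

findings = ReadyTimeFindings(
    vm="VM X", has_cpu_limit=True, avg_cpu_util_pct=12.0, vcpu_core_ratio=8.4,
    noisy_neighbors=[("VM Y", 16, 95), ("VM Z", 8, 100)],
)
print("\n".join(recommend(findings)))
```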
What is the answer? Show all of that in an alarm and then make it actionable (or even let people kick off a workflow right from there).
Problem – VM X has high %Ready because it has a CPU limit set of X MHz. VM Y (16 vCPUs at 95% utilization) and VM Z (8 vCPUs at 100% utilization), combined with a vCPU-to-core ratio of 8.4:1, are impacting %Ready on VM X.
Solution – If possible, remove the CPU limit on VM X. Additionally, move VM X to Host X: VM Y and VM Z are actively using most of the resources, and the vCPU-to-core ratio (8.4:1) is high compared to another host in the cluster.
Simple, concise, and it looks across the environment to surface what is important, why it is important, and what we should do about it.