Primer on Monitoring 2.0 + CPU Ready

We’ve certainly come a long way with performance monitoring in the last few years.  New technologies like intelligent baselining and predictive analytics are getting more accurate, but they are not even 50% of the battle.  As you read through blogs and look at all of the tools out there, most still stop at “Metric X” or “Property Y” is unhealthy for Z reason.  What people need is something that can surface how important the problem really is, the context of why it is a problem, and what they should do to fix it.

Take CPU %Ready.  By itself it can tell us only a little about what is actually happening to a VM at a given point.  If %Ready is over X% or Y ms, then Alert!  But why is it high?  You have to hunt and peck for the reasons, or at best the alert points you in a direction.  If you are monitoring the systems, you HAVE the data.  Use that data and make it more valuable with additional context.  Below is my basic logic for where performance monitoring needs to go for %Ready.
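Before getting to that logic, a quick aside on the "X% or Y ms" threshold: the two forms describe the same measurement, so it helps to normalize them first. Here is a minimal sketch of the usual conversion, assuming the ready time is reported in milliseconds per sample interval. The function name, the 20 second default (the vCenter real-time sample interval), and the per-vCPU averaging are my assumptions for illustration, not something a specific tool requires.

```python
def ready_ms_to_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s: length of the sample interval in seconds (assumed 20 s default).
    vcpus: divide by the vCPU count to get an average per vCPU (assumption:
           the summation passed in is the VM-level total across all vCPUs).
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    interval_ms = interval_s * 1000.0
    return ready_ms * 100.0 / (interval_ms * vcpus)


# 1000 ms of ready time in a 20 s interval on a single vCPU
print(ready_ms_to_percent(1000))            # 5.0
# The same total spread across 4 vCPUs averages out lower per vCPU
print(ready_ms_to_percent(1000, vcpus=4))   # 1.25
```

Either way, the number only says that the VM waited. It says nothing about why, which is the point of the logic that follows.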

  • Problem – CPU %Ready is High
    • Does the VM have a limit?  Does it need to have a limit?
      • Recommend removing the limit
    • Does the VM use all of its resources?  Is it running at 10% or 100%, both over a long period of time and right now?
      • If utilization over time is low relative to allocation, recommend reducing the number of vCPUs allocated to the VM
    • Is it even this VM?
      • A lot of the time it is another VM or set of VMs on the same host that have too many vCPUs or high utilization, and that in turn affects this VM.
      • Look for other misbehaving VMs and show those.
    • vCPU to Core ratio?
      • If the ratio is high, look for another host with a lower ratio and recommend moving the VM there.
    • What’s waiting?
      • Let’s look inside the guest, see what is requesting those resources, and show the top N processes.  They want the resource, so it would be nice to see them.
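The checks above can be sketched as one pass that runs every check and keeps every finding rather than stopping at the first hit. This is only an illustration: the `VM` and `Host` classes, the field names, and all of the thresholds are hypothetical stand-ins for whatever your monitoring data actually provides.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical thresholds, chosen only for illustration
LOW_UTIL_PCT = 20.0   # below this (now and long term) the VM looks oversized
BUSY_PCT = 90.0       # a neighbor at or above this is "misbehaving"
BIG_VM_VCPUS = 8      # a neighbor with this many vCPUs or more is suspect
HIGH_RATIO = 4.0      # vCPU to core ratio considered high
TOP_N = 3             # how many guest processes to show


@dataclass
class VM:
    name: str
    vcpus: int
    cpu_now_pct: float                    # utilization right now
    cpu_avg_pct: float                    # utilization over a long window
    cpu_limit_mhz: Optional[int] = None   # None means no limit is set
    top_processes: List[str] = field(default_factory=list)


@dataclass
class Host:
    name: str
    cores: int                            # assumed to be greater than zero
    vms: List[VM] = field(default_factory=list)

    def vcpu_to_core_ratio(self) -> float:
        return sum(vm.vcpus for vm in self.vms) / self.cores


def diagnose_high_ready(vm: VM, host: Host, cluster: List[Host]) -> List[str]:
    """Run every check and collect every finding; the causes can stack."""
    findings: List[str] = []

    # Does the VM have a limit?
    if vm.cpu_limit_mhz is not None:
        findings.append(f"{vm.name} has a CPU limit of {vm.cpu_limit_mhz} MHz; "
                        "remove it if it is not needed")

    # Does the VM use its resources, both over time and right now?
    if vm.vcpus > 1 and max(vm.cpu_now_pct, vm.cpu_avg_pct) < LOW_UTIL_PCT:
        findings.append(f"{vm.name} uses little of its {vm.vcpus} vCPUs; "
                        "consider reducing the vCPU count")

    # Is it even this VM? Look for large or busy neighbors on the same host.
    for other in host.vms:
        if other is not vm and (other.vcpus >= BIG_VM_VCPUS
                                or other.cpu_now_pct >= BUSY_PCT):
            findings.append(f"{other.name} ({other.vcpus} vCPUs, "
                            f"{other.cpu_now_pct:.0f}% utilization) shares {host.name}")

    # vCPU to core ratio: if high, look for a host with a lower ratio.
    ratio = host.vcpu_to_core_ratio()
    if ratio > HIGH_RATIO:
        message = f"vCPU to core ratio on {host.name} is {ratio:.1f}:1"
        others = [h for h in cluster if h is not host]
        if others:
            best = min(others, key=Host.vcpu_to_core_ratio)
            if best.vcpu_to_core_ratio() < ratio:
                message += f"; consider moving {vm.name} to {best.name}"
        findings.append(message)

    # What's waiting? Show the top N processes inside the guest.
    if vm.top_processes:
        findings.append("top guest processes: " + ", ".join(vm.top_processes[:TOP_N]))

    return findings
```

Returning a list rather than the first match is deliberate, since a limit, a crowded host, and noisy neighbors can all apply to the same VM at once.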

So with all of those things, it is an AND discussion.  If a VM has a limit and it is on a host with a high vCPU to core ratio AND there are other misbehaving VMs, we need to surface the importance of those added data points.

What is the answer?  Show all of that in an alarm and then make it actionable (or even let them kick off a workflow right from there).

Problem – VM X has high %Ready because it has a CPU limit of X MHz.  VM Y (with 16 vCPUs and 95% utilization) and VM Z (8 vCPUs and 100% utilization), combined with a vCPU to core ratio of 8.4:1, are impacting your %Ready on VM X.

Solution – If possible, remove the CPU limit on VM X.  Additionally, move VM X to Host X.  VM Y and VM Z are actively using most of the resources, and the vCPU to core ratio is high (8.4:1) vs. another host in the cluster.

Simple, concise, looks across the environment to surface what is important, why it is important and what we should do.
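An alarm like the Problem and Solution text above could be assembled from structured findings, so that every cause carries its own recommended action. A rough sketch follows; the `Finding` class and `render_alarm` function are hypothetical names used only for illustration, and the wording of each cause and action would come from your own checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    cause: str    # feeds the Problem line
    action: str   # feeds the Solution line


def render_alarm(vm_name: str, findings: List[Finding]) -> str:
    """Combine every finding into one Problem / Solution message."""
    if not findings:
        return f"Problem: {vm_name} has high %Ready; no contributing cause was identified."
    causes = " and ".join(f.cause for f in findings)
    actions = " ".join(f.action for f in findings)
    return (f"Problem: {vm_name} has high %Ready because {causes}.\n"
            f"Solution: {actions}")


print(render_alarm("VM X", [
    Finding("it has a CPU limit set", "If possible, remove the CPU limit."),
    Finding("its host has a vCPU to core ratio of 8.4:1",
            "Move VM X to a host with a lower ratio."),
]))
```

Keeping cause and action together in one record means the alarm never reports a problem without also saying what to do about it.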

