Primer on Monitoring 2.0 + CPU Ready

We’ve certainly come a long way with performance monitoring in the last few years.  New technologies like intelligent baselining and predictive analytics are getting more accurate, but they are not even 50% of the battle.  As you read through blogs and look at all of the tools out there, most still stop at “Metric X” or “Property Y” is unhealthy for Z reason.  What people need is something that surfaces the real importance of the problem, the context of why it is a problem, and what they should do to fix it.

Take CPU %Ready.  By itself, it can tell us only a little about what is actually happening to a VM at a given point.  If %Ready is over X% or Y ms, then Alert!  Why is it high?  You have to hunt and peck for the reasons, or at best the alert points you in a direction.  If you are monitoring the systems, you HAVE the data.  Use that data and make it more valuable with additional context.  Below is my basic logic for where we need to start going in performance monitoring for %Ready.

  • Problem – CPU %Ready is High
    • Does the VM have a limit?  Does it need to have a limit?
      • If not, recommend removing the limit
    • Does the VM use all of its resources?  Is it running at 10% or 100%, both over a long period of time and right now?
      • Based on utilization over time vs. allocation, recommend reducing the vCPU count allocated to the VM
    • Is it even this VM?
      • A lot of the time it is another VM or set of VMs on the same host that has too many vCPUs or high utilization, and that in turn affects this VM.
      • Look for other misbehaving VMs and show those.
    • vCPU to Core ratio?
      • If the ratio is high, look for another host with a lower ratio and recommend moving the VM there.
    • What’s waiting?
      • Let’s look inside the guest, see what is requesting those resources, and show the top N processes.  They want the resource, so it would be nice to see them.
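
The checks above can be sketched in a few lines of Python.  This is only a rough illustration, not a finished tool: the `Vm` and `Host` classes, the `diagnose_ready` function and every threshold (5% ready, a 4:1 ratio, 90% busy, 20% idle) are assumptions made up for the example.  The one piece that is standard is the conversion from a CPU Ready summation in milliseconds to a percentage, shown here against the 20-second real-time sample interval and divided by the vCPU count to get a per-vCPU figure.  The in-guest “What’s waiting?” step is left out because it needs data from inside the guest.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU Ready summation in ms into a per-vCPU percentage."""
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


@dataclass
class Vm:
    name: str
    vcpus: int
    ready_ms: float                       # Ready summation over one sample interval
    avg_util_pct: float                   # CPU utilization averaged over a long window
    cpu_limit_mhz: Optional[int] = None   # None means no limit is set


@dataclass
class Host:
    name: str
    cores: int
    vms: List[Vm] = field(default_factory=list)

    def vcpu_to_core_ratio(self) -> float:
        return sum(vm.vcpus for vm in self.vms) / self.cores


def diagnose_ready(vm: Vm, host: Host, ready_pct: float = 5.0, max_ratio: float = 4.0,
                   busy_pct: float = 90.0, idle_pct: float = 20.0) -> List[str]:
    """Return every contributing factor found; all thresholds are placeholders."""
    pct = ready_percent(vm.ready_ms, vcpus=vm.vcpus)
    if pct < ready_pct:
        return []  # %Ready is fine, nothing to explain
    findings = [f"{vm.name} %Ready is {pct:.1f}%"]
    # Does the VM have a limit?
    if vm.cpu_limit_mhz is not None:
        findings.append(f"{vm.name} has a CPU limit of {vm.cpu_limit_mhz} MHz")
    # Does the VM use what it was given?
    if vm.avg_util_pct < idle_pct and vm.vcpus > 1:
        findings.append(f"{vm.name} averages {vm.avg_util_pct:.0f}% across {vm.vcpus} vCPUs")
    # Is it even this VM? Look for busy neighbors on the same host.
    for other in host.vms:
        if other.name != vm.name and other.avg_util_pct >= busy_pct:
            findings.append(f"{other.name} ({other.vcpus} vCPUs) is at {other.avg_util_pct:.0f}%")
    # vCPU to core ratio
    ratio = host.vcpu_to_core_ratio()
    if ratio > max_ratio:
        findings.append(f"{host.name} vCPU to core ratio is {ratio:.1f}:1")
    return findings


# Made-up sample data, loosely following the VM X / VM Y / VM Z example.
vm_x = Vm("VM X", vcpus=2, ready_ms=4000, avg_util_pct=60, cpu_limit_mhz=2000)
host = Host("Host A", cores=4, vms=[vm_x, Vm("VM Y", 16, 500, 95), Vm("VM Z", 8, 300, 100)])
findings = diagnose_ready(vm_x, host)
for line in findings:
    print(line)
```

Returning a list of findings, rather than stopping at the first match, is deliberate: the point is to report every factor that applies at once.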

So with all of those things, it is an AND discussion.  If a VM has a limit and it is on a host with a high vCPU to core ratio AND there are other misbehaving VMs, we need to surface the importance of those added data points.

What is the answer?  Show all of that in an alarm and then make it actionable (or even let the admin kick off a workflow right from there).

Problem – VM X has high %Ready because it has a CPU limit of X MHz.  VM Y (with 16 vCPUs and 95% utilization) and VM Z (8 vCPUs and 100% utilization), combined with a vCPU to core ratio of 8.4:1, are impacting your %Ready on VM X.

Solution – If possible, remove the CPU limit on VM X.  Additionally, move VM X to Host X.  VM Y and VM Z are actively using most of the resources, and the vCPU to core ratio is high (8.4:1) compared with another host in the cluster.
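
As a rough sketch of how an alarm like this could be assembled from data the monitoring system already has, the Python below builds the Problem and Solution text from a small record.  Everything here is hypothetical: the `ReadyAlarm` and `Neighbor` classes, the `render_alarm` function, the 4:1 ratio cutoff and the 2000 MHz limit in the sample are placeholders for illustration, not values from any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Neighbor:
    name: str
    vcpus: int
    util_pct: float


@dataclass
class ReadyAlarm:
    vm: str
    cpu_limit_mhz: Optional[int]     # None means no limit is set
    neighbors: List[Neighbor]        # busy VMs sharing the host
    ratio: float                     # vCPU to core ratio on the current host
    target_host: Optional[str]       # a host with a lower ratio, if one was found


def render_alarm(alarm: ReadyAlarm, max_ratio: float = 4.0) -> str:
    """Build a two-line Problem/Solution message from the collected context."""
    causes: List[str] = []
    actions: List[str] = []
    if alarm.cpu_limit_mhz is not None:
        causes.append(f"it has a CPU limit of {alarm.cpu_limit_mhz} MHz")
        actions.append(f"If possible, remove the CPU limit on {alarm.vm}.")
    if alarm.neighbors:
        listed = " and ".join(
            f"{n.name} ({n.vcpus} vCPUs, {n.util_pct:.0f}% utilization)"
            for n in alarm.neighbors
        )
        causes.append(f"{listed} are competing for the same host")
    if alarm.ratio > max_ratio:
        causes.append(f"the vCPU to core ratio is {alarm.ratio:.1f}:1")
        if alarm.target_host:
            actions.append(f"Move {alarm.vm} to {alarm.target_host}.")
    if not causes:
        causes.append("of a cause this sketch does not check for")
    if not actions:
        actions.append("No automatic recommendation; investigate manually.")
    problem = f"Problem: {alarm.vm} has high %Ready because " + "; ".join(causes) + "."
    solution = "Solution: " + " ".join(actions)
    return problem + "\n" + solution


# Sample values mirror the example above; the 2000 MHz limit is invented.
alarm = ReadyAlarm(
    vm="VM X",
    cpu_limit_mhz=2000,
    neighbors=[Neighbor("VM Y", 16, 95), Neighbor("VM Z", 8, 100)],
    ratio=8.4,
    target_host="Host X",
)
message = render_alarm(alarm)
print(message)
```

Keeping the causes and the actions in separate lists means each check adds its own sentence, so the message grows only with the factors that actually apply.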

Simple, concise, looks across the environment to surface what is important, why it is important and what we should do.

Performance Monitoring 2.0 – Let’s go!

I’ve been trying to get back into blogging, and it took a great set of bloggers at VMware PEX this year to finally kick it off.  Their advice?  Stick to what you know, build a little bit at a time and be informative.  So here we go.

I’ve been lucky enough to be in virtualization for about 10 years now, ESX 1.5 days for those keeping track, and before virtualization I was an IT administrator like most of you out there.  We did performance monitoring for our company with traditional tools like BMC, MOM/SCOM, perfmon/top and generic scripts to get what we needed.

The funny thing is, 10 years later there are 100s of new tools, yet we still monitor performance mostly in the same way.  We present a layer of data to the user, and they are left to interpret that data and then perform an action.  Even today, with all our technology, performance monitoring is behind.  Why is 5% CPU ready bad?  What does it really mean and what should I do about it?  Do I even care about the 1000s of VMs that are healthy?

It’s long overdue that we change the way system administrators monitor performance.  My hope is to share with you my thought process on specific issues: surface only the important data, back that up with WHY it is bad, actually SHOW you the things that cause the pain, and later on present PowerShell or other scripts that perform all of this.