Community matters

I’ve had the pleasure of working with VMware products for the last 10 years now, and over that time I’ve changed jobs more times than I can easily count. Throughout all of the change, one thing that has remained constant is the wonderful people in the community that VMware has around itself.

I’ve seen my own contributions ebb and flow throughout my time in the community. But just being part of the community makes you part of a special family. I use the word family, because that is what it truly feels like.

I haven’t contributed a lot lately, but I’ve been in Silicon Valley during the last two weeks and it amazes me how, being thousands of miles from home, you can bond with folks over something common like virtualization and become friends. Talking to folks on Twitter, organizing a dinner or attending the SV VMUG yesterday, it was like I had never left Austin and had never stopped blogging/tweeting for a while.

While this is not a technical post, I think we need to take a step back and just enjoy the moments we have with each other and the relationships we form. Individual companies/products may come and go, but this community and the relationships we build will always be here, and for that I want to say to everyone reading this:

Thank you.

Thomas


Behavioral Analytics – Food for thought

One of the hot topics in the performance analysis/monitoring space as of late has been the concept of behavioral analytics, learning algorithms or baselining.  The concept itself is quite simple – look for patterns in data sets, and as the data set gets larger, the algorithms can get closer and closer to predicting the behavior over time or through correlation across multiple metrics/data points.

Take internet usage at your company, for example.  The heaviest times of internet traffic tend to occur when folks arrive and log into the network in the morning, around lunch time and at the end of the day.  This sort of repeatable pattern can be analyzed and over time become a baseline for the expected behavior of the metric(s) for internet usage.  In this way you don’t have to understand the behavior and set the more traditional threshold of something like ‘Internet Usage over X’ is abnormal so alert on it.  This is an oversimplified version of what happens, but it gets the point across.
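To make the idea concrete, here is a minimal sketch of hour-of-day baselining.  It assumes the metric has already been collected as (hour, value) pairs; the function names and the 3x tolerance are mine for illustration, and real products use far more sophisticated models than a mean and standard deviation per hour.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(samples):
    """Learn (mean, stdev) per hour of day from (hour, value) pairs."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # An hour needs at least two samples before a spread can be computed.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_abnormal(baseline, hour, value, tolerance=3.0):
    """True when value is more than `tolerance` stdevs from that hour's mean."""
    if hour not in baseline:
        return False  # nothing learned for this hour yet, so stay quiet
    avg, spread = baseline[hour]
    return abs(value - avg) > tolerance * max(spread, 1e-9)


# Hypothetical traffic samples: busy at 9am, quiet at 3am.
history = [(9, 100), (9, 110), (9, 90), (9, 105), (9, 95),
           (3, 5), (3, 6), (3, 4), (3, 5), (3, 5)]
baseline = build_baseline(history)
print(is_abnormal(baseline, 9, 104))  # normal for 9am
print(is_abnormal(baseline, 3, 104))  # the same value is abnormal at 3am
```

Notice that nobody had to pick a number for ‘Internet Usage over X’; the same value is fine at one hour and suspicious at another.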

No single person, or team of people, can be expected to truly understand the expected behavior of all of the workloads running, let alone how they would change over time.  When you start to scale up to larger environments, this sort of behavioral analytics is crucial to the enterprise.  For that reason large enterprises are increasingly looking for tools that can help them understand when they really have pain vs. the noise of traditional thresholding methods.

But… there is a downside to learning algorithms that is hard to program around.  Suppose that in your current environment you have 100ms of latency to your storage devices.  This is obviously not good, but if you plug it into a learning algorithm, it will learn that this behavior is normal.  You know that it is not good, yet the algorithm comes to expect it, and you get the inverse problem: when latency deviates from that expected behavior, for example by dropping to a healthy level, you could get false positives.  In that way, a metric like latency is not a good fit for a learning algorithm by itself.  The algorithm needs to look at multiple metrics to try to correlate behavior, or use a traditional threshold-based value where ‘Latency over Y’ is bad.
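One way to picture pairing a learned baseline with a traditional threshold is the sketch below.  The 30ms ceiling, the tolerance and the function name are assumptions for illustration only, and treating only upward deviations as alerts is my own simplification for a metric where lower is better.

```python
def latency_alert(value_ms, learned_avg_ms, learned_spread_ms,
                  ceiling_ms=30.0, tolerance=3.0):
    """Return the reasons (possibly none) to alert on a latency sample."""
    reasons = []
    # Traditional gate: bad no matter what the algorithm has learned.
    if value_ms > ceiling_ms:
        reasons.append("above fixed ceiling")
    # Learned gate: only upward deviations count, so an improvement
    # over a bad baseline does not raise a false positive.
    if value_ms - learned_avg_ms > tolerance * learned_spread_ms:
        reasons.append("above learned baseline")
    return reasons


# The environment has "learned" that 100ms is normal.
print(latency_alert(100, learned_avg_ms=100, learned_spread_ms=5))
print(latency_alert(20, learned_avg_ms=100, learned_spread_ms=5))
```

The first call still alerts because of the fixed ceiling even though the baseline thinks 100ms is fine, and the second stays quiet even though 20ms is a large deviation from what was learned.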

Just some food for thought.

Disk Space – What matters?

One of the more common things people ask me in monitoring is how they can accurately know when they are going to run out of disk space.  The common method is to look at used capacity and alert at 80, 90, 95% or something to that effect.  What if the drive is 2TB?  Even at 95% full, that means there is ~102G free.  So would I really want to know that it is low on space?  But would I want to know based only on size?  What about growth rates?  If I have a server that is normally full, but it’s not growing, would I even want to know that its % or size is full?  What action would I take?

So what should you care about?  How can I reliably tell when I am running out of drive space?

The simple answer is growth rate.  If it is growing, when will it reach full?  This has to be looked at long term AND short term in something like a moving average so we don’t get too much noise, but we also want to know if the drive starts filling up quickly.  Therefore a balance has to be struck between two different growth rates (short/long term, imo).
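Here is a rough sketch of balancing a short-term and a long-term growth rate.  It assumes samples arrive as (hours elapsed, GB used) pairs, uses a simple first-to-last slope in place of a proper moving average, and the six-sample short window is an arbitrary choice of mine.

```python
def growth_rate(samples):
    """Average GB/hour between the first and last (hours, used_gb) sample."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (u1 - u0) / (t1 - t0)


def hours_until_full(capacity_gb, samples, short_window=6):
    """Estimate hours until the drive is full, or None if it is not growing."""
    used_now = samples[-1][1]
    long_rate = growth_rate(samples)                    # smooth, low noise
    short_rate = growth_rate(samples[-short_window:])   # catches sudden bursts
    rate = max(long_rate, short_rate)                   # be pessimistic
    if rate <= 0:
        return None
    return (capacity_gb - used_now) / rate


# Steady 1 GB/hour growth for a day on a 200 GB drive.
steady = [(h, 100 + h) for h in range(25)]
print(hours_until_full(200, steady))

# Same drive, but the last two hours grew 10 GB each.
burst = steady + [(25, 134), (26, 144)]
print(hours_until_full(200, burst))
```

Taking the larger of the two rates means a sudden burst shortens the estimate right away, while a quiet hour does not hide a long-term trend.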

Beyond growth rate, you still want to allow for the more traditional gates.  If the host/vm/drive was just built it will go from 0 to 25% or something and that skews data.  So you have to balance the freshness of the data with the traditional 80/90/95% and the MB/GB remaining that you are comfortable with.

What I propose is something like this:

  • Check the age of the object – If it is new, calculate growth, but don’t use it until we have enough datapoints.
  • Calculate a daily growth rate (from many 2-10 min chunks), which then becomes part of the overall weekly growth rate (many hourly chunks).  You can then even take the weekly rate over time and look at that growth rate for longer term trending.
  • If we are established and we have high growth, tell me when I get 4 hours, 2 days and 5 days out.
    • Anything else and it’s noise, why would I care that I’m a month from running out of space?  I want to know prior to the weekend that I could have issues next week (5 days).  I am busy, so tell me that I have a day or two to deal with it (2 days).  Ok it’s still growing and getting close, time to expand the drive, delete stuff etc… (4 hours)
  • If we don’t have a reliable set of growth rate data, fall back to space free and/or percentage based.  Set reasonable gates, 90/95/97% and/or 20G, 5G, 1G.  Clearly this part is more about you knowing your environment, because one size doesn’t fit all.
    • Right – Your D: drive is 97% consumed and only has 300M free.
    • Wrong – Your D: drive is 97% consumed and you have 200G free.  (Really?)
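The proposal above could be sketched roughly like this.  The 7-day "established" cutoff, the pairing of each percentage gate with a free-space gate, and the function name are all assumptions on my part; tune them for your own environment.

```python
def disk_alert(age_days, hours_to_full, free_gb, pct_used, min_age_days=7):
    """Return an alert message, or None when there is nothing worth saying."""
    established = age_days >= min_age_days
    if established and hours_to_full is not None:
        # Growth-based alerting: only the 4 hour, 2 day and 5 day horizons.
        for horizon_hours, label in ((4, "4 hours"), (48, "2 days"), (120, "5 days")):
            if hours_to_full <= horizon_hours:
                return f"will fill within {label}"
        return None  # further out than 5 days is noise
    # Fallback gates: require BOTH a high percentage and little free space,
    # so 97% consumed with 200G free stays quiet.
    for pct_gate, free_gate_gb in ((97, 1), (95, 5), (90, 20)):
        if pct_used >= pct_gate and free_gb <= free_gate_gb:
            return f"{pct_used}% used with only {free_gb}G free"
    return None


print(disk_alert(age_days=30, hours_to_full=3, free_gb=50, pct_used=60))
print(disk_alert(age_days=1, hours_to_full=None, free_gb=0.3, pct_used=97))
print(disk_alert(age_days=1, hours_to_full=None, free_gb=200, pct_used=97))
```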

In the end, growth rate should be what we want to focus on, as it tells me what I really want to know – You are going to run out of space in X hours, so do something! When we don’t have enough data, we can’t ignore drive space, so we just fall back to the traditional methods.  Either way this becomes actionable data, and we all already know what to do when drive space is running low.

So now what?  If you stop and look at the file system, we could do the same thing there: look at what files are being touched and growing, what is new on the drive that is filling it up, and even point to common file types that could be removed to free up space.  That can get a bit more complicated, but again we want actionable information.  Why should I have to hunt and peck for which files are the top consumers, whether I have 10k temp files that are pretty safe to delete, or even which directories are the ones doing the growing?  These are all things monitoring tools can do today, so why not surface that when you send me the alert?

“Your C: drive is currently at 500M free and will fill up in 3.5 hours.  The fastest growing folder is C:\Windows\Temp @ 140M/hr and the largest folders are C:\Program Files\Something at 40G and C:\App\Custom\database at 27G”

Something like that is pretty powerful.  I know instantly what is wrong and what things are likely causing the problem.  I can now go add space or look at the FS to see why things are growing.
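As a small sketch of how that context could be derived, assume the monitoring tool already has two snapshots of folder sizes (in MB) taken some hours apart.  The function name, the folder paths and the sizes here are made up for illustration.

```python
def describe_growth(before, after, hours, top_n=2):
    """Find the fastest growing folder and the largest folders.

    before/after map folder path -> size in MB; `hours` is the gap between them.
    """
    # A folder missing from the first snapshot is treated as brand new.
    rates = {f: (after[f] - before.get(f, 0)) / hours for f in after}
    fastest = max(rates, key=rates.get)
    largest = sorted(after, key=after.get, reverse=True)[:top_n]
    return fastest, rates[fastest], largest


before = {r"C:\Windows\Temp": 100, r"C:\Program Files": 40000, r"C:\App": 27000}
after = {r"C:\Windows\Temp": 240, r"C:\Program Files": 40000, r"C:\App": 27010}

fastest, rate, largest = describe_growth(before, after, hours=1)
print(f"Fastest growing folder is {fastest} @ {rate:.0f}M/hr; "
      f"largest folders are {' and '.join(largest)}")
```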

Primer on Monitoring 2.0 + CPU Ready

Certainly, we’ve come a long way with performance monitoring in the last few years.  New technologies like intelligent baselining and predictive analytics are getting more accurate, but that is not even 50% of the battle.  As you read through blogs and look at all of the tools out there, most continue to report that “Metric X” or “Property Y” is unhealthy for Z reason.  What people need is something that can surface the real importance of the problem, the context of why it is a problem and what they should do to fix it.

Take CPU %Ready: by itself it can tell us only a little about what is actually happening to a VM at a given point.  If %Ready is over X% or Y ms, then Alert!  Why is it high?  You have to go hunt and peck for the reasons, or maybe the tools at least point you in a direction.  If you are monitoring the systems, you HAVE the data.  Use that data, make it more valuable with additional context.  Below is my basic logic of where we need to start going in performance monitoring for %Ready.

  • Problem – CPU %Ready is High
    • Does the VM have a limit?  Does it need to have a limit?
      • Recommend removing the limit
    • Does the VM use all of its resources?  Is it running at 10% or 100%, both over a long period of time and right now?
      • Recommend based on utilization over time vs. allocation to reduce the cpu count allocated to the VM
    • Is it even this VM?
      • A lot of times it can be another VM or set of VMs on the same host that have too many vCPUs or high utilization and that in turn affects the VM.
      • Look for other misbehaving VMs and show those.
    • vCPU to Core ratio?
      • If the ratio is high, look for another host with a lower ratio and recommend moving the VM there.
    • What’s waiting?
      • Let’s look inside the guest and see what is requesting those resources and show the top N processes, they want the resource so it would be nice to see them.

So with all of those things, it is an AND discussion.  If a VM has a limit and it is on a host with a high vCPU to core ratio AND there are other misbehaving VMs, we need to surface the importance of those added data points.
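The checks above can be sketched as a function that collects every contributing factor rather than stopping at the first one.  The `VmStats` fields and all of the thresholds (4:1 ratio, 90% busy, 20% idle) are hypothetical placeholders and not values taken from any VMware guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VmStats:
    name: str
    vcpus: int
    cpu_util_pct: float
    cpu_limit_mhz: Optional[int] = None


def diagnose_ready(vm, neighbors, host_cores,
                   max_ratio=4.0, busy_pct=90.0, idle_pct=20.0):
    """Collect every factor that may explain high %Ready on `vm`."""
    findings = []
    if vm.cpu_limit_mhz is not None:
        findings.append(f"{vm.name} has a CPU limit of {vm.cpu_limit_mhz} MHz")
    if vm.vcpus > 1 and vm.cpu_util_pct < idle_pct:
        findings.append(f"{vm.name} uses only {vm.cpu_util_pct:.0f}% of {vm.vcpus} vCPUs")
    for other in neighbors:  # is it even this VM?
        if other.cpu_util_pct >= busy_pct:
            findings.append(f"{other.name} ({other.vcpus} vCPUs) is at "
                            f"{other.cpu_util_pct:.0f}% utilization")
    ratio = (vm.vcpus + sum(n.vcpus for n in neighbors)) / host_cores
    if ratio > max_ratio:
        findings.append(f"vCPU to core ratio is {ratio:.1f}:1")
    return findings


vm_x = VmStats("VM X", vcpus=2, cpu_util_pct=50.0, cpu_limit_mhz=1000)
others = [VmStats("VM Y", 16, 95.0), VmStats("VM Z", 8, 100.0)]
for finding in diagnose_ready(vm_x, others, host_cores=4):
    print(finding)
```

Because every check appends to the same list, a VM with a limit on an overcommitted host next to noisy neighbors gets all three reasons in one alarm.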

What is the answer? Show all of that in an alarm and then make it actionable (or even allow them to kick off a workflow right from there).

Problem – VM X has high %Ready because it has a CPU limit of X MHz.  VM Y (with 16 vCPUs and 95% utilization) and VM Z (8 vCPUs and 100% utilization) combined with a vCPU to core ratio of 8.4:1 are impacting your %Ready on VM X.

Solution – If possible, remove the CPU limit on VM X.  Additionally, move VM X to Host X.  VM Y and VM Z are actively using most of the resources and the vCPU to core ratio is high (8.4:1) vs. another host in the cluster.

Simple, concise, looks across the environment to surface what is important, why it is important and what we should do.

Performance Monitoring 2.0 – Let’s go!

I’ve been trying to start back blogging, and it took a great set of bloggers at VMware PEX this year to finally kick it off.  Their advice? Stick to what you know, build a little bit at a time and be informative.  So here we go.

I’ve been lucky enough to be in virtualization for about 10 years now, ESX 1.5 days for those keeping track, and before virtualization I was an IT administrator like most of you out there.  We did performance monitoring for our company with traditional tools like BMC, MOM/SCOM, perfmon/top and generic scripts to get what we needed.

The funny thing is, 10 years later there are 100s of new tools, yet we still monitor performance mostly in the same way.  We present a layer of data to the user and they are left to interpret that data and then perform an action.  Even today, with all our technology, performance monitoring is behind.  Why is 5% CPU ready bad?  What does it really mean and what should I do about it?  Do I even care about the 1000s of VMs that are healthy?

It’s long overdue that we change the way system administrators monitor performance.  My hope is to share with you my thought process on specific issues: surface only the important data, back that up with WHY it is bad, actually SHOW you the things that cause the pain, and later on present PowerShell or other scripts that perform all of this.

VMworld – What to bring?

I’ve been attending VMworld since 2006, and over the past 6 years I’ve learned a lot about what to pack and how to best organize my week.  So I thought I would share some of those things with you.

Obviously, you should bring your laptop, phone, clothes etc… but what else should you bring?  Firstly, bring a spare battery pack for your phone/tablet!  When you are running around Moscone there are lots of dead/weak signal spots and your phone will be searching for signal and eating up a ton of power.  On top of that, if you are using Twitter, Facebook or others you can count on half a day or less of phone life.  Personally, I have a little myCharge Power Bank 3000, which has USB and iPhone connectors so I can share it out if a friend is running low or power my own.

If you have one, a powered USB Hub is a life saver and will make you 1000 friends at times.  I always bring one to share.

Along that same vein, I would bring small pre-organized cables, i.e. bundled with cable ties.  There is nothing worse than having all of your phone cables, power packs etc… tangled in a mess.  I bring a bunch of cables all wire wrapped so they are long enough to plug into my laptop and give me 6 – 12″ of length.  I also bring a few extra cable ties in case I need to break one or someone needs one.

Next, bring comfortable shoes… 2 – 3 pairs.  Maybe it’s common sense, since you should rotate to a different pair every 2 – 3 days max anyways, but make sure you bring several comfortable pairs that are NOT new.  Now is the time to break in new shoes, not at VMworld when you will get blisters if they don’t fit right.  Add to that the cost of getting a different, non-blistering, pair in SF… ouch!  Just bring a few, I tell folks this every year.

Food!  Did I mention food?  VMware does a great job of feeding folks, even if it is boxed food 😉  I always recommend that you bring or buy a box of granola bars and take a few with you each day.  You are going to be doing a lot of walking in those shoes of yours and a lot of talking, so having something readily available can be a life saver at times.  Also, it will keep you from crashing too early if you want to go to all of those late night parties.

Each year, VMware tends to give you a backpack, so you don’t need to bring one to the show.  However… I hate lugging that around and frankly they tend to be large.  My wife hates me for it, but a very small messenger bag (ok, mine is a Murse) that can hold my phone, tablet, charging cables and food is all I need.  Less is more when you are running around for a week.  Also, it’s 100x easier to spot my bag vs. which of the 20k VMworld backpacks is mine.