Disk Space – What matters?

One of the more common questions people ask me about monitoring is how to know accurately when they are going to run out of disk space.  The common method is to alert when used capacity reaches 80, 90, 95% or something to that effect.  What if the drive is 2TB?  Even at 95% full, that means there is ~102G free.  So would I really want to know that it is low on space?  But would I want to alert only on size?  What about growth rates?  If I have a server that is normally full, but it’s not growing, would I even want to know that its percentage or size is at the limit?  What action would I take?

So what should you care about?  How can I reliably tell when I am running out of drive space?

The simple answer is growth rate.  If the drive is growing, when will it reach full?  This has to be looked at both long term AND short term, using something like a moving average so we don’t get too much noise.  We also want to know if the drive starts filling up quickly, so in my opinion a balance has to be struck between two different growth rates: short term and long term.
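
To make the short/long term idea concrete, here is a minimal Python sketch.  Everything in it is an assumption for illustration: the `Sample` type, the 24 hour and 168 hour windows, and the choice to project with the faster of the two rates.  It also uses a plain slope across each window rather than a true moving average, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    hours: float    # time of the sample, in hours since an arbitrary start
    used_gb: float  # space used at that time, in GB


def growth_rate(samples, window_hours):
    """Average growth in GB/hour across the trailing window, or None."""
    if len(samples) < 2:
        return None
    latest = samples[-1]
    # Keep only samples that fall inside the trailing window.
    in_window = [s for s in samples if latest.hours - s.hours <= window_hours]
    if len(in_window) < 2:
        return None
    oldest = in_window[0]
    elapsed = latest.hours - oldest.hours
    if elapsed <= 0:
        return None
    return (latest.used_gb - oldest.used_gb) / elapsed


def hours_until_full(samples, capacity_gb, short_window=24, long_window=168):
    """Project hours until full using the faster of the two growth rates."""
    candidates = (growth_rate(samples, short_window),
                  growth_rate(samples, long_window))
    rates = [r for r in candidates if r is not None]
    if not rates:
        return None  # not enough data to say anything
    rate = max(rates)  # assumption: react to whichever rate is worse
    if rate <= 0:
        return None  # flat or shrinking, so no projected fill time
    free_gb = capacity_gb - samples[-1].used_gb
    return free_gb / rate
```

The samples are assumed to be sorted oldest first.  On a real host they could come from something like Python’s `shutil.disk_usage`, polled on a schedule.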

Beyond growth rate, you still want to allow for the more traditional gates.  If the host/VM/drive was just built, it will go from 0 to 25% or something, and that skews the data.  So you have to balance the freshness of the data with the traditional 80/90/95% and the MB/GB remaining that you are comfortable with.

What I propose is something like this:

  • Check the age of the object – If it is new, calculate growth, but don’t use it until we have enough datapoints.
  • Calculate a daily growth rate (from many 2-10 min chunks), which then feeds into the overall weekly growth rate (many hourly chunks).  You can even track the weekly rate over time and look at that for longer-term trending.
  • If we are established and we have high growth, tell me when I get 4 hours, 2 days and 5 days out.
    • Anything else is noise.  Why would I care that I’m a month from running out of space?  I want to know prior to the weekend that I could have issues next week (5 days).  I am busy, so tell me that I have a day or two to deal with it (2 days).  Ok, it’s still growing and getting close, time to expand the drive, delete stuff etc… (4 hours)
  • If we don’t have a reliable set of growth rate data, fall back to gates based on free space and/or percentage.  Set reasonable gates, 90/95/97% and/or 20G, 5G, 1G.  Clearly this part is more about you knowing your environment, because one size doesn’t fit all.
    • Right – Your D: drive is 97% consumed and only has 300M free.
    • Wrong – Your D: drive is 97% consumed and you have 200G free.  (Really?)
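
The list above can be sketched as a single decision function.  This is only one reading of the proposal: the function name, the level names, and the `min_age_days` and `min_datapoints` defaults are hypothetical.  Pairing each percentage gate with a free-space gate (both must trip) is my interpretation of the Right/Wrong example.

```python
def disk_alert(free_gb, pct_used, age_days, datapoints, hours_to_full,
               min_age_days=7, min_datapoints=500):
    """Return "critical", "warning", "notice", or None for no alert.

    hours_to_full is the projected time until the drive is full,
    or None when the drive is not growing.
    """
    # Only trust growth data once the object is old enough and has
    # enough samples (both thresholds are assumed values).
    established = age_days >= min_age_days and datapoints >= min_datapoints

    if established:
        if hours_to_full is None:
            return None  # established and not growing: treat as noise
        # Growth gates from the list: 4 hours, 2 days, 5 days.
        for limit_hours, level in ((4, "critical"),
                                   (48, "warning"),
                                   (120, "notice")):
            if hours_to_full <= limit_hours:
                return level
        return None  # further out than 5 days is noise

    # Fallback: traditional gates, most severe first.  Each gate needs
    # BOTH the percentage and the free-space condition to be true.
    for pct_gate, free_gate_gb, level in ((97, 1, "critical"),
                                          (95, 5, "warning"),
                                          (90, 20, "notice")):
        if pct_used >= pct_gate and free_gb <= free_gate_gb:
            return level
    return None
```

With these gates, a new drive at 97% with 300M free raises the top level, while a new drive at 97% with 200G free stays quiet.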

Growth rate should be what we focus on, as it tells me what I really want to know – You are going to run out of space in X hours, so do something!  When we don’t have enough data, we can’t ignore drive space, so we fall back to the traditional methods.  In the end this becomes actionable data, and we all already know what to do when drive space is running low.

So now what?  If you stop and look at the file system, we could do the same thing there: look at which files are being touched and growing, what is new on the drive that is filling it up, and even point to common file types that could be removed to free up space.  That can get a bit more complicated, but again we want actionable information.  Why should I have to hunt and peck to find which files are the top consumers, whether I have 10k temp files that are pretty safe to delete, or which directories are doing the growing?  These are all things monitoring tools can do today, so why not surface them in the alert?

“Your C: drive is currently at 500M free and will fill up in 3.5 hours.  The fastest growing folder is C:\Windows\Temp @ 140M/hr and the largest folders are C:\Program Files\Something at 40G and C:\App\Custom\database at 27G”

Something like that is pretty powerful.  I know instantly what is wrong and what is likely causing the problem.  I can now go add space or look at the file system to see why things are growing.
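
As a rough sketch of how such an alert could be put together, the Python below builds the message from two snapshots of folder sizes.  The function name and its parameters are hypothetical, and it assumes the monitoring tool has already collected per-folder sizes in MB at two points in time.

```python
def describe_disk(drive, free_mb, hours_to_full, before_mb, after_mb,
                  interval_hours, top_n=2):
    """Build an alert message from two snapshots of folder sizes.

    before_mb / after_mb map folder path -> size in MB, taken
    interval_hours apart (assumed to be gathered elsewhere).
    """
    if interval_hours <= 0 or not after_mb:
        raise ValueError("need a positive interval and at least one folder")

    # Growth per folder in MB/hour; folders that are new count from zero.
    growth = {path: (size - before_mb.get(path, 0.0)) / interval_hours
              for path, size in after_mb.items()}
    fastest = max(growth, key=growth.get)

    # Largest folders by current size, reported in GB.
    largest = sorted(after_mb, key=after_mb.get, reverse=True)[:top_n]
    largest_text = " and ".join(
        f"{path} at {after_mb[path] / 1024:.0f}G" for path in largest)

    return (f"Your {drive} drive is currently at {free_mb:.0f}M free "
            f"and will fill up in {hours_to_full:.1f} hours. "
            f"The fastest growing folder is {fastest} "
            f"@ {growth[fastest]:.0f}M/hr "
            f"and the largest folders are {largest_text}")
```

The point of keeping it as one function is that the alert carries the cause along with the symptom, so nobody has to go digging through the drive first.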