One of the more recent hot topics in the performance analysis/monitoring space as of late has been the concept of behavioral analytics, learning algorithms or baselining. The concept itself is quite simple – Look for patterns in data sets and as the data set gets larger, the algorithms can get closer and closer to predicting the behavior over time or through correlation through multiple metrics/data points.
Take internet usage for example at your company. The most heavy times of internet traffic tend to occur when folks arrive and log into the network in the morning, around lunch time and at the end of day. This sort of repeatable pattern can be analyze and over time become a baseline for expect behavior of the metric(s) for internet usage. In this way you don’t have to understand the behavior and set the more traditional threshold of something like ‘Internet Usage over X’ is abnormal so alert on it. This is an over simplified version of what happens, but it gets the point across
No single person, or team of people, can be expected to truly understand the expected behavior of all of the workloads running, let alone how they would change over time. When you start to scale up to larger environments, these sort of behavioral analytics are crucial to the enterprise. For that reason large enterprises are increasingly looking for tools that can help them understand when they really have pain vs. noise of traditional thresholding methods.
But… there is a downside to learning algorithms that is hard to program around. Suppose your current environment you have 100ms of latency to your storage devices. This is obviously not good, but if you plug it into a learning algorithm, it will learn that the behavior might be normal. Yet you know that it is not good, the algorithms would come to expect that and you get the inverse, when it deviates away from that expected behavior you could get false positives. In that way, a metric like latency is not a good fit for a learning algorithm by itself. It needs to look at multiple metrics to try to correlate behavior or use a traditional threshold base value where ‘Latency over Y’ is bad.
Just some food for though.