We had a day-long war room yesterday in which we ported a lot of our scripts and systems from writing to Ganglia to writing to Graphite. We got to simplify a lot of stuff, primarily around rate-of-change calculations. Instead of doing those ourselves, we now send in total values and let Graphite handle the "per second" part of things.
So it was a little odd when Hachi called me over and pointed out something funny. He was plotting the function "derivative(metric.memcache_gets)", where memcache_gets is the total lifetime gets that it has served. Over the last hour, the graph hovers around 3k. When I zoom out to the last 3 hours, though, it hovers around 180k. Weird. Hachi did point out that the 1 hour mark is when we switch aggregation intervals and start storing coarser data. That didn't smell right, though.
A derivative is the rate of change of one thing with respect to another. Since Graphite graphs against time, you get the units of the y-axis (memcache gets) divided by the units of the x-axis (seconds). That's what a derivative is, basically. It's the slope of the graph. Rise over run. Change in y over change in x. Algebra.
So when I went code-diving, I was surprised to see that the "derivative" function in Graphite just plots the difference between successive data points. So as we moved to coarser intervals, the difference became greater. (I didn't notice at the time, but the delta was 60x.) Apparently there's a "perSecond" function that takes the delta between two data points and divides it by the delta in time between them. That's the one we want.
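To make the difference concrete, here is a minimal Python sketch of the two calculations. It is not Graphite's actual code: the function names, the constant rate of 50 gets per second, and the 60-second and 3600-second storage intervals are all assumptions I picked so the numbers line up with the 3k and 180k we saw on the graphs.

```python
def successive_deltas(values):
    """Difference between neighbouring points, ignoring time entirely.
    This is roughly what "derivative" plots."""
    return [b - a for a, b in zip(values, values[1:])]


def per_second(values, step_seconds):
    """Same deltas, divided by the time between points.
    This is roughly what "perSecond" plots."""
    return [delta / step_seconds for delta in successive_deltas(values)]


RATE = 50  # hypothetical: memcache serves a steady 50 gets per second


def sample_counter(step_seconds, n_points):
    """A lifetime counter growing at RATE, sampled every step_seconds."""
    return [RATE * step_seconds * i for i in range(n_points)]


fine = sample_counter(60, 5)      # assumed fine-grained archive
coarse = sample_counter(3600, 5)  # assumed coarse archive

print(successive_deltas(fine))    # [3000, 3000, 3000, 3000]
print(successive_deltas(coarse))  # [180000, 180000, 180000, 180000]
print(per_second(fine, 60))       # [50.0, 50.0, 50.0, 50.0]
print(per_second(coarse, 3600))   # [50.0, 50.0, 50.0, 50.0]
```

The plain delta changes by 60x purely because the storage interval changed, while the per-second value stays at 50 no matter how the data is stored.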
Jeez. "perSecond" is the real derivative. "derivative" really is just a delta. How confusing and misleading.
(Personally, I think everything should be normalized to "per <unit of time>", ideally seconds, in Graphite. The data here is completely dependent on the granularity of the backing file, which basically means that you don't know what units to use unless you know how the data is stored. Same with the "count" metric that statsd emits with counters. "rate" is normalized and is accurate no matter what, whereas "count" depends on the flushInterval, etc. At least that one, we've blacklisted so it doesn't get recorded in Graphite anymore.)
What is the bug # that you filed with the graphite devs?
Posted by: Charlie | Friday, January 25, 2013 at 09:55 AM
I haven't, yet. It seems like it would break existing use to make this change, even if it is to fix it. (Maybe people have come to expect derivative to be a delta between data points.) But maybe I should let them make that decision?
Posted by: Abe Hassan | Friday, January 25, 2013 at 10:48 AM
Have you looked into the way graphite calculates the 90th percentile? I suspect they're doing that wrong too. I've been thinking about that recently hard enough to strain a muscle, and I'm pretty sure you can't calculate percentiles from aggregated time periods. You need either the raw original data or a big enough random sampling of the original data where the confidence interval for the number of things in that top 10% is acceptable.
So if in the first time period you have the values 1..100, and in the second time period you have the values 1..30, the 90th percentile over the combined time period should be 87. But if you've aggregated away all the values and saved only something like the Five Number Summary, I don't think you're going to be able to come up with that.
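That arithmetic can be checked with a small Python sketch. It assumes the nearest-rank definition of a percentile (other definitions interpolate and can give slightly different answers), and the helper name is made up for illustration.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it. pct is an integer 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, to avoid floating-point surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


first = list(range(1, 101))   # values 1..100
second = list(range(1, 31))   # values 1..30

p_first = percentile_nearest_rank(first, 90)
p_second = percentile_nearest_rank(second, 90)
p_combined = percentile_nearest_rank(first + second, 90)

print(p_first, p_second)         # 90 27
print(p_combined)                # 87
print((p_first + p_second) / 2)  # 58.5
```

Neither per-period percentile, nor their average, gets anywhere near 87. Only the combined raw values do.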
This is as much as I've been able to puzzle out from studying the Cartoon Guide to Statistics anyway.
> Algebra.
Calculus, actually ;-)
Posted by: Kevin G. | Monday, February 04, 2013 at 04:53 PM
Yes, you're totally right. You need more information than the Five Number Summary. You can turn two averages into an average if you have the number of original data points; and you can turn two stddevs into one if you have the averages and number of data points. But percentiles are a whole 'nother story.
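For what it's worth, here is a rough Python sketch of the merging that does work. It assumes population standard deviations (not the sample kind), and the function name and test data are invented for the example.

```python
import math
import statistics


def combine(n1, mean1, sd1, n2, mean2, sd2):
    """Merge two (count, mean, population stddev) summaries into one."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    # Each group's variance, plus how far its mean sits from the new mean.
    variance = (
        n1 * (sd1 ** 2 + (mean1 - mean) ** 2)
        + n2 * (sd2 ** 2 + (mean2 - mean) ** 2)
    ) / n
    return n, mean, math.sqrt(variance)


a = [1.0, 2.0, 3.0, 4.0]  # made-up data for period one
b = [10.0, 20.0]          # made-up data for period two

n, mean, sd = combine(
    len(a), statistics.mean(a), statistics.pstdev(a),
    len(b), statistics.mean(b), statistics.pstdev(b),
)

# Compare against computing directly from the raw values.
print(math.isclose(mean, statistics.mean(a + b)))
print(math.isclose(sd, statistics.pstdev(a + b)))
```

There is no equivalent of `combine` for percentiles: the summaries simply don't carry enough information.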
My suspicion is that Graphite's concept of percentiles is related to the data points it has stored. So it's not the 90th percentile *at that point*, but rather the 90th percentile of the data in the metric. To get 90th percentile at a given point in time, I would use statsd, which can calculate that and emit it to Graphite.
So there's a percentile at a point in time, and then a percentile across all time (or across the last X data points). I suspect Graphite is doing the latter. Technically valid, but super duper confusing.
Posted by: Abe Hassan | Monday, February 04, 2013 at 05:24 PM