## graphite's derivative function lies

##### Friday, January 25, 2013

We had a day-long war room yesterday in which we ported a lot of our scripts and systems from writing to Ganglia to writing to Graphite. We got to simplify a lot of stuff, primarily around rate-of-change calculations. Instead of doing those ourselves, we now send in total values and let Graphite handle the "per second" part of things.

So it was a little odd when Hachi called me over and pointed out something funny. He was plotting the function "derivative(metric.memcache_gets)", where memcache_gets is the total lifetime gets that the server has served. Over the last hour, the graph hovered around 3k. When I zoomed out to the last 3 hours, though, it hovered around 180k. Weird. Hachi did point out that the 1 hour mark is when we switch aggregation intervals and start storing coarser data. That didn't smell right, though.

A derivative is a rate of change of one thing with respect to another. Since Graphite graphs against time, you should get the units of the y-axis (memcache gets) divided by the units of the x-axis (seconds). That's what a derivative is, basically. It's the slope of the graph. Rise over run. Change in y over change in x. Algebra.

So when I went code-diving, I was surprised to see that the "derivative" function in Graphite just plots the difference between successive data points. So as we moved to coarser intervals, with more time between points, the difference became greater. (I didn't notice at the time, but the jump from 3k to 180k was exactly 60x.) Apparently there's a "perSecond" function that takes the delta between two data points and divides it by the delta in time between them. That's the one we want.
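To make the difference concrete, here's a minimal Python sketch of the two calculations as described above. It is not Graphite's actual code. The helper names, the steady 300 gets/sec, and the 10-second and 600-second steps are all made-up assumptions, picked only so the numbers line up with the 3k and 180k we saw.

```python
def successive_deltas(points):
    """Difference between successive values: the behavior described for "derivative"."""
    return [b[1] - a[1] for a, b in zip(points, points[1:])]


def per_second(points):
    """Delta in value divided by delta in time: the behavior described for "perSecond"."""
    return [(b[1] - a[1]) / (b[0] - a[0]) for a, b in zip(points, points[1:])]


def sample_counter(rate, step, duration):
    """Hypothetical lifetime counter growing at `rate` per second, sampled every `step` seconds.

    Returns a list of (timestamp, total) pairs.
    """
    return [(t, rate * t) for t in range(0, duration + 1, step)]


# Assumed numbers: a steady 300 gets/sec, stored at two different resolutions.
fine = sample_counter(rate=300, step=10, duration=60)       # 10-second points
coarse = sample_counter(rate=300, step=600, duration=1800)  # 600-second points

print(successive_deltas(fine)[0], successive_deltas(coarse)[0])
print(per_second(fine)[0], per_second(coarse)[0])
```

Same traffic, same counter, but the plain delta changes by 60x when the step does, while the per-second value stays put.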

Jeez. "perSecond" is the real derivative. "derivative" really is just a delta. How confusing and misleading.

(Personally, I think everything should be normalized to "per <unit of time>", ideally seconds, in Graphite. The data here is completely dependent on the granularity of the backing file, which basically means that you don't know what units to use unless you know how the data is stored. Same with the "count" metric that statsd emits with counters. "rate" is normalized and is accurate no matter what, whereas "count" depends on the flushInterval, etc. At least that one, we've blacklisted so it doesn't get recorded in Graphite anymore.)
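The statsd "count" versus "rate" point works the same way. Here's a rough sketch, assuming a counter that sees a steady stream of increments and a flush that reports the raw total alongside the total divided by the flush interval in seconds. The function name and the 50 events/sec are hypothetical, for illustration only.

```python
def flush_counter(events_per_second, flush_interval_ms):
    """Simulate one flush of a counter under steady traffic (illustrative only).

    Returns (count, rate): count is the raw total seen during the interval,
    rate is that total normalized to per-second.
    """
    interval_seconds = flush_interval_ms / 1000
    count = events_per_second * interval_seconds
    rate = count / interval_seconds
    return count, rate


# Same assumed traffic, two different flush intervals.
print(flush_counter(50, 10_000))  # 10-second flush
print(flush_counter(50, 60_000))  # 60-second flush
```

The count is six times bigger with the longer flush interval even though nothing about the traffic changed, which is why you can't read it without knowing the interval. The rate comes out the same either way.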