the case of the disappearing metrics

I'm so mad at myself right now.

We're getting statsd and graphite deployed, and we've been feeding some counters into statsd. We've generally found this to be fairly straightforward, except that one of our counters has been erratic. Sometimes it reports correctly, and sometimes it reports zero (rather than null). In fact, watching the whisper files directly, I sometimes see the timestamp and the value written to the file, and then immediately overwritten with a zero in the next run.

Hachi, Franck, and I pored over every configuration file, and we pored over the source code to carbon and whisper. We set up the manhole in carbon so we could ssh in and see what it thought the world was. We ran tcpdumps every which way. We turned off various daemons on a hunch that something else might be writing to carbon. We twiddled frequencies, retention policies, everything. Nothing. No dice. Everything looked like it should.

We've been digging into this for about two days now. Franck and I were banging our heads against it today when we saw something unusual:

$ echo counters | nc statsd 8126 | grep 18342 0

Huh. Interesting. ... oh, no, those actually become the same metric in graphite. So even though statsd thinks it's two different metrics, carbon ends up coalescing them down into one. One of them wins, depending on how statsd flushes them and carbon sorts them, causing the flip-flopping. What's worse is that the second one is actually a bug in code that I had written. It's supposed to emit statistics for "", but there's a subtle case where the subcategory could be empty.
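
A guard along these lines would catch the empty case before the name ever reaches statsd. This is only a sketch: `metric_name` is a hypothetical helper, not part of statsd or graphite, and the component names in the usage note are made up for illustration.

```python
def metric_name(*parts):
    """Join components into a dotted metric name, rejecting empty ones.

    Hypothetical helper for illustration; not part of statsd or graphite.
    """
    cleaned = [str(part).strip() for part in parts]
    for part in cleaned:
        if not part:
            # An empty component would produce a name like "a." or "a..b",
            # which is the kind of near-duplicate described above.
            raise ValueError("empty metric component in %r" % (parts,))
        if "." in part:
            # A dot inside a component would silently add a hierarchy level.
            raise ValueError("metric component %r contains a dot" % part)
    return ".".join(cleaned)
```

With this, `metric_name("requests", "18342")` returns `"requests.18342"`, while `metric_name("requests", "")` raises `ValueError` instead of quietly emitting a second, nearly identical name.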

(As an aside, we did find an issue with how we had statsd and graphite talking to each other. In particular, statsd's flushInterval needed to be configured to be the same as one of the archive intervals in graphite, or else some of the stats get discarded.)
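
That constraint is easy to check mechanically. The sketch below assumes statsd's `flushInterval` is given in milliseconds and that the graphite side is a `retentions` string in the style of storage-schemas.conf (for example `10s:6h,1m:7d`). The function names and the example values are made up for illustration.

```python
import re

# Seconds per unit suffix in a retention precision such as "10s" or "1m".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def archive_intervals(retentions):
    """Return seconds-per-point for each archive in a retentions string."""
    intervals = []
    for archive in retentions.split(","):
        precision = archive.strip().split(":")[0]
        match = re.fullmatch(r"(\d+)([smhdwy]?)", precision)
        if match is None:
            raise ValueError("unrecognized retention precision: %r" % precision)
        count, unit = match.groups()
        # A bare number (no suffix) is treated as seconds.
        intervals.append(int(count) * _UNIT_SECONDS.get(unit, 1))
    return intervals


def flush_matches_archive(flush_interval_ms, retentions):
    """True if the flush interval equals one archive's seconds-per-point."""
    if flush_interval_ms % 1000 != 0:
        return False
    return flush_interval_ms // 1000 in archive_intervals(retentions)
```

For instance, `flush_matches_archive(10000, "10s:6h,1m:7d")` is true because the first archive stores one point every 10 seconds, whereas a 15000 ms flush interval matches neither archive.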
