i hate administrator privileges

Often as system administrators, we're doing three things at once: administering an application or a service; providing support to the folks who use that service; and using that service itself.

In systems where permissions can be assigned on a fine-grained level, those tend to get rolled up into a handful of roles. The most basic case is to have two roles, "user" and "administrator", where the latter has all the privileges of the former. When answering a user's question, I've frequently found myself wondering: is this something everyone can do, or is it because I have elevated privileges?

Unless the permission system is VERY explicit — consider sudo for example — it's kind of hard to keep up. This is a little easier with applications that have a separate administrative panel. You know you're there, and it's obvious that access to it is a function of your role. But those roles and permissions extend into the rest of the application. In an application where every screen and button and field relies on some level of access, visually indicating the required access level isn't particularly practical.

In cases like this, I've come to the conclusion that I want two accounts: a regular user account for my day-to-day use, which I can use to provide instructions or guidance to others, and an administrator account that I can use to resolve issues or help with other tasks.

I've been using this pattern with our Gitlab installation at work and I've found it useful. It is a little annoying to switch between two different browsers, but the hassle has been worth it, many times over, in being able to give clear and useful instructions to other people. And, the hidden benefit: it highlights the places where our permission assignments aren't sufficient for everyday users. 


multiple google calendars in the ios calendar app

I like using the built-in iOS Mail and Calendar apps for my Google accounts. I can't delete 'em, so I might as well use them. Otherwise my OCD will take over, and in general they seem to work just fine for me and my weird workflows...

...except for one thing: they only show my primary calendar. I have access to a handful of shared calendars, and there's no easy way to display them.

Turns out that there's an option to display more calendars. Go to the Google Mobile Sync page from your phone, find your device, and check off which calendars you want to sync. Then blammo, the Calendar app will allow you to enable/disable syncing any or all of them. How convenient.

(And yes I see the "Sync with your Mobile Device" option in Google Calendar, but the diddler to set up multiple calendars took me through a whole bunch of FAQs. It's kind of buried in there.)


magic numbers and tuning syslog-ng

WARNING: window sizing for tcp sources were changed in syslog-ng 3.3, the configuration value was divided by the value of max-connections(). The result was too small, clamping to 100 entries. Ensure you have a proper log_fifo_size setting to avoid message loss.; orig_log_iw_size='25', new_log_iw_size='100', min_log_fifo_size='100000'.

Most of what I've found on the Internets (including this fantastic syslog-ng tuning guide from Etsy) will tell you to adhere to this formula:

max_connections * log_fetch_limit <= log_iw_size

They also tell you that log_fetch_limit defaults to 10. So you do the math a little bit and you wonder why setting log_iw_size = 10000 with max_connections = 1000 still gives you a warning. Turns out there's another restriction, and yes, the right hand side is a magic number:

log_iw_size / max_connections > 100

The kicker is that the error message is super useless, because it tells you it's adjusting the per-connection window to a minimum of 100, but that's just its internal bookkeeping. You can't control that number directly. What you actually do is set a large enough log_iw_size that, split across all your connections, each one gets over 100.
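To make that concrete, here's a quick back-of-the-envelope check in Python, using the numbers from above. (The 100 is syslog-ng's internal clamp, and the tie to min_log_fifo_size is my reading of the warning, not something from the docs.)

    # Both constraints, with the numbers from the example above.
    max_connections = 1000
    log_fetch_limit = 10                # the documented default
    log_iw_size = 10000

    # The formula everyone quotes -- this one passes:
    print(max_connections * log_fetch_limit <= log_iw_size)    # True: 10000 <= 10000

    # The hidden per-connection constraint that actually trips the warning:
    per_connection_window = log_iw_size / max_connections      # 10000 / 1000 = 10
    print(per_connection_window > 100)                          # False: clamped to 100

    # To keep syslog-ng quiet, log_iw_size needs to be bigger than
    # 100 * max_connections -- on the order of 100000 here, which lines up
    # with the min_log_fifo_size='100000' in the warning above.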

In other words, the error message gives you a lot of config options and values, none of which actually help you fix your config. Jeez. I've sent in a pull request to adjust the error message (and after discussion, it looks like the warning only addresses the actual mathematical issue, and not the broader tuning question...)


package management and the devops revolution

All this talk over the last few years about devops skirts around one particularly cynical take on traditional system administration: that it slows product development down. That error-proofing everything is unnecessary, that it's purely overhead and catastrophizing, at the expense of shipping new features. For many, certain failure modes aren't worth preventing, and there's a whole lot of these tradeoffs happening in the deployment space in order to ship new features, faster.

I think the core of this comes down to package management.

This is, by far, the biggest point of friction I've experienced. Certainly there are other places where traditional system administration is being improved or automated -- like better monitoring, better provisioning, etc. But I think that developers are looking for a more progressive approach to package management while systems administrators are a little more conservative.

In the broadest of strokes, packaging is the act of taking software and making the installation process repeatable. In turn this means that we're generating an artifact once, and installing it on a collection of servers; which means the state of the system is inspectable; which then means that we can ensure that every server is consistent. Tools like apt and yum are the usual things we think of -- and they're very mature package managers -- but even something like AMIs or tarballs could do the trick, depending on your needs.

It means that you're not re-compiling, re-building, re-executing your build steps on every single server. Putting aside the time it takes, you're ultimately multiplying the number of times you're doing an operation, increasing the likelihood of failure. (There's also a security angle here, in that some would argue that production hosts shouldn't have the "build" toolchain available: either installed or over the network. I personally agree with them, but I acknowledge there's tons of debate on this point.)

But it's slow and painful.

If you're relying on your upstream provider to give you packages, you're pretty much restricted to an out-of-date existence. Bugs are fixed in newer versions! Github is social coding! We add features to the libraries we use! Waiting on someone to give you packages is a losing proposition. When we optimize for multiple deploys a day, waiting a month or a year for your upstream to provide new functionality is a non-starter. If we want the latest and greatest -- and we often do -- we need to figure out how to package it ourselves.

Using pip or gem or cpanm is easy. Taking that output, packaging it, figuring out how to make it relocatable, etc, isn't very easy. Making sure that pip installs the exact same thing every time isn't easy. (You might specify exact versions of all your dependencies, but do those dependencies do the same?) Packaging each individual module takes forever. Packaging your app with its dependencies is bulky. Tools like fpm get us a lot closer, but it's still an extra step.
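To make the "do your dependencies pin theirs?" problem concrete, here's a rough sketch -- not any particular tool; the file name and the flat name==version format are just assumptions -- that compares what actually got installed against an exactly-pinned list:

    # Compare installed package versions against an exactly-pinned requirements
    # file. Anything that isn't pinned with == gets skipped, which is exactly
    # the gap that makes "pip installs the same thing every time" hard to prove.
    from importlib.metadata import version, PackageNotFoundError

    def check_pins(requirements_path="requirements.txt"):
        mismatches = []
        with open(requirements_path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "==" not in line:
                    continue
                name, _, wanted = line.partition("==")
                try:
                    installed = version(name)
                except PackageNotFoundError:
                    installed = None
                if installed != wanted:
                    mismatches.append((name, wanted, installed))
        return mismatches

    for name, wanted, installed in check_pins():
        print(f"{name}: pinned {wanted}, installed {installed}")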

When "npm install" works most of the time, it's easy to start relying on it all of the time. When you're already deploying multiple times a day, it's easy to not want to make any particular installation repeatable. When you can replace any machine and just re-run your setup script, it's easy to not trap and handle failures in that process. But resiliency and convergence mean more than just "if I try this enough times, eventually it will work".

Over the last decade, I think the pendulum has swung back and forth from "package everything as an rpm and write an init script" to "just run it out of your homedir with screen". I don't like swinging between the extremes -- neither is great. Finding a healthy balance, somewhere in between, is what devops is about. It's about finding a way to move fast while still maintaining a rapidly-deployable, operable system.


why we duplicate metrics

A few months back, we were getting the Graphite and statsd stack set up in our world. The frontend functionality offered by Graphite and the easy-to-use interface offered by statsd are incredibly powerful. But our use of statsd quickly presented a problem: we wanted to understand cluster health, but we also wanted to understand per-host health so that we could do blue/green deployments.

Statsd doesn't easily give you that ability, at least not out of the box:

  1. If you run a statsd server on each machine, then your metrics stay separate all the way into Graphite, so you don't get an at-a-glance view of the cluster.
  2. If you run a single, central statsd server, you lose per-host visibility. If your app emits a data point for "metricname", then all hosts' data points for that metric get combined together.

You might be thinking that the solution to (1) is simple: that you either use aggregation functions for statistics on the way in, or you use the functions in Graphite's frontend to combine metrics. Unfortunately the math doesn't work in your favor. You can get awfully close! But it won't be accurate.

The reason the math doesn't work is that the summarization is lossy.

For instance, if you're recording 99%ile latencies from each machine, you can't use those to determine the 99%ile across the cluster overall. There are only three cases where it's guaranteed to work: max, min, and sum (or count). It doesn't work for averages, medians, percentiles, etc. You need more information, perhaps even the original data set, in order to derive that data from the constituent sets. It works when every machine's metrics are identical, and it works less and less as one machine becomes anomalous. And since detecting anomalous behavior was the original goal we had in mind, "close enough" wasn't good enough.
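Here's a toy illustration, with made-up numbers, of just how far off the percentile case can get:

    # Four hosts reporting latencies; one of them is misbehaving.
    import random
    random.seed(1)

    hosts = {f"host{i}": [random.gauss(100, 10) for _ in range(1000)] for i in range(3)}
    hosts["host3"] = [random.gauss(500, 50) for _ in range(1000)]   # the anomalous one

    def p99(values):
        ordered = sorted(values)
        return ordered[int(0.99 * (len(ordered) - 1))]

    per_host = [p99(v) for v in hosts.values()]
    everything = [x for v in hosts.values() for x in v]

    print(p99(per_host))      # "p99 of the p99s" -- nowhere near the real answer
    print(p99(everything))    # the true cluster p99 needs the raw data
    # max/min/sum survive summarization; averages and percentiles don't:
    print(max(max(v) for v in hosts.values()) == max(everything))   # True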

So we log our metrics twice, into a central statsd server: once with the hostname, once without.

Now, we didn't want our developers to have to call into statsd twice -- DRY and whatnot -- so we run a proxy on each host. This proxy intercepts statsd calls and multiplexes them [1]. The proxy also munges the name into a standard naming scheme, based on the environment we're in. So, an application developer only needs to submit a data point called "appname.metricname" to statsd (really, the proxy), and the proxy emits it into statsd twice.
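The shape of it is roughly this -- a stripped-down sketch, not the actual proxy; the port, the "prod" environment prefix, and the way gauges are special-cased here are all illustrative, and the real code is in the repo linked below:

    # Listen where the app expects statsd to be, and fan each data point out to
    # the central statsd server: once cluster-wide, once with the hostname.
    import socket

    LISTEN = ("127.0.0.1", 8125)                 # the app talks to "local statsd"
    UPSTREAM = ("statsd.example.com", 8125)      # the real, central statsd (hypothetical)
    HOST = socket.gethostname().split(".")[0]
    ENV = "prod"                                 # hypothetical environment prefix

    def fan_out(packet, sender):
        # statsd lines look like "appname.metricname:1|c"
        for line in packet.decode().splitlines():
            name, _, rest = line.partition(":")
            per_host = f"{ENV}.{HOST}.{name}:{rest}".encode()
            cluster = f"{ENV}.{name}:{rest}".encode()
            sender.sendto(per_host, UPSTREAM)
            if not rest.endswith("|g"):          # gauges only get one name; see footnote [1]
                sender.sendto(cluster, UPSTREAM)

    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(LISTEN)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        data, _ = listener.recvfrom(65535)
        fan_out(data, sender)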

It's a little extra work at the head end of the processing chain, but it gives us a couple ways of bisecting data that we wouldn't otherwise get without either customizing statsd (or the clients) or asking our developers to duplicate all their calls. Hachi described how we've laid out our system in more detail, over at his blog.

For now, we've published our statsd proxies to Say Media's "devops-tools" repo on Github. If it's interesting enough to others, we'll put some effort into trying to get something like this moved upstream, either into the clients or into statsd itself. Let me know! Comment here, tweet me, or just find me at Velocity!

 

[1] Technically we multiplex everything other than "gauge". Because a gauge is an absolute number, recording a cluster-wide metric doesn't really work -- the last one to submit a value in a particular flushInterval wins. Not very interesting, and it's confusing, so we only log gauge values once.