I think we all agree: #monitoringsucks
For me, the holy grail is the single pane of glass (pardon the douchey term), a unified view into your environment that shows you everything you need to know. Latest trends, latest service failures, latest deploys, latest everything. Real-time. So many types of devices! So many types of data! So many ways to visualize and analyze it!
Instead we end up with a world where you have a bunch of tools doing part of a number of duties. Nagios runs checks, but so does Jenkins, and so does Riemann. Collectd collects data, but so does Ganglia, so does OpenNMS, so does Splunk, so does CloudWatch, so does Nagios. Cacti polls your network devices, but so does OpenNMS, so does Nagios. This almost leads me to believe that you need N+1 tools to perform N duties, because they're all so horrible at it. And we haven't even begun to solve the problem of combining data from multiple systems; god forbid you want to put together an alerting policy.
I've been chewing on this for a while, and my thinking on a comprehensive toolset does indeed still require at least a half dozen pieces of software, but with clear roles and responsibilities:
- Collectd collects system and application metrics from each machine (i.e., a pull-based mechanism);
- Statsd aggregates application-level data points into metrics (i.e., a push-based mechanism);
- ???? polls network devices or other sorts of appliances that can't push data themselves;
- Graphite consumes metric data for visualization;
- Riemann consumes metric data for trend analysis and alerting;
- Syslog centralizes all system and application logs;
- Logstash analyzes and reports on those logs;
- Jenkins builds code, continually runs tests, triggers deploys.
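To make the push side of that pipeline concrete, here's a minimal Python sketch of what an application does to get a data point into statsd: it fires a UDP datagram in StatsD's plain-text wire format and forgets about it. The metric names and the localhost:8125 address are illustrative, not prescriptive.

```python
import socket

def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire a StatsD-format datagram over UDP and forget about it.

    The wire format is plain text: <name>:<value>|<type>, where type is
    'c' (counter), 'ms' (timer), or 'g' (gauge).
    """
    payload = f"{name}:{value}|{metric_type}"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode(), (host, port))
    sock.close()
    return payload  # returned only so callers/tests can inspect the format

# Push-based: the app doesn't care if anyone is listening. Statsd
# aggregates these per flush interval and forwards them to Graphite.
send_metric("app.logins.success", 1)       # counter
send_metric("app.request_ms", 42, "ms")    # timer
```

The fire-and-forget UDP design is the point: instrumentation never blocks or breaks the application, which is exactly the property you want from a push mechanism.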
What's missing from this is the glue that treats all of this data as a cohesive whole:
- Correlations across all services: this Jenkins job and this metric and this check and this log file all correspond to the same application.
- Dependency management: this server depends on this network device, which depends on this power strip, so when the power strip fails, page the datacenter guys but just email the server guys.
- Show me all of the above -- graphs, top 10 errors, current alerts -- in one view, and give me another view that shows an outage map based on the relationships between services.
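That dependency rule is simple enough to sketch in a few lines of Python. Everything here -- host names, team names, the parent/child map -- is a made-up toy; the point is just that walking the dependency chain lets you page on the root cause and merely email about downstream symptoms.

```python
# Toy dependency graph: each host maps to the one thing it depends on.
DEPENDS_ON = {
    "web01": "switch01",
    "switch01": "pdu01",   # pdu01 is the power strip
}

# Who gets notified, and how, when a given host is the root cause.
ONCALL = {
    "pdu01": ("page", "datacenter"),
    "switch01": ("page", "network"),
    "web01": ("email", "server"),
}

def root_cause(host, down_hosts):
    """Walk up the dependency chain; return the highest ancestor that is
    also down, or the host itself if everything upstream is healthy."""
    cause, parent = host, DEPENDS_ON.get(host)
    while parent is not None:
        if parent in down_hosts:
            cause = parent
        parent = DEPENDS_ON.get(parent)
    return cause

def route_alert(host, down_hosts):
    """Root causes get their full-severity notification; hosts that are
    only down because something upstream died get a quiet email."""
    if root_cause(host, down_hosts) == host:
        return ONCALL[host]
    return ("email", ONCALL[host][1])

# The power strip dies, and web01's check fails as a consequence:
# pdu01 pages the datacenter team; web01 only emails the server team.
```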
Autodiscovery here might be tough. You might have to actively dictate which metrics, alerts, and checks go together. If it has to be explicit, as I expect it will, then this could be where Nagios steps in: it can accept passive checks from each of the above services and apply some logic to manage dependencies and service/host groups. We can also use that same explicitness to tie together the single pane of glass, the cohesive look into our environment.
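For the Nagios-as-glue idea, a passive check result is just a line written to Nagios' external command file in its documented PROCESS_SERVICE_CHECK_RESULT format. A minimal sketch, assuming the default command file path (yours may differ) and a hypothetical `error_rate` service -- the commented-out call shows how an upstream tool like Riemann or Logstash would hand a verdict to Nagios:

```python
import time

def passive_result(host, service, status, output,
                   cmd_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Append a passive service check result to Nagios' command file.

    Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    where code is 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    """
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, status, output)
    with open(cmd_file, "a") as f:
        f.write(line)
    return line

# e.g. Riemann decides an error rate is out of bounds, and Nagios --
# which owns the dependency tree and host/service groups -- takes it
# from there (notification, escalation, suppression):
# passive_result("web01", "error_rate", 2, "5xx rate above threshold")
```

The appeal of this split is that the analysis tools stay dumb about people and the people tool stays dumb about analysis: Nagios never runs the check, it just routes the result.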
I need to continue reading up and researching in this area to try to understand a couple of things: can we cull that list of services down further? Are there better tools than Nagios to glue these monitoring services together and manage their dependencies and relationships? Does a cohesive, pluggable dashboard app exist that would let us build the single pane of glass? I will share my findings as I have them, but of course, feedback and input is always appreciated.