research on monitoring and the modern sysadmin

This morning I’m working through a chapter on monitoring and logging in my book (tentatively titled the Operations Primer). What I most want to communicate in this chapter is how different a monitoring ‘solution’ is from what many people think they’re doing, and I want to do it without sounding snarky. It’s tough.

As a preview of what I’m writing, here’s a breakdown of what I think any comprehensive monitoring solution needs to include:

“A monitoring system should have the following characteristics:

Delivered from a system-neutral platform

100% available (or as close as you can get it)

Available based on sensible access controls and least-privilege security

Able to deliver information in a flexible manner

No system should be the exclusive source of monitoring information about itself

 

The actions performed by the system should include the following functionality:

 

Generating alerts based on conditions

Generating alerts based on heuristics

Resolving alerts manually

Resolving alerts based on conditions

Displaying performance metrics

Recording event history

Enabling a ‘maintenance mode’ manually or on a schedule

Launching utilities to perform common maintenance tasks

 

Other valuable characteristics of monitoring systems:

 

The admins who run the systems configure the monitors themselves . . .”
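
A few of the actions in that list (alerting on a condition, resolving by condition or by hand, recording event history, and maintenance mode) can be sketched in a handful of lines of Python. This is only a toy illustration: the `Monitor` class, its field names, and the disk-usage threshold are hypothetical and not taken from Zenoss, Nagios, or any other real tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Monitor:
    """Hypothetical threshold monitor for a single metric."""
    name: str
    threshold: float
    maintenance: bool = False   # 'maintenance mode' suppresses alerting
    alert_open: bool = False
    history: List[str] = field(default_factory=list)  # event history

    def _record(self, event: str) -> None:
        self.history.append(f"{self.name}: {event}")

    def observe(self, value: float) -> None:
        """Open or resolve an alert based on a simple condition."""
        if self.maintenance:
            self._record(f"value={value} ignored (maintenance)")
            return
        if value > self.threshold and not self.alert_open:
            self.alert_open = True
            self._record(f"ALERT opened at value={value}")
        elif value <= self.threshold and self.alert_open:
            self.alert_open = False
            self._record(f"alert resolved by condition at value={value}")

    def resolve_manually(self, who: str) -> None:
        """Let an admin close an open alert by hand."""
        if self.alert_open:
            self.alert_open = False
            self._record(f"alert resolved manually by {who}")


def demo() -> Monitor:
    m = Monitor(name="disk_used_pct", threshold=90.0)
    m.observe(95.0)              # condition met: alert opens
    m.observe(80.0)              # condition cleared: alert resolves
    m.observe(97.0)              # alert opens again
    m.resolve_manually("admin")  # closed by hand
    m.maintenance = True
    m.observe(99.0)              # ignored while in maintenance mode
    return m


if __name__ == "__main__":
    for line in demo().history:
        print(line)
```

A real system would add heuristics, scheduling, and notification delivery on top of this, and would not live on the host it is watching.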

And that’s where I left off. The next thing I planned to discuss was the building blocks of monitoring, such as SNMP, ICMP, and data sources, but the last item in that list made me stop and think. The idea that admins should configure their own monitors is a lesson I learned as a SAN administrator trying to get help with Zenoss, which I was not allowed to configure on my own. Lessons learned through pain and annoyance are exactly the kind of material I want to include in my book, so what other lessons have people out there learned?

 

This led me to stop writing and start doing some research. At 6:20am I’m not going to be able to bend the ears of many of my colleagues, but the inter-webs have provided me with two videos from sysadmins who have a lot to say:

 

“The evolution of the SysAdmin & holistic monitoring for apps and servers” by Matt Simmons. Provided by SolarWinds; registration required.

 

“Monitoring Maturity: a 16 year journey and lessons learned” by Simon Finch at Nagios Con 2014