Doing DevOps isn’t easy; I learned the hard way that you can’t force it from the ground up. That said, there are plenty of opportunities to bring some DevOps to your personal practice, no matter how large the organization.
One place that is often neglected is monitoring. In this case, the neglect isn’t about not doing it. There is a lot of advice about Key Performance Indicators (KPIs), monitoring application vs. system metrics, and active vs. passive checks. I don’t want to get into what should be monitored (I don’t think over-collection of metrics is a thing; you never know when you could use that data). I’d like to focus on being a good steward of whatever monitors you have, including the ones inherited from those who went before you. So here are some thoughts on monitoring and alerting I try to live by. Ultimately this will:
- reduce technical debt
- demonstrate ownership and leadership
- provide an opportunity to practice DevOps
know what you’re monitoring and why
Monitors are easy to roll out. This is particularly true for system metrics such as disk, memory, and I/O. Those stats in isolation aren’t generally very interesting, but they can be helpful when there is a real problem. High memory usage is rarely a problem in and of itself; I’ve seen well-tuned systems running at 100% memory utilization without issue. The same goes for CPU. Having 50% idle CPU keeps the alarm from triggering at 75%, but a lot of idle CPU is also a waste of resources.
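To make that concrete: on Linux, much of "used" memory is reclaimable page cache, so a naive utilization number can read near 100% on a perfectly healthy box. A minimal sketch of the idea (the helper and the numbers are illustrative, not tied to any particular monitoring tool):

```python
def memory_pressure(total_kb: int, free_kb: int, cache_kb: int) -> float:
    """Return utilization that ignores reclaimable page cache.

    A naive (total - free) / total can read ~100% on a healthy box,
    because the kernel keeps cache warm on purpose.
    """
    available_kb = free_kb + cache_kb  # cache can be reclaimed on demand
    return (total_kb - available_kb) / total_kb

# A box "at ~99%" by the naive measure may be under little real pressure:
naive = (16_000_000 - 200_000) / 16_000_000               # ~0.99
real = memory_pressure(16_000_000, 200_000, 12_000_000)   # ~0.24
```

The point isn’t this exact formula; it’s that the raw number you alarm on should reflect what actually constitutes a problem on that system.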
Generally, there are other indicators (KPIs, really) that give you a sense that things are broken. The focus needs to be on those. Much of the other data is there to augment them and help troubleshoot the real problem.
It’s also awesome if you can document the why; linking that information from the monitoring system makes it much easier to keep the knowledge alive.
get rid of things you don’t understand
Things change over time. I’ve built monitors in the past that stopped being useful, and it’s hard to remember to go back and clean up. I’ve also seen alarms that had been silenced for months. Every monitoring system has a way to silence alarms, but there are generally no provisions to clean those silences up.
It’s well worth the time to review monitors periodically. I’ve been part of a weekly review of all alarms, which included the silenced ones. If there was no plan to resolve an alarm and nobody could remember why it was there, we purged it. Another great option is to turn silenced alarms back on. Sometimes that helps everyone remember why they were disabled, say, because they triggered every time the 2 a.m. backup ran.
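A review like that is easy to script against whatever API your monitoring system exposes. A hypothetical sketch, assuming you can pull silence records with a timestamp (the data shape and 30-day cutoff are made up for illustration):

```python
from datetime import datetime, timedelta

def stale_silences(silences, now, max_age_days=30):
    """Flag silences old enough that nobody may remember why they exist."""
    cutoff = now - timedelta(days=max_age_days)
    return [s for s in silences if s["since"] < cutoff]

silences = [
    {"alarm": "disk /var on db01", "since": datetime(2020, 1, 5)},
    {"alarm": "cpu on web03",      "since": datetime(2020, 3, 1)},
]

for s in stale_silences(silences, now=datetime(2020, 3, 10)):
    print(f"review or purge: {s['alarm']}")
```

Feeding a list like that into the weekly review keeps the "silenced and forgotten" pile from growing unbounded.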
adjust the settings for your needs
Virtually every monitoring system ships with default settings, intended to make things easy and get stuff monitored quickly. The easy way is rarely the right way. Even something as simple as a disk check can be very different depending on whether it’s for a tightly configured DNS box without heavy logging or for a database server that sees a lot of tables being duplicated. It gets harder from there.
If you don’t get things configured right at roll-out, you’ll end up coming back to get them right. Odds are you’ll end up revisiting the settings several times, but that’s no excuse to pat yourself on the back for rolling out defaults. Thinking about the settings forces a better understanding, which is critical if you want to make use of that data later. After all, if you don’t know why something alarms at a certain point, you can’t be sure the alarm is actually meaningful.
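One way to force that thinking is to make thresholds an explicit, per-role decision rather than a single global default. A sketch of the idea, with hypothetical roles and illustrative numbers:

```python
# Disk-usage alert thresholds (percent full) chosen per host role.
# Roles and numbers are illustrative, not recommendations.
DISK_THRESHOLDS = {
    "dns":      {"warn": 90, "crit": 95},  # light logging, stable footprint
    "database": {"warn": 70, "crit": 80},  # needs headroom for table copies
}
DEFAULT = {"warn": 80, "crit": 90}

def disk_state(role: str, pct_used: float) -> str:
    t = DISK_THRESHOLDS.get(role, DEFAULT)
    if pct_used >= t["crit"]:
        return "critical"
    if pct_used >= t["warn"]:
        return "warning"
    return "ok"

# The same 75% disk usage means different things on different roles:
disk_state("dns", 75)       # "ok"
disk_state("database", 75)  # "warning"
```

Writing the table out like this also documents the reasoning: the comment next to each role is exactly the "why" you’ll want when the alarm fires.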
monitor all you want but only alert sparingly
Again, defaults often encourage sending alarms for everything. You shouldn’t. I’ve received pager storms, and they only make it harder to find out what’s really going on. The KPIs are likely among the few things that should alert, and some alarms are not worth waking people up for. For example, an SSL certificate that warns two weeks before expiring can safely alarm during business hours only.
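That distinction, page now versus notify during business hours, is simple to encode. A hypothetical sketch (the hours and the KPI flag are assumptions for illustration; real schedulers also handle weekends, holidays, and escalation):

```python
from datetime import datetime

BUSINESS_HOURS = range(9, 17)  # 09:00-17:00, illustrative

def should_page(is_kpi: bool, now: datetime) -> bool:
    """Page immediately only for KPI alarms; everything else
    waits for business hours (e.g. a cert expiring in two weeks)."""
    if is_kpi:
        return True
    return now.hour in BUSINESS_HOURS

should_page(True,  datetime(2020, 5, 1, 3, 0))   # True: KPI, wake someone
should_page(False, datetime(2020, 5, 1, 3, 0))   # False: wait for morning
should_page(False, datetime(2020, 5, 1, 10, 0))  # True: business hours
```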
With those things in mind, I recommend you only alarm and wake people up for production issues. All other environments are by definition not production and therefore not critical. I know: if development can’t happen, you might as well be down. I’d counter that if it’s worth alerting on alarms in staging, then it should also warrant bringing in the account and product managers, and possibly the executives, to address this critical issue.
The above approach has served me well over the years. It’s empowering to make these changes and helps reduce the burden of being on-call. You’ll also improve the overall health of the environment by reducing complexity and increasing understanding.
Go forth and DevOps.