Book Review: Visible Ops Handbook, Kevin Behr, Gene Kim, and George Spafford
You know the old saw: "Sorry for the long letter; I didn't have time to write you a short one"? This is a short one, but I don't feel at all cheated on page count. It's a Good Thing when a book covers the topic... and stops.
The authors codify the operational approaches that highly proficient IT shops have adopted. I'm hostile to dumb performance metrics, but they identify high-performing organizations using some measurements even I can agree are useful. Some of the metrics: unplanned downtime, Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR), percentage of staff time spent on unscheduled/unplanned work, and the ratio of servers to the administrators supporting them. They noticed a quantum gap: the outfits that did well in one of these areas tended to do well in all of them, and there wasn't really a continuum. Organizations were either high performers or low performers, with not much in the middle.
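For concreteness, here's a minimal sketch of how a couple of those metrics could be computed from an incident log. The record format, field names, and sample numbers are my own invention for illustration, not anything from the book.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    """One outage record; fields are hypothetical."""
    detected_at: float   # hours since some epoch
    resolved_at: float   # hours since some epoch
    planned: bool        # was this scheduled/authorized work?

def mttr(incidents):
    """Mean Time to Repair: average time from detection to resolution of unplanned outages."""
    unplanned = [i for i in incidents if not i.planned]
    return sum(i.resolved_at - i.detected_at for i in unplanned) / len(unplanned)

def mtbf(incidents, observation_hours):
    """Mean Time Between Failures: operating time divided by number of failures."""
    failures = [i for i in incidents if not i.planned]
    downtime = sum(i.resolved_at - i.detected_at for i in failures)
    return (observation_hours - downtime) / len(failures)

log = [
    Incident(detected_at=10.0, resolved_at=14.5, planned=False),
    Incident(detected_at=200.0, resolved_at=201.0, planned=False),
]
print(f"MTTR: {mttr(log):.1f} h, MTBF: {mtbf(log, observation_hours=720):.1f} h")
```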
Turns out the high performers all independently adopted similar operational approaches, and there really isn't a middle way. There's a discipline to the discipline, and it starts with "the only acceptable number of unauthorized changes is zero."
High-performing IT shops have a culture of change management. They cite a stat indicating that 80% of outages (incidents or time? both?) are self-inflicted. That's obviously the place to look for improvements. And you won't get improvements with just a little change management.
This approach has a lot to offer besides operational efficiency. IT goons have to deal with useless auditing and compliance directives. (Some of it is worthwhile, but it looks like even the worthwhile efforts are not well done in practice.) Having effectively managed controls in place makes for auditable networks.
Mean Time to Repair is improved by a Culture of Causality. Once change control is in place, you can have faith that you know when something changed, and look in those places for the cause. Proficient shops - even those running Windows - reboot one-tenth as often as their less proficient counterparts. It's possible to hit a 90% first-fix rate.
They claim (and I believe) that it's also cheaper to rebuild than repair. Figuring out what went sour is time-consuming and uncertain. Automated rebuild is the way to go.
Interestingly, they claim that the frenzy of patching many of us go through is not part of the culture at these proficient shops. A patch is a change, and subject to the same build verification as any other architecture change. Consequently, OS patches get rolled out more or less organically, as part of a whole system. This is a little harder for me to swallow; I have been immersed in the SANS koolade, and depending on the application, I don't think you can wait for some patches. I do grok the "one, few, many" approach used by Tom Limoncelli (http://www.aw-bc.com/catalog/academic/product/0,1144,0201702711,00.html): pilot a patch, test, expand to a pilot group, test, release to production, cross fingers. If you have a disciplined shop, this approach works. If you are like the rest of us, the balance of risk suggests to me you are better off dealing with the turmoil a patch might cause than living with a worm outbreak.
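As an illustration of that "one, few, many" progression, here's a toy sketch of a staged rollout driver. The host names, apply_patch(), and health_check() are placeholders I made up, not anything from the book or from Limoncelli.

```python
# Toy staged-rollout driver for the "one, few, many" idea.
# Hostnames, apply_patch(), and health_check() are hypothetical placeholders.

RINGS = {
    "one": ["canary01"],
    "few": ["pilot01", "pilot02", "pilot03"],
    "many": ["prod%02d" % n for n in range(1, 21)],
}

def apply_patch(host: str) -> None:
    print(f"patching {host}...")   # stand-in for the real deployment step

def health_check(host: str) -> bool:
    return True                    # stand-in for real post-patch verification

def staged_rollout():
    for ring_name in ("one", "few", "many"):
        for host in RINGS[ring_name]:
            apply_patch(host)
            if not health_check(host):
                # Stop the rollout the moment a ring shows trouble.
                raise RuntimeError(f"{host} failed verification in ring '{ring_name}'")
        print(f"ring '{ring_name}' healthy, proceeding")

staged_rollout()
```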
Another interesting point is that change management is MORE important during a crisis. Convene your change approval team, but stick to the discipline lest you make things worse.
The authors claim that the transition to the techniques and culture the high-performing organizations have in common has been made in a few months. I don't see it happening that quickly around here, but we could certainly get started. They outline four phases:
1) stabilize the patient - set up an "electric fence" so that you can monitor configuration changes and hold staff responsible for unauthorized ones. Confining changes to those approved by a change management team, and only during maintenance windows, will have an immediate effect, they claim. But accountability is key - without it, there's an inevitable slip back to the sloppy practices everyone is used to. The fence comes from tools like Tripwire, which can tell you when things change. You can then refer to authorized changes and work orders and see if they match up; if not, some coaching is in order. (There's a sketch of the idea after this list.)
2) ID the fragile systems you don't dare touch
3) develop a repeatable build library so you can start moving services off the systems identified in #2
4) continuous improvement
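To make the "electric fence" in phase 1 concrete, here's a minimal sketch of the detect-and-reconcile loop: hash the files you care about, compare against a baseline, and check any drift against the list of authorized changes. The file paths, baseline store, and authorized-change list are all made up for illustration; a real shop would use Tripwire or a similar tool.

```python
# Minimal sketch of a Tripwire-style "electric fence": detect drift and
# reconcile it against authorized changes. Paths, the baseline file, and
# the authorized-change list are hypothetical, not the book's tooling.
import hashlib, json, pathlib

WATCHED = ["/etc/ssh/sshd_config", "/etc/httpd/httpd.conf"]   # example paths
BASELINE_FILE = pathlib.Path("baseline.json")
AUTHORIZED = {"/etc/httpd/httpd.conf"}   # paths covered by approved work orders

def fingerprint(paths):
    """Map each watched path to the SHA-256 of its current contents."""
    result = {}
    for p in paths:
        path = pathlib.Path(p)
        data = path.read_bytes() if path.exists() else b""
        result[p] = hashlib.sha256(data).hexdigest()
    return result

def audit():
    current = fingerprint(WATCHED)
    if not BASELINE_FILE.exists():
        BASELINE_FILE.write_text(json.dumps(current, indent=2))
        print("baseline recorded")
        return
    baseline = json.loads(BASELINE_FILE.read_text())
    for path, digest in current.items():
        if digest != baseline.get(path):
            status = "authorized" if path in AUTHORIZED else "UNAUTHORIZED - time for coaching"
            print(f"changed: {path} ({status})")

audit()
```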
I am not sure how much this applies to my environment - we are pretty stable, but not proficient. Most of the action on our net is at the desktop level, and I think this is aimed more at network operations and the data center than the Help Desk. It has me thinking, though. If we could extend the principles to everybody, what would change? What would it look like?