An IT Management Perspective on Improving System Availability - Willie Wheeler

What's the role of change in causing outages?

So now let's talk about the management problem of improving system availability. One plausible idea I've heard is to base the approach on the observation, attributed by some to Gartner, that 80% of unplanned outages are caused by "change". As an aside, I'm not myself sure that this is really what Gartner says. I did some Internet searches, and the actual statement, by Donna Scott, seems to be this: "80% of unplanned downtime is caused by people and process issues, including poor change management practices, while the remainder is caused by technology failures and disasters." I read that as including change but possibly also including other things like ill-conceived processes, carelessness, prioritizing firefighting to such an extent that root cause analyses are neglected, etc. (Though the article I just linked to would seem to agree that change is the primary cause.) Anyway, I'll just take it as a given that change is an important cause of outages, partly because of what Gartner says and partly because it just jibes well with my personal experience on the matter. I've pulled the botched-release all-nighter a time or two in my career. But does that really mean that change management is the silver bullet for availability woes?

Change is just one cause among many

Changes to the operational environment certainly cause outages, but so do a lot of other things. Poor capacity planning causes outages. Poor response processes (front-line and escalation) unnecessarily prolong outages and hence contribute to unavailability. Buggy software causes outages. So do poor monitoring, outdated distribution lists, inattention to root cause analyses and problem resolution, fault-intolerant system design, and a lot of other things. The key is to figure out which of those factors is causing the most grief and to hit that first. (In this respect, it's very much like working on system performance issues: you look for the bottleneck, fix it, and find the next bottleneck if you're still having issues.) We may well discover that change is responsible for our problems. In that case we'll reevaluate our change management processes and work with our teams to improve them.

Many environments are sufficiently complex that it's not always feasible to get to root cause for every issue. But it's not necessary to do so. As long as we can get at root cause for a reasonably representative sample of the issues that occur, we have enough information to steer us in the right direction. Sometimes we may have to just make a judgment call about the root cause, based on the available evidence, and that's probably unavoidable until we've established a certain capability and rhythm.
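To make the "find the bottleneck first" idea concrete, here is a minimal sketch of tallying downtime by root cause over a sample of outages. The outage records, the cause categories, and the function name `downtime_by_cause` are all invented for illustration; a real list would come from whatever system you use to track outages.

```python
from collections import defaultdict

# Hypothetical sample: (root cause category, minutes of downtime).
# These categories and numbers are made up for illustration only.
outages = [
    ("change", 120),
    ("capacity", 300),
    ("software bug", 45),
    ("change", 60),
    ("capacity", 240),
    ("monitoring gap", 30),
]

def downtime_by_cause(records):
    """Return (cause, total minutes, share of all downtime), worst first."""
    totals = defaultdict(int)
    for cause, minutes in records:
        totals[cause] += minutes
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(cause, minutes, minutes / grand_total) for cause, minutes in ranked]

for cause, minutes, share in downtime_by_cause(outages):
    print(f"{cause:15s} {minutes:4d} min  {share:.0%}")
```

Ranking by total downtime rather than by number of incidents is a design choice; either could be reasonable depending on what hurts your shop most.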

Some suggestions on reducing outages

Here are the suggestions that I would offer:

Track your outages in some central location. It's a lot harder to figure out what's causing outages if you don't have a list of actual outages in front of you. If tech support has one list and operations has another, try to get some consolidation or at least consolidated reporting around those.

Don't immediately jump to conclusions about what's causing your outages, even if Gartner says that on average 80% of unplanned outages are caused by change (and again I'm not even sure that's what Gartner is saying). That's just an industry average, and like any distribution, there's variance around that average. Your IT shop may be struggling so much with capacity planning, for example, that the percentage of issues caused by change skews downward from the industry average. If you allocate 500GB of SAN for an application that requires 1TB, you will eventually have an outage. If you set your monitoring thresholds at 97% disk utilization, there's a good chance that you won't be able to respond quickly enough when that alert actually trips.
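A rough sketch of why a 97% threshold leaves so little room: the time you have to respond is just the free space divided by the growth rate. The 5GB/day growth figure and the assumption of linear growth are mine, chosen only to illustrate the arithmetic; `days_until_full` is a hypothetical helper, not part of any monitoring tool.

```python
def days_until_full(capacity_gb, used_fraction, growth_gb_per_day):
    """Days of headroom left, assuming linear growth (a simplification)."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth_gb_per_day must be positive")
    free_gb = capacity_gb * (1.0 - used_fraction)
    return free_gb / growth_gb_per_day

# Hypothetical numbers: a 500GB allocation growing by 5GB per day.
for threshold in (0.80, 0.90, 0.97):
    days = days_until_full(500, threshold, 5)
    print(f"alert at {threshold:.0%}: about {days:.1f} days to respond")
```

With these made-up numbers, alerting at 97% leaves roughly three days to get more storage provisioned, against roughly twenty days at 80%.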

Identify a reasonably representative sample of outages from your list and try to get to root cause on them, even if you have to make judgment calls along the way. Root cause analysis will always involve judgment calls anyway, because you always have to decide when to stop asking "why did that happen?" The point is to get a good feeling for what the likely causes are. It's not important to be 100% correct. And you may very well find that poor change management is causing a lot of your outages.

Prioritize the root causes and implement the appropriate fixes. It may help to matrix it out so you can see which problems are high-impact, low-effort and fix those first. If the root causes are technical in nature, implement the technical fixes. If they're process-related, work with the people who actually use the processes on a daily basis to understand how those processes might be improved. It's nearly certain that they will have important insights about the specific process that you as a manager need to understand.
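One way to "matrix it out" is sketched below. The 1-to-5 scoring scale, the cutoff of 3, the example fixes, and the names `Fix`, `quadrant_rank`, and `prioritize` are all assumptions for illustration. The only ordering the text above calls for is that high-impact, low-effort items come first; how the remaining quadrants are ranked here is my own choice.

```python
from dataclasses import dataclass

@dataclass
class Fix:
    name: str
    impact: int  # 1 (low) to 5 (high); a judgment call
    effort: int  # 1 (low) to 5 (high); a judgment call

def quadrant_rank(fix, cutoff=3):
    """0 = high-impact/low-effort (do first); higher ranks come later."""
    high_impact = fix.impact >= cutoff
    low_effort = fix.effort < cutoff
    if high_impact and low_effort:
        return 0
    if high_impact:
        return 1  # high-impact, high-effort
    if low_effort:
        return 2  # low-impact, low-effort
    return 3      # low-impact, high-effort

def prioritize(fixes):
    """Sort by quadrant, then by higher impact, then by lower effort."""
    return sorted(fixes, key=lambda f: (quadrant_rank(f), -f.impact, f.effort))

# Hypothetical candidate fixes with made-up scores.
fixes = [
    Fix("Redesign for fault tolerance", impact=5, effort=5),
    Fix("Update distribution lists", impact=2, effort=1),
    Fix("Lower disk alert threshold", impact=4, effort=1),
    Fix("Add peer review to change process", impact=4, effort=3),
]

for fix in prioritize(fixes):
    print(f"{fix.name}: impact={fix.impact}, effort={fix.effort}")
```

The scores themselves should come from the people who work with the systems and processes every day; the code only keeps the ordering honest once those judgments are made.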


Comments (1)

I agree that systems need to be designed for resiliency; without that, you end up chasing your tail in an effort to increase availability. It's much easier to do in an initial design than to try to augment an integrated and often complex system. Do it right the first time: it may cost a little more upfront and take a little more time, but it will usually save you a ton of money and headache on the back end.

In regard to your comment about the difficulties of having many people with varying experience, backgrounds, management approaches, etc., the one thing I can usually spot if I stand back and observe is that everyone wants the same result: to improve availability, for instance. The contention usually comes in the form of differing approaches (opinions) to getting there. One thing I've learned is that I'll offer my experience and suggestions in hopes that they'll be taken into account, but ultimately I'll let them do their job; they're the experts and should have more insight than I.

Actually, being in an operations area, I get the benefit, if you want to call it that, of seeing what's causing the outages and developing a good sense of what's going on. Gartner's actual statement was "80% of downtime is due to human error and problems created by process, such as inadequate testing and unauthorized changes". I agree with this, and I don't think Change Management by itself is the silver bullet. However, I do think effectively managing the changes in your environment will greatly improve availability. The Change Management process itself needs to be effective: there needs to be accountability, and there needs to be an understanding of what the change will affect in the environment.

Yes, track outages holistically; I couldn't agree more. How can you improve what you can't measure? But when you're using multiple systems and tools for the same process, that's easier said than done.

Much of what you're talking about and suggesting is directly related to ITIL, or IT Service Management. You've touched on Problem, Change, Availability, Capacity, Event, and Incident Management in this blog. Take a look at it sometime; I think you'll find it interesting. :)
By Bob Leitner on Jun 5, 2008 at 3:39 PM PDT

Copyright © 2008 Wheeler Software, LLC.