Photo credit: *Zara

An IT Management Perspective on Improving System Availability

Ideas on how to approach the problem of improving system availability, written from an IT management perspective.

From a technical point of view, there are lots of best practices around reducing the number, severity and scope of system outages. For example, we might use redundant system components with automatic failover, try to make systems integration points bulletproof, automate deployments to reduce or even eliminate release windows, isolate components of the system from one another, and so on. While I won't call those things "easy", they are in many cases solved problems and it's just up to people to learn the techniques and apply them. We know how to set up Oracle RAC, how to deploy a pair of redundant load balancers, how to implement retry logic in an application, and how to serve up a web page even though one of the data sources is unavailable.
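As one illustration of the "solved problem" flavor of these techniques, here is a minimal sketch of retry logic combined with graceful degradation, so a page still renders when one data source is down. The function names (`call_with_retry`, `render_page`), the retry counts, and the fallback text are all hypothetical choices for this example, not something prescribed by any particular framework.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(
    fetch: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); retry with exponential backoff; return fallback if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Assumption: any exception is treated as transient. Real code
            # would usually catch narrower error types and log the failure.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback


def render_page(
    fetch_headlines: Callable[[], List[str]],
    fetch_offers: Callable[[], List[str]],
) -> str:
    """Build a page from two data sources, degrading each section independently."""
    headlines = call_with_retry(fetch_headlines, fallback=["(headlines unavailable)"])
    offers = call_with_retry(fetch_offers, fallback=["(offers unavailable)"])
    return "\n".join(["Headlines:"] + headlines + ["Offers:"] + offers)
```

The `sleep` parameter is injectable only so the backoff can be skipped in tests. The design point is that each data source fails on its own: an outage in one degrades a section of the page rather than the whole page.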

Not as easy as it sounds

From an IT management perspective, however, the challenges around system availability are of a much different sort. Ideas that seem obvious to someone coming from a software development background will often directly contradict the ideas that seem obvious to somebody from infrastructure and ops. And there may not be any quick way to resolve those differences, especially when they stem from deep-seated differences in management philosophy. When people from product development, internal-facing software development, infrastructure and operations, and tech support get together, the ideas on simply how to approach the problem can easily lead to some pretty heated (and often entertaining) "discussions". The issues go well beyond the purely technical:

  • When there's an outage, who are the first responders? Tech support? Operations? Some combination of the two? What level of product expertise do they need? What level of functional expertise do they need?
  • How do we make sure that the people working the issue have the technical and contact information they need? How do we deal with the fact that the relevant information is distributed across multiple wiki spaces, SharePoint, Service Desk, Talisma Knowledgebase, multiple asset management tools, application configuration files, ad hoc databases, outdated wall charts, people's brains ("tribal knowledge"), and so forth?
  • What's the escalation process? Do we have a standardized path here or does it depend on the system involved?
  • How do we get to root cause on issues and get fixes in place so we can prevent the same thing from happening in the future?
  • How do we balance the desire to snapshot the current state of an unavailable system (for diagnostic purposes) with the desire to restore service as quickly as possible?
  • How do we help people understand the difference between service restoration and problem resolution?
  • How do we properly resolve the Nagios vs. SiteScope disagreement between these two groups, or the Data Guard vs. GoldenGate debate between these other two groups, etc.?
  • When there are multiple simultaneous outages, how do we decide which system gets priority, when different people at different levels have different ideas about which systems are more important? How do we deal with the fact that the rules around priority may be fairly complex (system A is generally more important than system B, but if it's Sunday night, then system B on this set of pools is definitely more important, and if it's Monday night, then system B on this other set of pools is definitely more important)?
  • What are our assumptions about outages? Is it acceptable to have planned outages for releases and maintenance? How long should the release and maintenance windows be? What happens if we're a global company with customers in multiple time zones? Are we still OK with planned downtime?
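
To make the priority question above concrete, here is a rough sketch of how rules like "system A usually wins, except system B on certain pools on Sunday or Monday night" might be written down as an ordered rule list. Everything specific here is an assumption made up for illustration: the pool names, the definition of "night" as 18:00 or later, and the numeric priorities. Agreeing on those values is exactly the organizational problem being described.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Outage:
    system: str
    pool: str
    started: datetime


# Hypothetical pool groupings; real ones would come from the business.
SUNDAY_POOLS = {"pool-1"}
MONDAY_POOLS = {"pool-2"}
NIGHT_START_HOUR = 18  # assumption: "night" means 18:00 or later


def _is_night(o: Outage, weekday: int) -> bool:
    # datetime.weekday(): Monday is 0, Sunday is 6
    return o.started.weekday() == weekday and o.started.hour >= NIGHT_START_HOUR


def sunday_night_rule(o: Outage) -> Optional[int]:
    if o.system == "B" and o.pool in SUNDAY_POOLS and _is_night(o, 6):
        return 0
    return None


def monday_night_rule(o: Outage) -> Optional[int]:
    if o.system == "B" and o.pool in MONDAY_POOLS and _is_night(o, 0):
        return 0
    return None


def default_rule(o: Outage) -> Optional[int]:
    # General case: A outranks B; anything else comes last.
    return {"A": 1, "B": 2}.get(o.system, 3)


# Rules are checked in order; the first one that applies decides.
RULES: List[Callable[[Outage], Optional[int]]] = [
    sunday_night_rule,
    monday_night_rule,
    default_rule,
]


def priority(o: Outage) -> int:
    """Return the priority of an outage; lower numbers are more urgent."""
    for rule in RULES:
        result = rule(o)
        if result is not None:
            return result
    return 99  # not reached while default_rule is last in RULES


def triage(outages: List[Outage]) -> List[Outage]:
    """Order simultaneous outages from most to least urgent."""
    return sorted(outages, key=priority)
```

Writing the rules as an explicit, ordered list does not settle who is right about priority, but it does turn competing opinions into something that can be reviewed, versioned, and argued about before the outage rather than during it.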

Despite the fact that it's easy for any individual person to come up with obvious-sounding answers to questions like the ones just listed ("Hey, how about if everybody uses <insert_favorite_tool_here>?"), it's not nearly as easy once you have hundreds of people involved, hundreds of systems, thousands of servers, and leaders with different backgrounds and experience, different management approaches, different levels in the organization, and sometimes conflicting motivations or potentially adversarial organizational relationships. (For example, if there's an outage, developers and operations start pointing the finger at each other; the database team blames the storage team, the storage team blames the developers for being wasteful with disk, etc.) People make mistakes and may hide them. These and myriad other challenges contribute to system availability being "easier said than done".


Comments (1)

I agree that systems need to be designed for resiliency; without that you end up chasing your tail in an effort to increase availability. It's much easier to do in an initial design than to try to augment an integrated and often complex system. Do it right the first time: it may cost a little more up front and take a little more time, but it will usually save you a ton of money and headaches on the back end.

In regard to your comment about the difficulties of having many people with varying experience, backgrounds, management approaches, etc., the one thing I can usually spot if I stand back and observe is that everyone wants the same result: to improve availability, for instance. The contention usually comes in the form of differing approaches (opinions) to getting there. One thing I've learned is that I'll offer my experience and suggestions in hopes that they'll be taken into account, but inevitably I'll let them do their job; they're the experts and should have more insight than I.

Actually, being in an operations area, I get the benefit, if you want to call it that, of seeing what's causing the outages and developing a good sense of what's going on. Gartner's actual statement was "80% of downtime is due to human error and problems created by process, such as inadequate testing and unauthorized changes". I agree with this. I don't think Change Management by itself is the silver bullet; however, I do think effectively managing the changes in your environment will greatly improve availability. The Change Management process itself needs to be effective, there needs to be accountability, and there needs to be an understanding of what the change will affect in the environment.

Yes, track outages holistically, couldn't agree more. How can you improve what you can't measure? But when you're using multiple systems/tools to do the same process... easier said than done.

Much of what you're talking about and suggesting is directly related to ITIL, or IT Service Management. You've touched on Problem, Change, Availability, Capacity, Event, and Incident Management in this blog. Take a look at it sometime; I think you'll find it interesting. :)
By Bob Leitner on Jun 5, 2008 at 3:39 PM PDT

Copyright © 2008 Wheeler Software, LLC.