Photo credit: *Zara

An IT Management Perspective on Improving System Availability

Ideas on how to approach the problem of improving system availability, written from an IT management perspective.

From a technical point of view, there are plenty of best practices for reducing the number, severity and scope of system outages. For example, we might use redundant system components with automatic failover, try to make system integration points bulletproof, automate deployments to reduce or even eliminate release windows, isolate components of the system from one another, and so on. While I won't call those things "easy", they are in many cases solved problems, and it's just up to people to learn the techniques and apply them. We know how to set up Oracle RAC, how to deploy a pair of redundant load balancers, how to implement retry logic in an application, and how to serve up a web page even though one of the data sources is unavailable.
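To make the last two items concrete, here is a minimal sketch of retry logic combined with graceful degradation. It is only an illustration under stated assumptions: the names `fetch_with_retry` and `render_page`, the exponential backoff, and the placeholder text are all hypothetical, and real code would choose its own error types, delays and fallback content.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=0.5, fallback=None, sleep=time.sleep):
    """Call fetch() up to `attempts` times; return `fallback` if every call fails.

    Hypothetical helper: only connection and timeout errors are treated as
    transient here. Any other exception propagates to the caller.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            # Back off exponentially before retrying, but not after the last attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback


def render_page(sections, sleep=time.sleep):
    """Build a page from (title, fetch) pairs, degrading sections whose source is down."""
    parts = []
    for title, fetch in sections:
        body = fetch_with_retry(fetch, fallback="(temporarily unavailable)", sleep=sleep)
        parts.append(f"{title}: {body}")
    return "\n".join(parts)
```

With one healthy source and one that always times out, `render_page` still returns a page: the healthy section shows its data and the other shows the placeholder instead of failing the whole request.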

Not as easy as it sounds

From an IT management perspective, however, the challenges around system availability are of a very different sort. Ideas that seem obvious to someone coming from a software development background will often directly contradict the ideas that seem obvious to somebody from infrastructure and ops. And there may not be any quick way to resolve those differences, especially when they stem from deep-seated differences in management philosophy. When people from product development, internal-facing software development, infrastructure and operations, and tech support get together, even the question of how to approach the problem can easily lead to some pretty heated (and often entertaining) "discussions". The issues go well beyond the purely technical:

  • When there's an outage, who are the first responders? Tech support? Operations? Some combination of the two? What level of product expertise do they need? What level of functional expertise do they need?
  • How do we make sure that the people working the issue have the technical and contact information they need? How do we deal with the fact that the relevant information is distributed across multiple wiki spaces, SharePoint, Service Desk, Talisma Knowledgebase, multiple asset management tools, application configuration files, ad hoc databases, outdated wall charts, people's brains ("tribal knowledge"), and so forth?
  • What's the escalation process? Do we have a standardized path here or does it depend on the system involved?
  • How do we get to root cause on issues and get fixes in place so we can prevent the same thing from happening in the future?
  • How do we balance the desire to snapshot the current state of an unavailable system (for diagnostic purposes) with the desire to restore service as quickly as possible?
  • How do we help people understand the difference between service restoration and problem resolution?
  • How do we properly resolve the Nagios vs. SiteScope disagreement between these two groups, or the Data Guard vs. GoldenGate debate between these other two groups, etc.?
  • When there are multiple simultaneous outages, how do we decide which system gets priority, when different people at different levels have different ideas about which systems are more important? How do we deal with the fact that the rules around priority may be fairly complex (system A is generally more important than system B, but if it's Sunday night, then system B on this set of pools is definitely more important, and if it's Monday night, then system B on this other set of pools is definitely more important)?
  • What are our assumptions about outages? Is it acceptable to have planned outages for releases and maintenance? How long should the release and maintenance windows be? What happens if we're a global company with customers in multiple time zones? Are we still OK with planned downtime?
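As an aside on the priority question above, time-dependent rules like the system A vs. system B example can at least be written down as an ordered rule table. The following is a rough sketch only: the pool names, the definition of "night" as 18:00 or later, and the names `Outage`, `Rule` and `triage` are all assumptions made up for illustration. Encoding the rules is the easy part; getting everyone to agree on them is not.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable

# Hypothetical pool groupings; the article only says "this set" and "this other set".
POOL_SET_1 = {"pool-1", "pool-2"}
POOL_SET_2 = {"pool-3", "pool-4"}
SUNDAY, MONDAY = 6, 0  # datetime.weekday() numbering
DEFAULT_PRIORITY = 99


@dataclass(frozen=True)
class Outage:
    system: str
    pool: str


@dataclass(frozen=True)
class Rule:
    description: str
    applies: Callable[[Outage, datetime], bool]
    priority: int  # lower number means "work on this first"


def is_night(now: datetime) -> bool:
    # Assumption: "night" means 18:00 or later, local time.
    return now.hour >= 18


def b_on(pools, weekday):
    """Build a predicate for 'system B on these pools on this weekday night'."""
    def applies(outage: Outage, now: datetime) -> bool:
        return (outage.system == "B" and outage.pool in pools
                and now.weekday() == weekday and is_night(now))
    return applies


# Ordered rules: the first match wins, so exceptions come before the defaults.
RULES = [
    Rule("B on pool set 1, Sunday night", b_on(POOL_SET_1, SUNDAY), 0),
    Rule("B on pool set 2, Monday night", b_on(POOL_SET_2, MONDAY), 0),
    Rule("A in general", lambda o, now: o.system == "A", 1),
    Rule("B in general", lambda o, now: o.system == "B", 2),
]


def priority(outage: Outage, now: datetime) -> int:
    for rule in RULES:
        if rule.applies(outage, now):
            return rule.priority
    return DEFAULT_PRIORITY


def triage(outages, now: datetime):
    """Return the outages ordered by which should be worked first at time `now`."""
    return sorted(outages, key=lambda o: priority(o, now))
```

Keeping the exceptions in one ordered list at least makes the disagreements visible: each row is something a specific person has to sign off on.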

Although it's easy for any individual person to come up with obvious-sounding answers to questions like the ones just listed ("Hey, how about if everybody uses <insert_favorite_tool_here>?"), it's not nearly as easy once you have hundreds of people involved, hundreds of systems, thousands of servers, and leaders with different backgrounds and experience, different management approaches, different levels in the organization, and sometimes conflicting motivations or potentially adversarial organizational relationships. (For example, if there's an outage, developers and operations start pointing the finger at each other; the database team blames the storage team, the storage team blames the developers for being wasteful with disk, etc.) People make mistakes and may hide them. These and myriad other challenges make system availability "easier said than done".

Copyright © 2008 Wheeler Software, LLC.