
From a technical point of view, there are plenty of established practices for reducing the number, severity, and scope of system outages. For example, we might use redundant system components with automatic failover, harden the integration points between systems, automate deployments to reduce or even eliminate release windows, isolate components of the system from one another, and so on. While I won't call those things "easy", they are in many cases solved problems, and it's just up to people to learn the techniques and apply them. We know how to set up Oracle RAC, how to deploy a pair of redundant load balancers, how to implement retry logic in an application, and how to serve up a web page even though one of the data sources is unavailable.
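To make the last two techniques concrete, here is a minimal sketch of retry logic combined with graceful degradation. It is an illustration under stated assumptions, not a production implementation: the names `call_with_retry`, `render_page`, `fetch_profile`, and `fetch_recommendations` are hypothetical, and the data sources are plain Python callables standing in for real services.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying with exponential backoff on any exception.

    The last failure is re-raised once all attempts are used up.
    `sleep` is injectable so the backoff can be observed without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")


def render_page(
    fetch_profile: Callable[[], str],
    fetch_recommendations: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Render a page whose optional section degrades instead of failing."""
    # Assumption: the profile is essential, so a failure here propagates.
    profile = call_with_retry(fetch_profile, sleep=sleep)

    # Assumption: recommendations are optional, so we fall back to a notice.
    try:
        recommendations = call_with_retry(fetch_recommendations, sleep=sleep)
    except Exception:
        recommendations = "Recommendations are temporarily unavailable."

    return f"Profile: {profile}\nRecommendations: {recommendations}"
```

The design choice worth noticing is that the code has to decide which data sources are essential and which are optional. That decision is a product question as much as a technical one.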
From an IT management perspective, however, the challenges around system availability are of a very different sort. Ideas that seem obvious to someone coming from a software development background will often directly contradict the ideas that seem obvious to somebody from infrastructure and ops. And there may not be any quick way to resolve those differences, especially when they stem from deep-seated differences in management philosophy. When people from product development, internal-facing software development, infrastructure and operations, and tech support get together, even the question of how to approach the problem can easily lead to some pretty heated (and often entertaining) "discussions". The issues go well beyond the purely technical:
It's easy for any individual person to come up with obvious-sounding answers to questions like the ones just listed ("Hey, how about if everybody uses <insert_favorite_tool_here>?"). It's not nearly as easy once you have hundreds of people involved, hundreds of systems, thousands of servers, and leaders with different backgrounds and experience, different management approaches, different levels in the organization, and sometimes conflicting motivations or potentially adversarial organizational relationships. (For example, when there's an outage, developers and operations start pointing fingers at each other; the database team blames the storage team, the storage team blames the developers for being wasteful with disk, etc.) People make mistakes and may hide them. These and myriad other challenges are why system availability is "easier said than done".