A Characteristic Often Forgotten: The Resilience of an ICT System

Recently, at an e-commerce company, there was a disaster.

A senior programmer copied a program directly into the production environment. The system crashed. It took four days to restore service and reload the previous data.
That was a real disaster. Of course, the poor programmer took all the blame, and so did his boss.
It was of course important to perform a root cause analysis. We do not intend to go into the technical details of the problem; we want to analyze the root causes in a more general way. In our opinion, the human error that triggered the disaster was not the real culprit. Certainly, the disaster recovery process took too long to restore the system, and this was one of the causes of the poor service.
The root cause was the very low resilience of the application. It was an almost twenty-year-old application, worked and reworked over the years. It now featured more than 5,000 programs, almost 200 ISAM files, and roughly 100 interfaces: a rather complex application indeed. At the same time, it was vital to the organization. In the e-commerce world you cannot accept a four-day RTO, the recovery time objective. The RTO is the target duration and service level within which a business process must be restored after a disaster in order to avoid the unacceptable consequences of a break in business continuity.
What, then, was the root cause? It was the low resilience of the application. Certainly it was old, but over all those years nobody had really bothered to assess its resilience.
So what is resilience? In general, resilience refers to the ability of a material to deform elastically, to bend under stress without becoming permanently distorted. This is exactly what organizations need to meet the growing challenges of their socio-economic environments. In the case of ICT applications, resilience is the ability of a service to adapt to its conditions of use and to withstand external events, so as to ensure the availability of the services to be provided.
Not many ICT professionals evaluate the resilience of a system. They care about performance, such as response time and, in the case of customer-facing applications, usability. Few care about maintainability or flexibility. Almost nobody cares about making an application resilient and testing it.
On the other hand, many IT departments do care about disaster recovery. The problem is that in this way they attack the devil from the tail: they should work on the root causes. It is interesting that one central bank has recently issued a strong recommendation on disaster recovery and business continuity, even going so far as to mandate "at least" a yearly test. Not a word is spent on building a resilient system and on how to test it. The old rule is that prevention is always better than restoration.
It is important to make systems resilient. This means evaluating all the risks: their probability of occurrence, their impact, and the possibility of anticipating them. That is, for each scenario it is necessary to consider its probability, its consequences if it occurs, and the possibility of anticipating its occurrence.
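The three-factor evaluation above can be sketched in a few lines of Python. The scenarios and their 1-to-10 ratings below are purely hypothetical, chosen only to illustrate how the three dimensions combine into a single priority ranking.

```python
# Hypothetical risk scenarios rated on the three dimensions discussed above:
# probability of occurrence, impact if it occurs, and detectability
# (1 = easy to anticipate, 10 = almost impossible to anticipate).
scenarios = [
    ("Wrong program copied to production", 5, 9, 6),
    ("ISAM file corruption",               3, 8, 5),
    ("Interface timeout to external system", 6, 5, 3),
]

def risk_score(probability: int, impact: int, detectability: int) -> int:
    """Combine the three factors into a single priority score."""
    return probability * impact * detectability

# Rank scenarios so mitigation effort goes to the highest scores first.
ranked = sorted(scenarios, key=lambda s: risk_score(*s[1:]), reverse=True)
for name, p, i, d in ranked:
    print(f"{risk_score(p, i, d):4d}  {name}")
```

The multiplicative combination mirrors the scheme used in industrial risk analysis; any monotonic combination of the three factors would serve the same ranking purpose.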
A method to measure the reliability of an application exists and can be borrowed from the industrial world, where managers are very concerned about the reliability of their products and processes in the various phases of their life cycle. This tool is FMEA (Failure Mode and Effects Analysis). Sometimes FMEA is extended to FMECA to indicate that a criticality analysis is performed as well.
Failure mode and effects analysis (FMEA) is one of the first systematic techniques for failure analysis. It was developed to analyze problems that might arise from malfunctions of military systems. An FMEA should also be the first step of a system resilience study. It involves reviewing as many components, assemblies, and subsystems as possible to identify failure modes and their causes and effects. For each component, the failure modes and their resulting effects on the rest of the system are recorded in a specific FMEA worksheet. An FMEA can be a qualitative analysis, but it may be put on a quantitative basis when mathematical failure rate models are combined with a statistical failure mode ratio database.
A successful FMEA activity helps to identify potential failure modes in ICT systems as well, based on experience with similar systems. Effects analysis refers to studying the consequences of those failures at different system levels. Functional analyses are needed as an input to determine the correct failure modes at all system levels, for both an application FMEA and a hardware FMEA. An FMEA is used to structure mitigation for risk reduction, based on reducing the severity of a failure mode's impact, on lowering the probability of failure, on improving the ability to anticipate the occurrence of a failure, or on all three. The FMEA is in principle a fully inductive (forward logic) analysis. The failure probability can only be estimated or reduced by understanding the failure mechanism; ideally, this probability should be lowered to "impossible to occur" by eliminating the root causes. It is therefore important to include in the FMEA an appropriate depth of information on the causes of failure (deductive analysis).
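The FMEA worksheet described above can be represented as a small data structure. This is a minimal sketch assuming the classic Risk Priority Number (RPN) scheme, severity × occurrence × detection; the components, failure modes, and ratings are illustrative assumptions, not data from the application in the story.

```python
from dataclasses import dataclass

@dataclass
class FmeaRow:
    """One row of an FMEA worksheet for an ICT application (illustrative)."""
    component: str       # program, file, or interface under analysis
    failure_mode: str    # how the component can fail
    effect: str          # consequence on the rest of the system
    cause: str           # root cause behind the failure mode
    severity: int        # 1 (negligible) .. 10 (catastrophic)
    occurrence: int      # 1 (rare) .. 10 (frequent)
    detection: int       # 1 (easily anticipated) .. 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: higher means mitigate first."""
        return self.severity * self.occurrence * self.detection

# Hypothetical worksheet entries for an application like the one described.
worksheet = [
    FmeaRow("Order batch program", "wrong version deployed",
            "system crash, service outage", "no deployment control",
            severity=9, occurrence=4, detection=7),
    FmeaRow("Customer ISAM file", "index corruption",
            "record lookups fail", "unclean shutdown",
            severity=8, occurrence=3, detection=5),
]

# Rank failure modes so mitigation effort targets the highest RPN first.
for row in sorted(worksheet, key=lambda r: r.rpn, reverse=True):
    print(f"RPN {row.rpn:4d}: {row.component} - {row.failure_mode}")
```

Note how the three RPN factors map directly onto the mitigation levers named above: severity reduction, probability reduction, and improved anticipation (detection).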
It is interesting to see that in the development of new applications ICT professionals do care about resilience, mainly from the security point of view: ICT uses "good" hackers to see whether they can penetrate the application, in the so-called pen tests. That is not full resilience, but at least ICT is moving in the right direction.
We started by blaming the poor programmer who copied the wrong file. We end up blaming programmers once again: this time, the ones who built the application and the architects and analysts who designed it.