“Do not look where you fell but where you slipped.”
This proverb carries real wisdom. It means that to solve a problem you should look not only at where it shows up but also at where it started. We all face problems, and most of them cannot be avoided, but what can be avoided is the recurrence of the same problem. To prevent a recurrence, the reason the problem arose must be identified, yet the human tendency is to fix only its obvious effects. The method used to look for root causes is called Root Cause Analysis, popularly known as RCA.
This blog post covers the definition of root cause analysis, its principles, and the types of causes. Although the concept of RCA applies universally, I have written this post from the software engineering perspective only. Even so, it is written in such a way that someone with no software engineering background can understand it.
Software development is getting more and more complex. Many software projects fail because of improper requirements, and those that do succeed are not always perfect. There is always a need to fix errors in the software. Finding and fixing errors or faults in software is not an easy task, and it is especially challenging to find the errors that are critical to how the software functions. Very often software engineers look only for the errors that are obvious, and in most cases the steps taken to remove them are not enough to eradicate the problem. The obvious problems are removed, but the same problems return later to haunt everyone, including the designers and developers and, most importantly, the stakeholders.
What is needed is a way of eradicating problems so that a problem, once solved, does not occur again. This is normally accomplished through Root Cause Analysis (RCA). RCA is a method whose main focus is to identify the root causes behind an error or problem. It deals with determining what happened, why it happened, and how to reduce the possibility of the same kind of problem recurring. The method works on the belief that if problems are to be removed fully, you must concentrate on their root causes. Eliminating or correcting the root causes tends to solve the problem in the long run. Correcting only the obvious symptoms may solve the problem for now, but it is likely to happen again, and it could come back in an even worse form.
RCA is a process that is both iterative and reactive. By iterative I mean that RCA has to be repeated a number of times, as doing it once will normally not solve the problem fully and stop its recurrence. Hence RCA is a tool that works on the principle of continuous improvement. By reactive I mean that RCA is done after an event has occurred. Once the desired level of expertise with the technique has been achieved, the method becomes proactive, which means that RCA can then be used to forecast whether an error or problem is likely to occur. Such an approach helps prevent future occurrences of the same kind of problem.
Principles of RCA:-
1. The main aim of RCA is performance improvement by removing the root causes of a problem, as this is more effective than removing the obvious symptoms.
2. For RCA to be effective, a systematic approach must be followed. When conclusions are drawn about the causes, they must be documented, as this provides the evidence for each cause.
3. A given problem can have one or more root causes.
4. Causal relationships between the root cause (or causes) and the defined problem must be established for the RCA to be effective.
Three basic types of causes:-
- Physical Causes: Tangible or material items, such as a motherboard or hard disk, failed.
- Human Causes: The people working on the system did something wrong, such as entering the wrong values or not following the operating procedures properly.
- Organizational Causes: The policies or processes of the organization sometimes contribute to the problem. For example, no one was responsible for regularly checking the power supply to the database servers.
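
For readers who do write software, here is one way the three cause types could be recorded while documenting an analysis. This is only a minimal sketch: the names `CauseType`, `Cause`, and `group_by_type`, as well as the example incident, are my own assumptions for illustration and are not part of any standard RCA tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class CauseType(Enum):
    """The three basic cause types described above."""
    PHYSICAL = "physical"
    HUMAN = "human"
    ORGANIZATIONAL = "organizational"


@dataclass(frozen=True)
class Cause:
    """One documented cause (hypothetical record layout)."""
    description: str
    cause_type: CauseType


def group_by_type(causes):
    """Group cause descriptions by their cause type."""
    grouped = defaultdict(list)
    for cause in causes:
        grouped[cause.cause_type].append(cause.description)
    return dict(grouped)


# Hypothetical incident: a database server went down.
causes = [
    Cause("Hard disk on the database server failed", CauseType.PHYSICAL),
    Cause("Operator did not follow the operating procedure", CauseType.HUMAN),
    Cause("No one was responsible for checking the power supply",
          CauseType.ORGANIZATIONAL),
]

for cause_type, descriptions in group_by_type(causes).items():
    print(f"{cause_type.value}: {descriptions}")
```

Writing the causes down in a structured form like this supports the second principle (documenting conclusions), and grouping them by type makes it easier to see whether an incident was mainly physical, human, or organizational in origin.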