In IT service management (ITSM), downtime can translate into significant losses for the company, so quickly resolving the root cause of incidents is critical to your business’s success. ITIL (Information Technology Infrastructure Library) Root Cause Analysis (RCA) is a systematic approach designed to uncover the underlying issues behind IT service disruptions. Its frameworks, methodologies, principles, and techniques center on the premise that it’s more effective to solve and systematically prevent issues (i.e., stop them from occurring again) than to put out each fire as it appears.
This blog post dives into the intricacies of ITIL RCA, its methodologies, and its significance in maintaining robust IT infrastructures.
Understanding ITIL Root Cause Analysis
At its core, ITIL RCA is a structured method used to determine the fundamental reasons behind incidents and problems within an IT environment. Unlike superficial fixes that merely address symptoms, RCA aims to prevent incidents from recurring, which improves overall system reliability.
The core of RCA centers on:
- Remedying the root cause of an IT issue, rather than just treating the symptoms for short-term relief
- Identifying how the issue can be prevented in the future
- Focusing on the how and why of the issue, not the who
- Finding concrete evidence to back up any root cause claims
- Providing the information needed to choose the best course of action to resolve the issue
3 Benefits of RCA in IT Service Management
- Preventive Maintenance: By identifying root causes, organizations can implement preventive measures to mitigate future incidents. This proactive approach minimizes downtime and boosts operational efficiency.
- Continuous Improvement: RCA fosters a culture of continuous improvement within IT operations. By analyzing past incidents, teams can implement corrective actions and refine processes, leading to enhanced service delivery and customer satisfaction.
- Cost Reduction: Resolving recurring incidents through RCA reduces the need for reactive support and emergency fixes, ultimately lowering operational costs and optimizing resource utilization.
3 ITIL RCA Methodologies
There are multiple well-known methodologies for conducting RCA. Below are 3 of the most popular methods and frameworks, all of which are used across various industries. Try each of them and see which one best fits your needs and preferences.
Fault Tree Analysis (FTA)
Fault Tree Analysis (FTA) is a top-down approach that starts from an undesired state of a system (a specific incident) and visually represents its potential causes. The technique was originally developed by H. Watson and A. Mearns at Bell Laboratories for the Air Force in 1962. It was later adopted by Boeing and is now used by companies in the aerospace, chemical, and software industries to analyze reliability. The undesired outcome is taken as the root (the top event) of the logic tree, and by systematically breaking that event down into contributing factors, FTA helps identify the root cause and its dependencies. The fault tree is typically drawn using logic gate symbols. The basic symbols used in FTA are events, gates, and transfer symbols.
FTA Event Symbols
- Basic event – failure or error in a system component or element
- External event – expected to occur
- Undeveloped event – an event for which insufficient information is available
- Conditioning event – conditions that restrict or affect logic gates
FTA Gate Symbols
- OR gate – the output occurs if any input occurs
- AND gate – the output occurs only if all inputs occur (the inputs are independent of one another)
- Exclusive OR gate – the output occurs if exactly one input occurs
- Priority AND gate – the output occurs only if the inputs occur in a specific sequence, which is specified by a conditioning event
- Inhibit gate – the expected output occurs if the input occurs, though only under an enabling condition specified by a conditioning event
FTA Transfer Symbols
The transfer symbols, “Transfer in” and “Transfer out,” connect the inputs and outputs of related fault trees, such as a subsystem’s tree and the larger system tree it belongs to.
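To make the gate definitions above concrete, here is a minimal Python sketch that evaluates a small fault tree. It covers only the OR, AND, and Exclusive OR gates, and the class names (`BasicEvent`, `Gate`), the `evaluate` function, and the example outage scenario are all hypothetical illustrations rather than part of ITIL or any FTA tool.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicEvent:
    """Leaf of the tree: a component failure that either occurred or did not."""
    name: str
    occurred: bool


@dataclass
class Gate:
    """Intermediate event that combines its inputs with a logic gate."""
    name: str
    kind: str  # "OR", "AND", or "XOR" (assumed labels for this sketch)
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def evaluate(node: Union[Gate, BasicEvent]) -> bool:
    """Return True if the event represented by `node` occurs."""
    if isinstance(node, BasicEvent):
        return node.occurred
    results = [evaluate(child) for child in node.inputs]
    if node.kind == "OR":    # output occurs if any input occurs
        return any(results)
    if node.kind == "AND":   # output occurs only if all inputs occur
        return all(results)
    if node.kind == "XOR":   # output occurs if exactly one input occurs
        return sum(results) == 1
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the service is down if the database is unreachable
# OR both redundant web servers have failed.
web_a = BasicEvent("web server A failed", occurred=True)
web_b = BasicEvent("web server B failed", occurred=False)
db = BasicEvent("database unreachable", occurred=False)

web_tier = Gate("web tier down", "AND", [web_a, web_b])
top_event = Gate("service outage", "OR", [web_tier, db])

print(evaluate(top_event))  # False: only one of the two web servers failed
```

Because the web tier sits behind an AND gate, a single server failure does not propagate to the top event; setting `web_b.occurred = True` would.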
5 Whys Technique
The 5 Whys Root Cause Analysis method is based on the idea of asking “why” multiple times to trace problems back to their origins. The technique encourages IT teams to delve beyond superficial explanations and uncover deeper underlying issues. It also helps you avoid assumptions and focus on what has actually occurred.
How to use it:
- Ask a “why” question about the problem, such as “Why does this happen within our software?” or “Why does our product do x instead of y?”
- For every answer to your WHY question, ask another, deeper “Ok, but WHY?” question.
TIP: A good way to think about this is to imagine you’re talking to a curious child who’s being slightly annoying and keeps asking you, “Why?” after you explain something to them. If you’re getting annoyed at the number of whys you’re asking, you’re on the right track. The more you ask “why” and uncover all the intricate parts of your IT infrastructure, the better you’ll be at finding issues and resolving them to improve your security and your product.
Example
| Question | Answer |
| --- | --- |
| Why is the application running slow for users? | The server hosting the application has high CPU usage. |
| Ok. Why is the CPU utilization so high? | There is a sudden surge in concurrent user logins. |
| And why is there a surge in user logins? | A new marketing campaign launched without IT input. |
| Why didn’t IT know about the campaign? | There’s a lack of communication between teams. |
| Ok, and why is communication lacking? | No formal process exists for project impact analysis. |
As you can see, this makes for a useful informal method to push teams to dig a little deeper than the initial symptoms to figure out what is going on. At first, it makes sense for technicians to deal with the high CPU usage, but without understanding why it is happening in the first place, they would never get around to resolving the actual problem, which in this case is the lack of a formal process for analyzing the impact of projects.
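The question-and-answer chain from the example can also be recorded as data, which makes it easy to store alongside an incident record. The following Python sketch is one possible representation; the names `chain`, `root_cause`, and `format_chain` are hypothetical, and treating the last answer as the root cause is an assumption that only holds if the team has actually asked enough whys.

```python
from typing import List, Tuple

# Each step pairs a "why" question with the answer the team found.
WhyChain = List[Tuple[str, str]]

chain: WhyChain = [
    ("Why is the application running slow for users?",
     "The server hosting the application has high CPU usage."),
    ("Why is the CPU utilization so high?",
     "There is a sudden surge in concurrent user logins."),
    ("Why is there a surge in user logins?",
     "A new marketing campaign launched without IT input."),
    ("Why didn't IT know about the campaign?",
     "There's a lack of communication between teams."),
    ("Why is communication lacking?",
     "No formal process exists for project impact analysis."),
]


def root_cause(chain: WhyChain) -> str:
    """Return the final answer in the chain, assumed to be the root cause."""
    if not chain:
        raise ValueError("the chain needs at least one question and answer")
    return chain[-1][1]


def format_chain(chain: WhyChain) -> str:
    """Render the chain as a numbered list for an incident report."""
    lines = []
    for depth, (question, answer) in enumerate(chain, start=1):
        lines.append(f"{depth}. {question}")
        lines.append(f"   -> {answer}")
    return "\n".join(lines)


print(format_chain(chain))
print("Root cause:", root_cause(chain))
```

Keeping the intermediate answers, not just the last one, preserves the evidence trail that links the symptom (slow application) to the process gap.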
Ishikawa (Fishbone) Diagram
The Ishikawa diagram, also known as a cause-and-effect diagram, categorizes potential causes of a problem into major groups, such as people, process, technology, and environment. This visual tool eases collaborative analysis and holistic problem-solving.
How to use it:
- Start with the problem at the head of the fish, at the end of the main horizontal line (the spine of the fish skeleton)
- Brainstorm several categories of causes (placed in off-shooting branches from the main line, the ribs of the fish)
- Group the categories and break them into smaller parts (e.g., “training” might be a potential root cause factor under “People”)
- Dig deeper into potential causes and sub-causes – question each branch to get closer to the root issue at hand
- Eliminate unrelated categories and identify correlated factors (i.e., root causes)
Common Categories to Include:
- Machine (equipment, technology)
- Man/mind power (physical or knowledge work)
- Mission (purpose, expectation)
- Management / money power (leadership)
- Product (or service)
- Price
- Process (systems)
- People
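A fishbone diagram is usually drawn, but its content can be captured as a simple mapping from category to candidate causes. The Python sketch below is a hypothetical illustration: the problem statement, the causes listed under each category, and the helper names `prune_empty` and `render` are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical fishbone content for a single problem statement.
problem = "Application is slow for users"
fishbone: Dict[str, List[str]] = {
    "People": ["No training on capacity planning"],
    "Process": ["No project impact analysis", "Marketing launches skip IT review"],
    "Machine": ["Single application server", "No autoscaling"],
    "Management": [],  # brainstormed, but no causes were found here
}


def prune_empty(diagram: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop categories that ended up with no candidate causes."""
    return {category: causes for category, causes in diagram.items() if causes}


def render(problem: str, diagram: Dict[str, List[str]]) -> str:
    """Render the diagram as an indented text outline."""
    lines = [f"Problem: {problem}"]
    for category, causes in diagram.items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


print(render(problem, prune_empty(fishbone)))
```

Pruning empty categories corresponds to the step of eliminating unrelated categories; deciding which remaining causes are actual root causes still requires evidence and discussion.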
How to Implement Effective RCA Practices
With effective RCA practices in place for your IT service management, you’ll be able to diagnose and address IT-related problems proactively, potentially saving your organization hundreds of thousands or even millions of dollars. The three best practices below give an overview of how to implement RCA successfully in your organization.
- Establish Clear Procedures: Define your company-wide standardized procedures for conducting IT root cause analysis. Be sure to outline roles and responsibilities within the RCA team and set up clear criteria for prioritizing incidents based on their impact and frequency.
- Encourage Collaboration: Foster open communication and knowledge sharing among RCA teams to gain diverse perspectives and insights.
- Document Findings: Record root cause analysis findings, including identified root causes and recommended actions, in a centralized knowledge base. This repository will serve as a valuable resource for future reference and will facilitate organizational learning.
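The criteria for prioritizing incidents by impact and frequency, mentioned under Establish Clear Procedures, could be sketched as a simple scoring rule. This Python example is only an illustration: the 1 to 5 impact scale, the impact-times-occurrences score, and the sample incidents are assumptions, and your organization may weight these factors differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    title: str
    impact: int       # assumed scale: 1 (minor) to 5 (severe)
    occurrences: int  # how often it happened in the review period


def priority_score(incident: Incident) -> int:
    """Assumed scoring rule: impact multiplied by frequency."""
    return incident.impact * incident.occurrences


def rank_for_rca(incidents: List[Incident]) -> List[Incident]:
    """Order incidents so the highest-scoring ones get analyzed first."""
    return sorted(incidents, key=priority_score, reverse=True)


# Hypothetical incidents from one review period.
incidents = [
    Incident("Payroll batch failure", impact=5, occurrences=2),
    Incident("Login page timeout", impact=3, occurrences=12),
    Incident("Printer queue stuck", impact=1, occurrences=20),
]

for incident in rank_for_rca(incidents):
    print(priority_score(incident), incident.title)
```

With these sample numbers, the frequent login timeout ranks first even though the payroll failure has a higher impact per occurrence, which is the kind of trade-off a written prioritization rule makes explicit.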
ITIL Root Cause Analysis is a cornerstone of effective IT service management, enabling organizations to diagnose and address underlying issues proactively. By adopting structured RCA methodologies and fostering a culture of continuous improvement, businesses can enhance operational resilience, reduce costs, and deliver superior services to their customers. Embracing RCA is not merely about resolving incidents; it’s about cultivating a mindset of problem-solving and innovation that drives long-term success in the ever-evolving landscape of IT operations.
Our 2024.1 product release includes updates to root cause analysis, digital accessibility, automated IT asset discovery, and enhanced AI capabilities. EV Discovery’s Discovery & Dependency Mapping (DDM) roadmap will help customers gain a 360-degree view of their IT landscape; automate asset and configuration management; track changes and maintain audit trails; and seamlessly integrate with EasyVista’s ITSM products. Additional dependency mapping features are expected to roll out later in 2024.