In today’s world of accelerated digital transformation, remote work and businesses operating across multi-cloud and hybrid cloud environments, the complexity of the IT landscape is increasing exponentially, and that trend is set to continue. This rapid growth in scale and complexity is making IT operations more challenging, even for large high-tech companies. For example, Facebook lost about $65 million in revenue during a 5-hour outage. While the scale of operations varies by firm and industry, managing IT operations robustly and pre-empting and mitigating IT downtime are critical for brand reputation, revenue, customer experience and employee productivity.
Over the past few years, many companies have come to rely on ‘AIOps’ tools and platforms to monitor their IT operations, identify issues in real time and remediate them much faster than humans can. However, despite the abundance of AIOps technologies on the market, more than half of enterprise IT leaders confirmed that they have experienced an increase in IT downtime since March 2020. This indicates that IT outages and brownouts remain widespread, despite their adverse business repercussions. To address these outages, IT leaders should focus on five key areas to achieve resilient IT operations.
1. Single Unified View of Entire IT Landscape
According to a recent BCG survey of global CIOs, “51% of organizations are wrestling with more than five monitoring tools and only 7% of companies confirm they have a single view of the entire IT infrastructure”. This means that IT monitoring professionals must move from tool to tool to obtain a connected and holistic view of the landscape. The complexity is only going to grow as remote workloads, Internet of Things (IoT) traffic and cloud workloads increase and make efficient performance management difficult.
The true challenge for IT teams is gaining a comprehensive view of the complex, interconnected IT infrastructures in which the majority of businesses now operate. This calls for a single unified platform capable of monitoring the entire IT landscape.
2. Real-time is not Enough – Predictive Capabilities are a Must
Most AIOps platforms work by monitoring events that appear anomalous, processing them to determine a root cause via correlation, tracing the event through the system landscape to identify other affected systems, confirming the root cause and packaging this information to notify the relevant user. These platforms are triggered by events that they have been designed to classify as typical or unusual, or by breaches of threshold values. As a result, the platform depends on some real-time occurrence that signals the onset of a new issue.
AI-based predictive analytics capabilities make this process more pre-emptive, preventing critical service outages and keeping the mean time to failure (MTTF) under control. Furthermore, predictive analytics algorithms have improved significantly: they are more sensitive to trends, reduce the event noise generated by monitoring tools, minimize false positives and highlight actual risks.
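To make the contrast between reactive and predictive monitoring concrete, here is a minimal Python sketch. It is not how any particular AIOps product works: the metric name `disk_used_pct`, the 90% limit and the use of a plain least-squares trend line are all assumptions chosen for illustration, and real predictive analytics would use far richer models.

```python
from statistics import mean


def threshold_alert(samples, limit):
    """Reactive check: index of the first sample at or above the limit, else None."""
    for i, value in enumerate(samples):
        if value >= limit:
            return i
    return None


def fit_line(samples):
    """Ordinary least-squares slope and intercept of samples against their index."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(len(samples))
    x_bar = mean(xs)
    y_bar = mean(samples)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def steps_until_limit(samples, limit):
    """Predictive check: projected number of steps after the last sample
    until the fitted trend reaches the limit; None if the trend is flat or falling."""
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(samples) - 1))


# Hypothetical disk usage readings (percent), one per monitoring interval.
disk_used_pct = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0]

reactive = threshold_alert(disk_used_pct, limit=90.0)      # no alert yet
predicted = steps_until_limit(disk_used_pct, limit=90.0)   # intervals of warning
print(f"threshold alert index: {reactive}, projected intervals to limit: {predicted}")
```

In this toy data the threshold check stays silent because no reading has reached the limit, while the trend projection already gives the operations team an estimate of how much time remains.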
3. Reduce Time to React and Respond
Correlation is not causation. When something goes wrong in the IT infrastructure, the most crucial step is to determine the underlying reason so that remediation can happen more quickly. While correlation (if network ‘X’ goes down, it impacts service ‘Y’) is a useful way to understand how different components in the IT infrastructure behave, it is not a foolproof method. Correlation leaves all linked parameters open to chance, which may result in unintended consequences and further delays in restoring the infrastructure to an appropriate level of serviceability.
AI-based advanced causality algorithms remove the need for correlation or predetermined knowledge of the IT landscape to find the root cause. They examine how the components were interacting with each other in the period before the predicted risk. This gives a data-driven view of the root cause of the risk, irrespective of how the system is designed. The mean time to resolve (MTTR) comes down drastically when the system does not spend time tracing the architecture to understand the issue.
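The idea of looking at component interactions in the window before a risk, without any map of the architecture, can be illustrated with a deliberately simple lead-lag check. This is only a sketch under stated assumptions: it uses lagged cross-correlation, which is still a correlation-based heuristic and much cruder than the causality algorithms described above, and the metric names and sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_corr(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag]."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return pearson(leader[:-lag], follower[lag:])


def strongest_lead(metrics, lag=1):
    """Return (leader, follower, score) for the ordered pair whose past values
    best line up with the other metric's later values. No topology is needed."""
    best = None
    for leader, xs in metrics.items():
        for follower, ys in metrics.items():
            if leader == follower:
                continue
            score = abs(lagged_corr(xs, ys, lag))
            if best is None or score > best[2]:
                best = (leader, follower, score)
    return best


# Hypothetical readings from the window before a predicted risk.
window = {
    "network_latency_ms": [1, 1, 5, 1, 1, 6, 1, 1, 1, 7, 1, 1],
    "service_errors":     [0, 0, 0, 4, 0, 0, 5, 0, 0, 0, 6, 0],
}

leader, follower, score = strongest_lead(window, lag=1)
print(f"{leader} appears to lead {follower} (score {score:.2f})")
```

Here the latency spikes consistently precede the error spikes by one interval, so the check points at the network metric as the candidate to investigate first. A production system would need to rule out common causes and coincidence before calling this a root cause.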
4. Don’t Overlook Last-mile Observability
Full-stack observability is one of the most popular buzzwords in AIOps right now. As the IT landscape becomes more complex, an increasing number of AIOps platforms are consolidating their capabilities across domains such as servers, networks, applications and cloud to enable end-to-end visibility and gain an interconnected understanding of the entire landscape.
The piece of this jigsaw puzzle that has been overlooked is last-mile observability, which comprises the customer experience, the digital experience, device telemetry and endpoint data. 68% of CIOs say they face significant challenges when it comes to digital experience monitoring. Bridging this gap is critical for contextual visibility and for businesses to unlock the full potential of AIOps.
5. AIOps Interventions Should not be Constrained to Capacity Planning
IT systems regularly undergo planned changes, such as the addition of a new server, the introduction of a load balancer, the removal of a disk and so on. The site reliability engineer (SRE) may already be aware of the associated operational risks. The majority of AIOps platforms currently focus on understanding these interventions, predicting their consequences and preparing for them in advance. This should not, however, be confined to capacity planning alone.
By leveraging AI-based counterfactual simulations, companies can predict the effect on the system of a hypothetical change, such as downtime, a deployment or another activity. This enables businesses to visualize the impact of strategic decisions on the system and plan accordingly. AI-powered interventions help IT teams plan for the future and minimize the disruptions that have the greatest impact.
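A what-if question of this kind can be sketched with a toy model. The example below is an assumption-laden illustration, not an AI-based simulator: the `Cluster` fields, the even load-sharing assumption and all numbers are hypothetical, and a real counterfactual engine would learn system behavior from data rather than use a fixed formula.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cluster:
    servers: int                 # number of identical servers in the pool
    capacity_per_server: float   # requests per second each server can handle
    demand: float                # requests per second arriving at the pool


def utilization(cluster):
    """Fraction of total capacity in use, assuming load is shared evenly."""
    if cluster.servers <= 0 or cluster.capacity_per_server <= 0:
        raise ValueError("cluster must have positive capacity")
    return cluster.demand / (cluster.servers * cluster.capacity_per_server)


def what_if(cluster, **changes):
    """Compare the current state with a hypothetical one, leaving the original untouched."""
    scenario = replace(cluster, **changes)
    scenario_util = utilization(scenario)
    return {
        "baseline": utilization(cluster),
        "scenario": scenario_util,
        "overloaded": scenario_util > 1.0,
    }


# Hypothetical production pool.
current = Cluster(servers=4, capacity_per_server=500.0, demand=1500.0)

one_server_down = what_if(current, servers=3)
traffic_surge = what_if(current, demand=2400.0)
print(one_server_down)
print(traffic_surge)
```

Because `what_if` evaluates a copy of the state, a team can compare several hypothetical changes, such as planned downtime or a traffic surge after a deployment, against the same baseline before deciding which one needs mitigation.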
John Dsouza
Engineering Head, Wipro Holmes, Wipro Ltd.
John has over two decades of IT experience, including several years as an Enterprise Architect for many Wipro customers. He has extensive experience working on data engineering, artificial intelligence and machine learning solutions. Currently, he leads engineering for AI products focused on providing Resilient IT Operations.