CIOs today are challenged with meeting business expectations to perform on par with consumer-facing technology leaders like Facebook and Netflix, delivering always-available, low-latency services. Of course, these expectations are not aligned with the realities of managing complex hybrid enterprise systems that were never designed to deliver this type of performance.
The new-generation hybrid enterprise has to deliver the functionality of, and operate as, a real-time infrastructure. It is built on microservices and cloud-native platforms that are designed with the expectation that individual components will fail. This is what IT Operations is tasked with supporting.
Digital enterprises that use such hybrid environments and must respond in real time need to achieve incident avoidance, rather than incident management.
But most operations command centers today struggle to keep up with the required pace. Very few were built from the ground up for real-time operations, which is why new-age platforms like Facebook and Netflix are in a different league altogether. Severity 1 incident resolution still runs for hours, with a large group of SMEs on bridge calls, no root cause in sight, and each participant merely looking to avoid blame. The usual culprits are the lack of a proper CMDB, insufficient logging, or missing application discovery and dependency mapping. Every IT professional recognizes these failings, yet projects to correct them never seem to clear the capital-appropriation business case hurdles.
A new approach is needed. We propose a set of design principles for an operationally simple, scalable, and focused solution that brings IT and business process owners to a common understanding.
Design principle #1: Simplicity is key—focus on progress, not perfection.
To ensure simplicity, start with a clear understanding of the services, transactions, and functions involved. Conventional wisdom stipulates that we start with a robust and expensive infrastructure- and application-monitoring stack that gathers significant amounts of system and application data. However, when we cast the net this wide, we fail to consider that this data is in itself irrelevant unless and until there is a disruption in a business service.
Similarly, striving for progress rather than perfection means that we accelerate our rate of learning rather than seek to perfectly maintain every system or subsystem.
The first design principle is to capture the business service heartbeat. In a clinical sense, we seek to uncover the smallest atomic transaction or service component. Building our monitoring on this foundation allows us to circumvent complex and often expensive CMDB and ADDM projects. Depending on the industry, this heartbeat might be customer orders, service requests, or vendor purchase orders. All other business processes are subordinated to this heartbeat.
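To make the idea concrete, here is a minimal sketch of a heartbeat monitor that watches the flow of the atomic business transaction itself, say completed customer orders, rather than low-level system metrics. The class name, window size, and healthy-rate floor are all hypothetical choices for illustration, not a reference to any specific tool:

```python
import time
from collections import deque

class BusinessHeartbeat:
    """Track the rate of an atomic business transaction (e.g. completed
    customer orders) and flag when the heartbeat weakens, independently
    of any infrastructure-level monitoring."""

    def __init__(self, window_seconds=300, min_per_window=10):
        self.window = window_seconds          # look-back window in seconds
        self.min_per_window = min_per_window  # healthy floor for the window
        self.events = deque()                 # timestamps of observed transactions

    def record(self, timestamp=None):
        """Call once per completed business transaction."""
        self.events.append(timestamp if timestamp is not None else time.time())

    def is_healthy(self, now=None):
        """True if enough transactions completed within the look-back window."""
        now = now if now is not None else time.time()
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()             # drop events outside the window
        return len(self.events) >= self.min_per_window
```

In use, each completed order (or service request, or purchase order) calls `record()`, and a single `is_healthy()` check answers the only question that matters at the business-service level: is the heartbeat still there?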
Design principle #2: Shift from monitoring events to raw wire data (from thresholds to live streamed data).
As operations experts, we spend much of our careers delivering complex monitoring systems: installing and configuring multiple tools, instrumenting applications, and defining logging and data-retention policies for each tool. We then struggle to perform the real magic of root cause identification and analysis, and to configure alerts based on predefined thresholds.
Fundamentally, what we want to measure is dynamic, but how we measure it remains static. This needs to change.
Our second design principle defines the way to change this, looking to streaming analytics-based approaches as shown in the figure below.
We should take the heartbeat of critical business processes and provide real-time analytics that change the way our knowledge workers approach problem solving. Our static, threshold-based models have created a generation of IT professionals who wait for the incident rather than anticipate problems. This premise needs to change.
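As a sketch of what a streaming, threshold-free approach might look like (the class and its parameters are illustrative assumptions, not a specific product's API), the snippet below learns a running baseline for a live metric using Welford's online algorithm and flags values that drift well outside it, so the alerting adapts as the measured process changes instead of relying on a fixed, hand-configured threshold:

```python
import math

class StreamingBaseline:
    """Maintain a running mean and variance over a live metric stream
    (Welford's online algorithm) and flag values that deviate far from
    the learned baseline, rather than from a static threshold."""

    def __init__(self, k_sigma=3.0, warmup=10):
        self.k = k_sigma      # deviations (in std devs) that count as anomalous
        self.warmup = warmup  # observations required before alerting starts
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations from the mean

    def observe(self, x):
        """Feed one sample from the stream; return True if it looks anomalous."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) > self.k * std
        # Welford update: O(1) memory per metric, no batch recomputation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous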