When an IT engineer pulled a wrong plug in the data center of a leading airline, little did he know that it would leave 75,000 passengers stranded, hundreds of flights cancelled and the airline would incur a loss worth more than $100 Million. How did it all start? who asked the engineer to pull the plug? Did he do an impact analysis? Why did the command center take more than a day to recognize the problem, take action and restore service? Could they have done any better? Were they prepared to deal with a human error like this?
The incident relates to a common fact that many organizations still run multiple command centers in silos with minimal information exchange. Some examples of silo command centers are network command center, applications command center, facilities command center, business command center, cloud command center, and datacenter command center. Each command center brings with it its own service owners, its own tools/processes, its own technologies and seldom can these interact and exchange data with each other.
This leads to a potential situation where a network command center will never know about a power outage in a data center, which can potentially affect the availability of servers, applications, and at the end of it all, the business. In the incident referred above, while the airline could not have prevented the human error, it would have taken much lesser time to find the root cause of the incident and fix the problem if the multiple command centers were integrated.
How can we address such problems?
Today data is flowing across the ecosystem at the speed of light. Sensors, mobiles, smart watches, smart buildings, air conditioners, servers and applications are all sending user data, consumer data, and device data. Collecting this huge amount of data, available across multiple sources and within command centers, at one place, analyzing it, processing the meaningful data and generating insights for the command center is the biggest challenge.
Companies of the future will be able to collect data such as consumer data, business data, IT data, facilities data, and marketing data; interpret the data, find anomalies and patterns, perform data correlation, and present it back to a single command center (which can have specialized groups such as network command center). This will improve the ability to predict outages, disruptions and take corrective decisions quickly and intelligently.
Building the next-gen command center
To enable a more intelligent next-gen global command center, organizations need to rethink the strategy around how they conceptualize command centers. Here are some of the building blocks that will frame the command center of the future (see Figure 1):
Data aggregation layer – This layer will comprise tools and technologies that will collect data from multiple sources like IoT devices, applications logs, security threats, cloud alerts, facilities alerts.
Event correlation layer – This layer will have the capability to process the data, filter the anomalies, remediate some of the common problems, contextualize the remaining alerts to a business application and assign it to the best available teams who can solve the problem quickly through cognitive capability.
Prediction layer – Artificial Intelligence will determine anomalies in the behavior of IT estate and pattern of faults to predict future outages and performance issues in this layer.
Service management layer – This layer will base on IT4IT reference architecture. Key aspects like Self Service, Service Request Management, Incident Management, Change Management, Configuration Management and Knowledge Management will continue to drive command centers.
A command center will have a 360º view of applications and all underlying components that potentially impact service availability.
Automation layer – This layer will take quick actions on the intelligence provided by the event correlation layer. Whether it is to execute a remediation script, to generate an outage alert to the Major Incident Management team to initiate quickly a fallback strategy, the automation layer will take actions that a normal human would do.
Cost management layer – This layer will manage the cost impact of changes in an enterprise. For example, how much does it cost to add or remove a server on cloud? What cost impact will it have on the business? What budgets are required to manage these changes?
Audit and compliance layer – This layer will ensure that the audit of the enterprise meets the objectives of the business. For example: audit rules will be setup to ensure IT/Facilities make no changes during critical business hours.
Intent based policy and configurations – Encouraged by ‘intent based networking’, companies will define intents within the command center instead of focusing on silo availability of infrastructure or applications. For example, a company will define an intent as, “ensure the transaction response on the web application does not go above two milliseconds.” For this intent, policies on all impacted underlying components like applications, servers, databases, networks would be defined. Also, search rules to learn about the user experience on the transaction page and interactions on social media would be defined. These policies will be irrespective of which tool is used to monitor and manage the enterprise. Through these building blocks, the companies will be able to achieve a command center that provides the following capabilities: