Today, a growing number of organizations are adopting agile methodologies in their development lifecycle, transforming and accelerating the whole development process. The same cannot be said of operations. While development teams have the tools to support this accelerated agile process, operations teams often find it difficult to keep pace, resulting in a growing number of operational issues across the application landscape. The increasingly heterogeneous nature of environments, which range from modern cloud-native applications to legacy mainframe applications and everything in between, adds to these operational challenges. It is important to transform operations on a scale similar to development, so that landscapes are resilient, performant and robust, and IT operations as a whole become more “agile”. This article discusses how Site Reliability Engineering (SRE) can help achieve this symmetry.
What is Site Reliability Engineering?
Site Reliability Engineering (SRE) was conceptualized by Google more than a decade ago. Google wanted to build world-class solutions that are performant, resilient and robust. One challenge it saw was that while development could scale up quickly, the hand-off to operations was a problem, since operations was not ready to handle such a high frequency of change in the application landscape. Another issue was that the development team could push any code into operations, after which responsibility for managing the application passed to operations, with little or no accountability on the part of the development team. The Google team started to think of this operations problem as a “software problem” and realized that it needed a cross-functional team responsible for development and operations alike, which would solve the hand-off issue and improve efficiency and effectiveness. In essence, site reliability engineers (SREs) aim to spend as much of their time as possible on development, while using tools and solutions to proactively monitor, predict and prevent issues in operations and to automate most manual tasks, striving towards a state of “NoOps” or Zero Touch Ops.
Four Key Focus Areas for SRE
Embracing SRE as a part of operations involves four key areas of focus.
Figure 1: Four Key Focus Areas for SRE
1. Change
“Change” is one of the key areas from an SRE perspective, and it is important to understand how Artificial Intelligence (AI) can be applied to handle change better. In most environments, the majority of issues are change related, and it is often difficult to identify which change caused which issue. With the application of AI, a site reliability engineer can more readily determine whether an issue is related to a change, whether a change in an environment is authorized or unauthorized, and whether two environments (say, production and staging) are in sync. This enables the engineer to be more proactive and efficient, and to quickly resolve issues arising from change.
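The checks described above (are two environments in sync, and was a change authorized?) can be illustrated with a small sketch. This is a plain rule-based comparison, not an AI technique, and the configuration keys, change records and function names are hypothetical examples rather than part of any specific product.

```python
# Minimal sketch: detect drift between two environments and flag
# changes that have no matching approved change request.
# All names and data shapes here are illustrative assumptions.

MISSING = object()  # marks a key that is absent in one environment


def diff_environments(prod, staging):
    """Return {key: (prod_value, staging_value)} for every key that differs."""
    drift = {}
    for key in sorted(set(prod) | set(staging)):
        prod_value = prod.get(key, MISSING)
        staging_value = staging.get(key, MISSING)
        if prod_value != staging_value:
            drift[key] = (prod_value, staging_value)
    return drift


def unauthorized_changes(observed_changes, approved_change_ids):
    """Return observed changes whose change_id is not in the approved set."""
    approved = set(approved_change_ids)
    return [c for c in observed_changes if c["change_id"] not in approved]


if __name__ == "__main__":
    prod = {"app_version": "2.4.1", "pool_size": 20, "feature_x": True}
    staging = {"app_version": "2.5.0", "pool_size": 20}
    for key, (p, s) in diff_environments(prod, staging).items():
        print(key, "differs between production and staging")

    observed = [
        {"change_id": "CHG-101", "target": "pool_size"},
        {"change_id": "CHG-999", "target": "app_version"},
    ]
    for change in unauthorized_changes(observed, ["CHG-101"]):
        print("unauthorized:", change["change_id"])
```

In practice the snapshots would come from a configuration management database or deployment tooling, and a learned model might rank which drift is most likely to explain an incident; the sketch only shows the underlying comparison.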
2. Skills
An SRE, or a team of SREs, needs to understand the nuances of both development and operations to manage large, complex environments effectively. The typical skill sets for development and operations are different, so it is important for SREs to be cross-skilled in both areas. This enables them to manage the environments, adhere to the organization's processes and ensure the health of the landscape.
3. AI Ops
The application of AI in operations has become imperative in order to move away from the current reactive way of solving IT landscape issues and become more proactive, predictive and preventive, thereby improving the resilience and reliability of the overall environment across applications and infrastructure. The new “ops” has to be made as agile as the development process. The SRE needs tools to identify problems early, reduce noise, get clear root cause analysis, enable self-healing, collaborate, view the landscape through a single pane of glass and remediate any issue quickly using adaptive remediation solutions. The aim is for the SRE to get the right information at the right time, without much effort, while ensuring the health of the landscape. The existing solutions and tools in the landscape need to be revisited, re-configured or extended to support this “new” agile ops.
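As one concrete illustration of "reducing noise", the sketch below collapses repeated alerts for the same host and check into a single group when they arrive close together. It is a simple time-window rule, not a machine-learning model, and the alert fields (`host`, `check`, `ts`) and the 300-second window are assumptions made for the example.

```python
# Minimal sketch of alert noise reduction by time-window grouping.
# Alert fields and the default window are illustrative assumptions.

def group_alerts(alerts, window_seconds=300):
    """Collapse alerts with the same (host, check) into groups.

    An alert joins the most recent group for its key if it arrives
    within window_seconds of that group's last alert; otherwise it
    starts a new group.
    """
    groups = []
    latest_group = {}  # (host, check) -> most recent group for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["host"], alert["check"])
        group = latest_group.get(key)
        if group is not None and alert["ts"] - group["last_ts"] <= window_seconds:
            group["count"] += 1
            group["last_ts"] = alert["ts"]
        else:
            group = {
                "host": alert["host"],
                "check": alert["check"],
                "first_ts": alert["ts"],
                "last_ts": alert["ts"],
                "count": 1,
            }
            groups.append(group)
            latest_group[key] = group
    return groups


if __name__ == "__main__":
    raw = [
        {"host": "web-1", "check": "cpu", "ts": 0},
        {"host": "web-1", "check": "cpu", "ts": 60},
        {"host": "db-1", "check": "disk", "ts": 90},
        {"host": "web-1", "check": "cpu", "ts": 120},
        {"host": "web-1", "check": "cpu", "ts": 1000},
    ]
    for g in group_alerts(raw):
        print(g["host"], g["check"], "x", g["count"])
```

An AIOps tool would typically go further, for example by correlating alerts across related hosts, but the effect on the engineer is the same: fewer, more meaningful notifications.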
4. Automation
In operations today, support teams typically perform many repetitive, manual and mundane tasks on a daily basis. These tasks vary, ranging from support-related issues to daily, weekly and monthly reporting, and they take considerable time and effort, reducing the productivity of the support team. With the goal of moving towards Zero Touch Ops, where the aim is to spend zero effort on operations, these activities need to be automated. In fact, anything that is manual, repetitive and has the potential to be automated should be automated, resulting in greater productivity and efficiency for the SRE.
Overall, focusing on the above four key areas will help SREs manage the environment efficiently and adhere to stringent Service Level Agreements (SLAs) based on Service Level Indicators (SLIs) and Service Level Objectives (SLOs), which form the crux of Site Reliability Engineering.
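To make the relationship between an SLI and an SLO concrete, here is a minimal sketch that computes an availability-style SLI as the fraction of good events and compares it with an SLO target. It also reports how much of the allowed failure margin (often called the error budget in SRE practice) has been used. The function name, the event counts and the 99.9% target are assumptions chosen for illustration.

```python
# Minimal sketch: compute an SLI from event counts and check it
# against an SLO target. Names and numbers are illustrative.

def slo_report(good_events, total_events, slo_target):
    """Summarize SLI compliance for one measurement window.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")

    sli = good_events / total_events
    failed_events = total_events - good_events
    # Number of failures the SLO tolerates in this window.
    allowed_failures = total_events * (1 - slo_target)
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        # Fraction of the tolerated failures already used (above 1 = SLO breached).
        "budget_consumed": failed_events / allowed_failures,
    }


if __name__ == "__main__":
    report = slo_report(good_events=999_400, total_events=1_000_000, slo_target=0.999)
    print("SLI:", report["sli"])
    print("SLO met:", report["slo_met"])
    print("Budget consumed:", round(report["budget_consumed"], 2))
```

An SLA would then typically be set somewhat looser than the internal SLO, so that the team has warning before a contractual commitment is at risk.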
In summary, organizations will have to embark on this journey to embrace SRE or a similar approach in order to scale up their operations and to achieve the required level of agility in their IT landscape that can match up with their development lifecycle.
Nikhil Mehta
Lead Architect - HOLMES
Nikhil is the Product Owner of HOLMES AIOps in the "HOLMES for IT" product line. With 16+ years of industry experience, he has played a pivotal role in setting up and developing AI product lines on the Wipro HOLMES AI platform. He has successfully delivered AI-enabled IT transformations focused on responding to IT operations, improving performance and supporting growth. Nikhil has held various roles across multiple business functions, including Product Engineering, Product Management, Program Management and Service Delivery, with expertise in defining, building and deploying solutions across domains such as Logistics, Telecommunication, Supply Chain and Retail.