“Change” is one of the key areas from an SRE perspective, and it is important to understand how to apply Artificial Intelligence (AI) to handle it better. In most environments, the majority of issues are change related, and it is often difficult to identify which change caused which issue. With the application of AI, the site reliability engineer can more readily determine whether an issue is related to a change, whether a change in an environment is authorized or unauthorized, and whether two environments (say, production and staging) are in sync. This enables the engineer to be more proactive and efficient, and to quickly resolve issues arising from change.
An SRE, or a team of SREs, needs to understand the nuances of both development and operations to effectively manage large, complex environments. The typical skill sets for development and operations are different, so it is important for SREs to be cross-skilled in both areas. This enables them to manage the environments, adhere to the organization's processes and ensure the health of the landscape.
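To make the two checks mentioned above concrete (whether environments are in sync, and whether a change was authorized), here is a minimal sketch in Python. It is a plain rule-based comparison, not an AI technique; it only illustrates the kind of baseline signal an AI-assisted tool might build on. The function names, configuration keys and change identifiers are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def config_drift(
    prod: Dict[str, str], staging: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every key whose value differs between the two environments.

    A key that is missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(prod) | set(staging)):
        prod_value, staging_value = prod.get(key), staging.get(key)
        if prod_value != staging_value:
            drift[key] = (prod_value, staging_value)
    return drift


def unauthorized_changes(observed: Set[str], approved: Set[str]) -> Set[str]:
    """Return change IDs seen in the environment that have no approval record."""
    return observed - approved


# Hypothetical snapshots of two environments.
prod = {"app_version": "2.4.1", "db_pool_size": "50", "feature_x": "on"}
staging = {"app_version": "2.5.0", "db_pool_size": "50"}

print(config_drift(prod, staging))
print(unauthorized_changes({"CHG-1", "CHG-2"}, {"CHG-1"}))
```

An empty result from `config_drift` means the two snapshots agree on every key; any entry in the result is a candidate explanation when an issue appears in only one environment.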
3. AI Ops
The application of AI in operations has become imperative in order to move from the current reactive way of solving IT landscape issues to one that is more proactive, predictive and preventive, thereby improving the resilience and reliability of the overall environment across apps and infra. The new “ops” has to be made as agile as the development process. The SRE needs tools to identify problems early, reduce noise, get clear root cause analysis, enable self-healing, collaborate, view the whole landscape through a single pane of glass and remediate any issue quickly using adaptive remediation solutions. The aim is that the SRE gets the right information at the right time, without much effort, and can keep the landscape healthy. The existing solutions and tools in the landscape need to be revisited, reconfigured or extended to support this “new” agile ops.
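As one illustration of "reducing noise", the sketch below suppresses repeated alerts from the same source within a time window. This is a simple deterministic rule, offered as an assumption about what a first noise-reduction step could look like; real AIOps tools typically add correlation and learned models on top. The `Alert` type, its fields and the five-minute window are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # host or service that raised the alert
    message: str


def deduplicate(alerts: List[Alert], window_seconds: float = 300.0) -> List[Alert]:
    """Keep an alert only if no identical alert was kept within the window.

    Two alerts are "identical" when source and message match. The window is
    measured from the last alert that was kept, not the last one seen.
    """
    last_kept: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.message)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


raw_alerts = [
    Alert(0.0, "db-01", "disk usage high"),
    Alert(30.0, "web-01", "latency high"),
    Alert(60.0, "db-01", "disk usage high"),   # repeat inside the window
    Alert(400.0, "db-01", "disk usage high"),  # repeat after the window
]
for alert in deduplicate(raw_alerts):
    print(alert)
```

Measuring the window from the last kept alert means a continuously firing alert is surfaced once per window rather than silenced indefinitely.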
In operations today, support teams typically perform many repetitive, manual and mundane tasks on a daily basis. These tasks are varied and can range from support-related issues to daily, weekly and monthly reporting, all of which take a lot of time and effort and reduce the productivity of the support team. With the goal of moving towards Zero Touch Ops, in which zero effort is spent on operations, these activities need to be automated. In fact, anything that is manual, repetitive and has the potential to be automated should be automated, resulting in greater productivity and efficiency for the SRE.
Overall, focusing on the above four key areas will help SREs manage the environment efficiently and adhere to stringent Service Level Agreements (SLA) based on Service Level Indicators (SLI) and Service Level Objectives (SLO), which form the crux of Site Reliability Engineering.
In summary, organizations will have to embark on this journey and embrace SRE or a similar approach in order to scale up their operations and achieve a level of agility in their IT landscape that matches their development lifecycle.
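The relationship between an SLI and an SLO can be shown with a short worked example. The sketch below assumes a request-based availability SLI (good requests divided by total requests) and computes how much of the error budget implied by the SLO is still unspent. The function names and the figures used are illustrative assumptions, not values from this article.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI as the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(
    good_requests: int, total_requests: int, slo_target: float
) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of failed requests the SLO tolerates:
    (1 - slo_target) * total_requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


# Hypothetical month: 1,000,000 requests, 400 failures, SLO of 99.9%.
sli = availability_sli(999_600, 1_000_000)
remaining = error_budget_remaining(999_600, 1_000_000, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

In this example the SLO tolerates roughly 1,000 failed requests, so 400 failures consume about 40% of the budget and leave about 60%. A negative result would mean the SLO has been breached for the period.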