A holistic approach to reliability should incorporate reliability principles in the DevOps pipeline to form ‘DevRelOps’.
An evolving business landscape, rising customer expectations and the ‘new normal’ created by the COVID-19 pandemic make time-to-market a critical parameter for business success today. While IT has always been business-critical, it is also mission-critical and life-critical in some industry contexts. With that in mind, the DevOps mindset has focused on feature velocity, bringing in new features in shorter cycle times. However, as organizations work to improve the resiliency and security of their IT systems, the focus on reliability cannot be diminished. The high cost of downtime has become evident in recent years. For instance, in March 2019, a 14-hour outage cost a social media company $90 million.
The past decade has seen a tremendous transformation in the IT landscape with the evolution of application and infrastructure technologies leveraging cloud, Internet of Things (IoT), artificial intelligence (AI) and so on. This has given rise to a situation where distributed systems have become the norm. While every organization strives to provide the best possible services to its customers, many a time, the misalignment between business and IT results in a disconnect on the specific service level objectives.
With digital at the core of every business, service reliability plays a crucial part in meeting the expected business outcomes. While the aspiration may be to achieve 100% reliability, that is an almost impossible feat, as even five-nines reliability (99.999%) is enormously expensive to provide.
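To make the five-nines figure concrete, the short sketch below converts an availability target into the downtime it permits per year. The function name and the 365-day year are illustrative assumptions, not part of any Wipro tooling.

```python
# Illustrative sketch: how much downtime per year a given availability target allows.
# Assumes a 365-day year; the helper name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime allowance, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

At 99.999%, the allowance works out to a little over five minutes per year, which is why each additional nine tends to be so costly.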
SRE: An enhanced approach to meeting reliability requirements
| Key challenges faced by enterprises | How this is addressed by SRE |
|---|---|
| Trust is the bedrock of businesses. Unreliable services can leave a permanent dent in the minds of customers, and specific actions need to be taken to improve reliability. | Protect brand reputation and compliance, and improve business trust. |
| While enterprises have elaborate SLAs in place, the end user experience is not always in sync with the promises made in the SLA. Reasons could include not taking a service-level objective (SLO) driven approach for providing services, organizational silos etc. | Improved user experience with an SLO-driven methodology. Defining the right SLOs is critical to establishing benchmarks and setting the right expectations for stakeholders. |
| Dashboards with ‘watermelon’ metrics provide diminished value, as the problems experienced by the users in the IT systems are not visible on traditional monitoring platforms. | Reduce mean time to detect (MTTD) and mean time to respond (MTTR) with observability. Observability helps in enabling the right logs, traces and metrics; as a feedback loop, developers instrument the systems on an ongoing basis. |
| Around 50% of the time and effort of IT operations is spent on manual activities, resulting in loss of productivity. Manual interventions can also lead to considerable delay and inconsistency. | With the emphasis on applying software engineering principles to IT operations, cycle time reduction with automation is an invaluable benefit. |
| IT is inundated with requests for new releases and bug fixes. However, there is an ongoing need to balance the priorities for new features versus the ones needed for establishing resiliency. | Establish the right business priorities. |
Wipro’s SRE Portfolio
Wipro’s Cloud Infrastructure Services focus on providing end-to-end services encompassing consulting, transformation, migration, and operational services across cloud environments in several industry verticals.
Wipro envisaged the need for SRE in the context of next-gen IT operations early on and has created a portfolio in this area. We have seen a tremendous uptick in SRE pilots across many of our clients, at different levels of adoption maturity. In 2019, Wipro created an SRE team. The team was chartered to deliver SRE services across all life cycle phases (consulting, design, build, and operations) and resource domains (people, process, and technology). The portfolio includes consulting and technical services to help customers define, plan, prepare the organization for, implement, and operate an SRE practice that meets their specific needs. What benefits customers most is the fact that Wipro is an end-to-end player in the enterprise IT space and has extensive capabilities in incorporating SRE practices at the business, development and operations levels.
To ensure a strong and focused approach, Wipro brings in a multi-workstream engagement model comprising several major service lines (cloud and infrastructure services, digital application services, cybersecurity risk services, and analytics) working in tandem with our business verticals such as healthcare, telecom, and utilities.
Wipro has around 20 different services in the SRE service catalog. While some of them enable clients to jumpstart SRE, others focus on specific dimensions of SRE such as automation, user experience, architectural design patterns, and service level objectives.
Illustrative set of SRE services
a) The Wipro Jumpstart SRE enables customers to put an SRE practice, Center of Excellence (CoE), mindset, and culture in place with reliability at the core. It starts with understanding the overall business context and criticality, establishing the right priorities, conducting a maturity assessment, and creating a high-level plan for establishing an envisioned target state operating model. Depending on the specific needs of the customer, the focus areas would encompass understanding the service criticality for the overall business, establishing the right service level objectives with value stream mapping, arriving at the right governance model, and inculcating the SRE spirit and the SRE way of working across the organization. To build on this, we collaborate with client stakeholders to evaluate progress and chalk out the next steps, identifying the specific value streams to be taken up to arrive at the target state operating model.
b) SRE observability and monitoring: Wipro’s approach for SRE services is built on an ecosystem supported by two operating models:
Wipro’s strategy is to combine a top-down and a grassroots approach to tie the effectiveness of IT system operations to business outcomes. Towards that end, we establish the right KPIs and SLIs to be monitored from a user experience standpoint. The right kind of observability setup helps us monitor the required SLOs and also helps reduce MTTD and MTTR.
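As a rough illustration of how MTTD and MTTR might be tracked, the sketch below averages the detection and response delays over a set of incident records. The `Incident` record, its field names and the sample timestamps are all hypothetical, and MTTR is assumed here to run from detection to first response.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started_at: datetime    # when the user-facing impact began
    detected_at: datetime   # when monitoring (or a person) noticed it
    responded_at: datetime  # when someone started working on it


def mttd_minutes(incidents):
    """Mean time to detect: average of (detected_at - started_at), in minutes."""
    return mean([(i.detected_at - i.started_at).total_seconds() for i in incidents]) / 60


def mttr_minutes(incidents):
    """Mean time to respond: average of (responded_at - detected_at), in minutes."""
    return mean([(i.responded_at - i.detected_at).total_seconds() for i in incidents]) / 60


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2021, 5, 1, 10, 0), datetime(2021, 5, 1, 10, 12), datetime(2021, 5, 1, 10, 20)),
    Incident(datetime(2021, 5, 3, 14, 0), datetime(2021, 5, 3, 14, 4), datetime(2021, 5, 3, 14, 30)),
]

print(f"MTTD: {mttd_minutes(incidents):.1f} min, MTTR: {mttr_minutes(incidents):.1f} min")
```

Tracking both numbers over time gives a simple check on whether an observability setup is actually shortening detection and response.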
| SRE value stream | | Area of focus |
|---|---|---|
| Governance model | Target state operating model definition management. | Engage with customer’s corporate reliability team OR establish the same. Connect with the CXO to determine the need for embedding SRE in all areas under consideration |
| DevRelOps | Reliability checkpoints at every stage of the pipeline | Application development and DevOps. Embed reliability as a critical parameter in every checkpoint |
| SRE led automation | Cycle time reduction | Toil estimation effort |
| Monitoring & observability | Reduction in MTTD and MTTR, customized and contextualized dashboards | Achieve critical business insights and improve IT operations in incident management and blameless post mortem |
| SRE design patterns | Architectural resiliency | Review critical design patterns such as load shedding |
| Enterprise digital operations center | SLO compliance and error budget estimations for continual feedback | Managed services |
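The SLO compliance and error budget estimation mentioned in the table can be sketched with simple arithmetic. The example below assumes a request-based SLI (the share of successful requests); the function name, the 99.9% target and the request counts are illustrative assumptions rather than figures from any engagement.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize SLO compliance and error budget use for one measurement window.

    slo_target is a fraction strictly between 0 and 1 (for example 0.999).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    sli = 1 - failed_requests / total_requests          # observed success ratio
    allowed_failures = (1 - slo_target) * total_requests  # the error budget
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Made-up numbers: a 99.9% SLO over one million requests, with 400 failures.
report = error_budget_report(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

In this made-up window, roughly 40% of the budget is consumed. One common way of using such a figure is to slow feature releases in favor of reliability work as the budget approaches exhaustion, which provides the kind of continual feedback the table refers to.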
Let’s take a look at how two organizations benefited from engaging with Wipro for SRE:
If you are interested in learning how Wipro is helping our clients to achieve their vision with a strong and focused approach to site reliability engineering, we should talk. Contact us here.
*Gartner, “Emerging Technologies: Site Reliability Engineering Delivered as a Service”, Craig Lowery, Brandon Medford, George Spafford, April 13, 2021.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Santanu Kumar Patro
General Manager & Practice Head - SRE, Cloud Advisory and Consulting
Santanu Patro has about 20 years of experience in IT and close to 15 years of experience in Advisory & Consulting. A techno-evangelist, he has been developing future-ready solutions and services to meet the ever-changing requirements of enterprises, and leading a team of IT therapists to solve enterprise challenges.
Shyam Venkat
Managing Consultant, CIS
Shyam Venkat is a seasoned IT professional with more than two decades of experience in the field, specializing in cloud infrastructure, network transformation, IT strategy, service management, site reliability engineering, and cybersecurity. Based in Dallas (TX), he has performed multiple leadership roles in technology management, consulting, service delivery and operations for multiple clients across different industry verticals.