DevOps and Resilience
This final part of the series summarizes the role of resilience in the DevOps life cycle, which calls for a partnership between those who develop the systems and those who understand the implications of designing for resilience, and then wraps up the discussion overall.
While the basic premise of DevOps is getting new capability deployed faster, that's only part of the story. DevOps is the continuous transformation of the whole lifecycle to remove bottlenecks and inconsistencies, end-to-end, including Operations. DevOps implies (requires!) collaboration between Development and Operations so that new functionality is deployed both quickly and flawlessly, and meets the needs of the business the first time. Technology is helping lower the barriers between Application and Infrastructure, but it's going to take people collaborating across the IT supply chain to realize this vision. This highlights another aspect of the transformation needed to enable effective use of cloud. Skilled people who used to work in silos, including those with resilience expertise, are going to need to work continuously with Design, Development, and Deployment teams to realize the full value of The Cloud. The idea that resilience gets bolted on after the application has been developed simply won't work.
The good news is that it should get easier, not harder, as the realization of Software Defined Infrastructure moves closer and system design shifts from monolithic systems with a marked division between Application and Infrastructure to "thinner", vertically-integrated services, as illustrated in the graphic below. While there will be more moving parts to coordinate, each component should be smaller, less complex, infrastructure-independent, and able to control its own resilience.
What about re-purposing non-production environments for Recovery?
As part of their DevOps transformation, many organizations have been using cloud for development and testing for a while, so there is established practice in building, tearing down, and refreshing non-production environments over and over. Some are now considering repurposing parts of their development/test environment for recovery, reasoning that this would solve the problem of keeping production and non-production environments in sync, that the automation would already be tested, and that incompatibilities would have been ironed out.
Over the years many organizations have implemented a similar approach, long before cloud. It can work well, especially when the environment being repurposed is the one most closely aligned with production. Assuming you can afford to stop testing and deploying new functions while you're testing or executing recovery, and that the QA environment and process aren't critical to running the business, it can work. But even this seemingly safe concept has its considerations.
If you’ve leveraged The Cloud for development and testing, including pre-production, you may need to consider whether your cloud environment can meet the same stringent security, audit, and compliance requirements as your production environment. If the whole system, all environments, is in The Cloud, did you design them to the same specifications, or did you scrimp a little in non/pre-production to save costs?
Plan on needing in your recovery environment whatever facilities you need in production. While you might be able to get by with reduced scale, you're not likely to get away with cutting back in areas like security and compliance. For example, if your non-production environment relies on data redaction or obfuscation to be considered "safe", is that same environment also ready to support production workloads, or are there other limitations you overlooked because the data wasn't a risk? If you use encryption and/or dedicated infrastructure to support your database servers in production, do you have those same facilities available in non-production if you were planning to use that environment for recovery?
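One way to make these parity questions concrete is to track the controls each environment provides and diff them before declaring an environment recovery-ready. The sketch below is a minimal illustration of that idea; the control names and both environment inventories are hypothetical, not drawn from any particular tooling.

```python
# Minimal sketch: flag controls that production relies on but the
# candidate recovery (non-production) environment lacks.
# Control names and values here are hypothetical illustrations.

PRODUCTION = {
    "encryption_at_rest": True,
    "dedicated_db_hosts": True,
    "audit_logging": True,
    "data_redaction": False,   # real data; redaction not used in prod
}

RECOVERY_CANDIDATE = {
    "encryption_at_rest": False,  # skipped to save cost in non-prod
    "dedicated_db_hosts": False,
    "audit_logging": True,
    "data_redaction": True,       # built for test data, not prod data
}

def parity_gaps(prod: dict, recovery: dict) -> list[str]:
    """Return controls enabled in production but missing in recovery."""
    return sorted(k for k, v in prod.items() if v and not recovery.get(k))

gaps = parity_gaps(PRODUCTION, RECOVERY_CANDIDATE)
print(gaps)  # the controls to add before this environment can host production
```

A check like this is cheap to automate and run continuously, so drift between production and the environment you plan to fail over to gets caught long before an actual recovery event.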
Further, if part of your non-production environment (e.g., QA) is temporarily unavailable because you're using it for recovery, can you really afford to stop testing and deploying updates? What if one of those updates is a critical patch or mandated fix, perhaps even the one that caused the outage? What are the risks of not being able to test it before you drop it into what would then be production? And if you break the repurposed non-production environment that you are now using for recovery during an emergency, then what? Have you ever seen what happens when you put one emergency on top of another? Chaos ensues. It might be worth the risk, but know what you're getting into.
The great thing about Cloud is that you should be able to adjust things quickly and easily and, with the stipulations cited previously, get additional resources when you need them to implement your resilience strategy. With planning and automation you should be able to create fast and efficient recovery in The Cloud. Many do. Automation is one of the key ingredients in DevOps and in taking advantage of Cloud, and implementing resilience is no different. Do you have a spot on your roster for automation specialists? You're definitely going to need them.
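To make the automation point concrete: a recovery plan that lives as a script, rather than a binder, can be run as a drill on demand. The sketch below shows one simple shape for that, a runbook of idempotent steps executed in order, halting on the first failure. The step names and checks are hypothetical placeholders, not calls to any real provisioning API.

```python
# Minimal sketch of a scripted recovery runbook: each step is a
# function that mutates shared state and reports success; a failed
# step halts the drill and names itself. Steps here are placeholders.

def provision_compute(state):
    # would stand up compute in the recovery environment
    state["compute"] = "ready"
    return True

def restore_database(state):
    # would restore from the latest verified backup
    state["database"] = "restored"
    return True

def verify_application(state):
    # would run smoke tests against the recovered stack
    return state.get("compute") == "ready" and state.get("database") == "restored"

RUNBOOK = [provision_compute, restore_database, verify_application]

def run_recovery():
    """Execute the runbook; return (succeeded, name of failed step)."""
    state = {}
    for step in RUNBOOK:
        if not step(state):
            return False, step.__name__
    return True, None
```

Because the whole sequence is executable, the same code path serves both scheduled drills and a real recovery, which is exactly the "tested automation" benefit discussed above.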