Imagine you are the IT manager responsible for your customer's on-demand video service. The application has modules for purchasing subscriptions, authentication, content catalogs, user data and the content files themselves. Once authenticated, users can access this content from a multitude of devices anywhere in the world. The application uses a standard three-tier architecture that has been running fine in an IaaS model on the cloud. A recent event that led to a sudden spike in demand for VOD services has made the customer rethink the whole application architecture. The application designers propose a true cloud-native architecture built on Platform-as-a-Service (PaaS) components in the public cloud. The customer likes the idea, since it gives them the opportunity to add rich features to the application, enhance the user experience with analytics-driven recommendations, and align capacity and spend with demand. The proposed architecture will use services such as containers, an API gateway, NoSQL databases, Cognito, Lambda functions, RDS, Mobile Analytics, SQS, SNS and ElastiCache. You still have to manage the service; however, instead of well-established server-centric parameters, you now have event logs, file data and process information from which to build a map of your application's health.
This is now a common problem service providers face when customers opt for application transformation to serverless and microservices architectures. The distributed, message-driven and loosely coupled nature of cloud-native applications means the way we look at service management needs to move beyond the server-centric approach.
How is cloud monitoring different?
The cloud-native transformation of an application enables it to scale from a few container or Lambda instances to a few thousand and back within a short time, generates data that is a mix of structured and unstructured, and gives customers a true pay-per-use model. The platform services themselves are highly available and data is backed up in multiple places by default, so up/down status monitoring and traditional redundant architectures are not required. For the IT manager, this poses the problems of choosing which services and parameters to monitor, which tools to use, how to bring it all together in a unified application view, and how to measure and charge for the services delivered. Some of the challenges in implementing this approach are:
- Service management is built on the two pillars of availability and performance management. For a distributed application, these factors are a derivative of many independently running services that only share data with each other. Communication happens over message queues, event streams and subscriptions to topics. Operations teams need to build an understanding of this application architecture, of upstream versus downstream communication and of service interactions in order to generate the service map.
- Identifying the important services and deciding what to measure is the next big change. This needs to be driven by careful trend analysis of problem records to find the most error-prone components and performance bottlenecks. The data points collected from each service also need to be correlated with each other so that the true service status is revealed.
- The parameters to monitor are very different from those in IaaS environments. From CPU and memory utilization in the IaaS model, you move to metrics like 'NumberOfEmptyReceives' and 'ApproximateNumberOfMessagesDelayed'. Translating these parameters into metrics that accurately capture what affects availability and performance is the next challenge.
- The data collected is mostly in the form of logs or event dumps that need to be aggregated from various sources, cleaned up and analyzed to measure service performance. In the IaaS world, mature tools perform these tasks out of the box. For microservices, this requires coding, scripting and the integration of various tools to generate meaningful alerts and reports.
- Auto scaling poses its own challenges due to rapid scale-up/down and the need to manage it in an automated fashion. Any service monitoring and management solution in such an environment must be able to handle these fluctuations, even though the endpoints are agentless.
- The customer expectation of pay-per-use now also extends to how service providers charge for their services. The service management bill is expected to align with how services were consumed in the public cloud, not a flat per-RU-per-month model.
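To make the metric-translation challenge concrete, the sketch below turns two of the SQS CloudWatch metrics mentioned above ('ApproximateNumberOfMessagesDelayed' and 'NumberOfEmptyReceives') into a coarse health signal for a queue. The thresholds and labels are illustrative assumptions, not AWS guidance:

```python
def queue_health(delayed: int, empty_receives: int, receives: int,
                 delay_threshold: int = 100, idle_ratio: float = 0.9) -> str:
    """Derive a coarse health signal for an SQS-style queue.

    delayed        -- ApproximateNumberOfMessagesDelayed sample
    empty_receives -- NumberOfEmptyReceives over the period
    receives       -- total receive calls over the same period
    The thresholds are illustrative assumptions only.
    """
    if delayed > delay_threshold:
        return "backlog"   # consumers are falling behind the producers
    if receives and empty_receives / receives > idle_ratio:
        return "idle"      # pollers mostly find nothing: likely over-polling
    return "healthy"
```

A rule like this is what "translation" means in practice: the raw platform metric only becomes an availability or performance indicator once the operations team encodes what normal looks like for their workload.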
Service operations need to make some fundamental changes in approach to provide a true, integrated picture of the entire IT estate. This requires a rethink of our traditional ideas of application architecture and the functional value of each of its components.
Service model for the future
The traditional service delivery model needs to change to reflect the new reality of on-demand, fluctuating consumption spread over multiple independent services. The changes required are not limited to a new set of shiny tools for managers to acquire; rather, they cover the entire service management lifecycle.
People: The composition and skillset of the operations team needs to reflect the blurring boundary between application and infrastructure in the Infrastructure-as-Code (IaC) world by
- Inducting application developers and scripting experts to write code that analyzes logs, customizes alert scripts and integrates the disparate sources of information into a single repository
- Acquiring new skills and extensively retraining existing resources to build an understanding of the application and infrastructure components in the new architecture
- Automating routine administration tasks by using SNS topic subscriptions to trigger Lambda functions that run remediation code, and utilizing other microservices to take automated actions based on these alerts
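A minimal sketch of that SNS-plus-Lambda automation pattern: a handler subscribed to an alert topic dispatches each alert to a remediation action. The event shape follows the standard SNS-to-Lambda record format; the alert types and remediation strings are hypothetical placeholders for real cloud API calls:

```python
import json

# Hypothetical alert types mapped to remediation actions; a real handler
# would call cloud APIs here (scale a service, recycle a container, etc.).
REMEDIATIONS = {
    "QueueBacklog": lambda d: f"scale-out consumers for {d['queue']}",
    "CacheEvictionSpike": lambda d: f"resize cache cluster {d['cluster']}",
}

def handler(event, context=None):
    """Lambda entry point invoked via an SNS alert topic subscription."""
    actions = []
    for record in event["Records"]:
        alert = json.loads(record["Sns"]["Message"])
        remediation = REMEDIATIONS.get(alert["type"])
        actions.append(remediation(alert["detail"]) if remediation else "no-op")
    return actions
```

The value of the pattern is that the runbook lives in version-controlled code: adding a new automated response is a one-line change to the dispatch table rather than a new monitoring tool.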
Tools: The arsenal of tools needs to expand to cover services beyond servers and storage. Some examples:
- Tools for discovery of microservices (DNS-based, Route 53, Consul), log collection, analysis and alert generation. A popular model here is the open-source ELK (Elasticsearch, Logstash and Kibana) stack, which combines log collection, analysis and dashboards to provide service visibility.
- Tracing and tagging tools like Zipkin for tracing communication across the distributed architecture
- CloudWatch for AWS services in general and Prometheus for containers provide a wide range of data points on service performance. SNS and Kafka let services publish and subscribe to topics or read the event streams.
- Orchestration tools provided natively by CSPs, or multi-platform open-source tools like Rancher.
- Customizing existing dashboards and toolsets to assimilate data from different sources, coding custom rules, and scripting the integration of this data with existing platforms are a must for a unified view of the IT estate. Tools such as Kibana and Grafana provide some of these capabilities; however, some scripting and customization is still required for each new service or customer.
- In general, an alignment to open-source tooling and the ability to integrate data from different sources are required.
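To illustrate the log collection and analysis these tools perform, the function below does in miniature what an ELK pipeline does at scale: it parses structured (JSON-per-line) logs from many services, drops malformed lines, and rolls up error counts per service. The field names (`level`, `service`) are assumptions about the log schema:

```python
import json
from collections import Counter

def error_counts(log_lines):
    """Count ERROR-level events per service from JSON-per-line logs."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip malformed lines
        if entry.get("level") == "ERROR":
            counts[entry.get("service", "unknown")] += 1
    return counts
```

At production scale the parsing would be a Logstash pipeline and the counting an Elasticsearch aggregation, but the shape of the work, normalize, filter, correlate by service, is the same.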
Pricing: Aligning the service delivery price with service consumption is perhaps the biggest challenge for service providers. While the underlying management tool cost can be managed, since these tools are priced on the volume of data generated and processed, resource deployment cannot move linearly up or down with service consumption.
- Innovative thinking is needed on the part of the IT manager to measure and charge back for the services delivered. The model should use application-performance-centric metrics, such as the volume of requests processed in a given time period.
- The billing cycle and metrics should change in line with service consumption, which may not run for the complete cycle. This could mean moving to per-hour rates or a predefined percentage of the cloud consumption bill paid to the CSP.
- On the service delivery side, automating L1 to L1.5 monitoring and administration tasks, together with multi-skilled engineers who can be quickly redeployed to cover demand spikes, provides an optimal delivery model that translates into attractive price points for customers.
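One way to sketch a consumption-aligned chargeback that combines the two ideas above, a per-request component plus a predefined share of the CSP bill. The rates below are illustrative assumptions, not market prices:

```python
def monthly_charge(requests: int, csp_bill: float,
                   rate_per_1k: float = 0.05, pct_of_bill: float = 0.08) -> float:
    """Consumption-aligned service-management fee: a per-request component
    (volume of requests processed) plus a predefined percentage of the
    cloud consumption bill. All rates here are illustrative assumptions."""
    return round(requests / 1000 * rate_per_1k + csp_bill * pct_of_bill, 2)
```

Because both inputs scale with actual usage, the management fee tracks the demand curve: in a quiet month both the request volume and the CSP bill shrink, and the chargeback shrinks with them.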
The cloud services landscape, especially around PaaS, is still evolving. The rapid pace of innovation and the launch of new services make it even more complicated to track, evaluate and integrate management methods into the existing operations framework. To stay ahead, we need to bring the learnings of the CI/CD model into our operations framework, so we can quickly adapt our service offerings and delivery model in line with market demand.