Understanding Microservice Failures

Rahul Agarwal
Feb 27, 2021

“Software does not fail” is a provocative statement, and your initial reaction may be to dismiss it. But look deeper: unlike hardware components, which can fail randomly, software “failure” is never random (see Dr. Nancy Leveson's Engineering a Safer World talk on YouTube). Yes, there are hard-to-reproduce issues, but software executes exactly as coded. What fails are the assumptions, requirements, and unhandled conditions that went into writing that code. For example, the recent AWS us-east-1 incident involved no hardware or software component failure; rather, it arose from the complexity of interactions between components and the assumptions built into the software. The traditional way to improve is to enumerate test cases for all anticipated errors and undesired behaviors and ensure each is handled appropriately; assuming no regressions, any brand-new unexpected scenarios that occur must then be understood and addressed.

Additionally, many incidents are attributed to human error (for example, the AWS S3 incident, where “one of the inputs to the command was entered incorrectly”), which prompts “processes” that marginalize the human operator through automation, training, and onerous steps that may even be counterproductive. It gets worse if you have worked in regulated industries or AWS GovCloud. From Dr. Leveson:

Human error is a symptom of a system that needs to be redesigned.

So how can we understand and redesign our systems? Given the “unlimited complexity of software systems,” this is a very hard problem, and we should consider some additional approaches. I will discuss STAMP (System Theoretic Accident Model and Processes) and STPA (System Theoretic Process Analysis), and add to what Adrian Cockcroft has proposed, in the context of a typical container-based microservices architecture.

This article assumes a general working knowledge of, if not direct experience with, operating a microservices environment, including SRE principles (see Google SRE) and resiliency principles (see the Netflix blog). Also for reference: Managing Failure Modes, an excellent collection of curated STAMP links, and Adrian Cockcroft at re:Invent 2020. Additionally, Appendix F in the STPA Handbook has a good description of the concepts, which I found useful since this is all new for me.

Safety and SLOs

Systems theory was developed for systems that exhibit “organized complexity”; its goal is to understand the entire system as a whole, not just the individual components that constitute it. Reliability is a property of a component: each component may be reliable and behave as expected, yet the system composed from them may still exhibit undesired behaviors. The fact that system-level outcomes may differ from component-level ones, and cannot be understood by examining components in isolation, is what makes such behavior an “emergent property” of the composed components. Any undesired behavior is considered a “loss,” and the absence of loss indicates a “safe” system. Safety is therefore an emergent property in systems theory. For example, the reliability of the JDK Math package is well defined, but a system using it may not be safe. Similarly, the reliability of S3 is well defined, but a backup system built on S3 may not be safe.

System and components (C1–C5 with interactions between them). STPA Handbook, N. Leveson, J. Thomas

For our purposes, we will consider failing to meet the defined Service Level Objectives (SLOs) as a loss; the web service is therefore “safe” when its SLOs are met (see SLOs in Google SRE).
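To make this concrete, here is a minimal sketch (the `SLO` type, thresholds, and numbers are illustrative assumptions, not from any particular SRE library) of treating an SLO as the safety constraint: the system is “safe” while the objective holds, and a breach is a loss.

```go
package main

import "fmt"

// SLO defines a Service Level Objective as a target ratio of
// successful requests over a rolling window.
type SLO struct {
	Name   string
	Target float64 // e.g. 0.999 means 99.9% availability
}

// Met reports whether the observed success ratio satisfies the SLO.
// In systems-theory terms, returning false is a "loss": the system
// is no longer "safe" even if every component behaved reliably.
func (s SLO) Met(success, total int) bool {
	if total == 0 {
		return true // no traffic, no violation
	}
	return float64(success)/float64(total) >= s.Target
}

func main() {
	slo := SLO{Name: "availability", Target: 0.999}
	// 10 failures out of 100,000 requests: 99.99% >= 99.9%, safe.
	fmt.Println(slo.Met(99990, 100000)) // true
	// 500 failures out of 100,000: 99.5% < 99.9%, a loss.
	fmt.Println(slo.Met(99500, 100000)) // false
}
```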

The Control Loop

Continuing with systems theory: the emergent properties that arise from component interactions must be controlled to meet our objectives, so we need a “controller” that can effect change in a “controlled process.” The controlled process provides feedback that the controller uses to monitor it; based on its beliefs (the process model), the controller applies control actions, as determined by its control algorithm, to bring the controlled process to a desired state. This is represented below.

Control Loop. J. Thomas
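As a rough sketch of this loop in code (all names here are illustrative assumptions): read feedback, update the process model (the controller's beliefs), run the control algorithm, and issue a control action.

```go
package main

import (
	"fmt"
	"time"
)

// Feedback is what the controlled process reports back.
type Feedback struct{ Observed int }

// ControlAction is what the controller applies to the process.
type ControlAction struct{ Delta int }

// Controller holds a process model (its *belief* about the process)
// and a control algorithm that compares belief to a desired state.
type Controller struct {
	Desired int
	Belief  int // process model: may lag or diverge from reality
}

// Step is one pass around the loop: update beliefs from feedback,
// then compute a control action to move the process toward Desired.
func (c *Controller) Step(fb Feedback) ControlAction {
	c.Belief = fb.Observed
	return ControlAction{Delta: c.Desired - c.Belief}
}

func main() {
	process := 2 // actual state of the controlled process
	ctrl := &Controller{Desired: 5}

	for i := 0; i < 3; i++ {
		action := ctrl.Step(Feedback{Observed: process})
		process += action.Delta // the process responds to the action
		fmt.Printf("tick %d: belief=%d action=%+d state=%d\n",
			i, ctrl.Belief, action.Delta, process)
		time.Sleep(10 * time.Millisecond) // feedback lag, simplified
	}
}
```

Note that the process model is only a belief: if the feedback lags or is wrong, the controller acts on a stale view of reality, which is exactly where many of the failures discussed later originate.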

The standard loop shown above can be extended into a hierarchy of controllers, for example automated and human controllers. One example from Dr. Leveson is the adaptive cruise control system in a car. The car's acceleration and brakes are the controlled processes; the car's motion sensors and radar provide feedback about speed and distance from the car ahead; and the cruise controller issues control actions to apply the brakes or gas as necessary. The driver (a human controller) may issue their own control actions as well.

Using this model in the context of web services, the pods are the controlled process: they provide feedback about their memory, CPU, and health, and the Kubernetes controller uses these metrics and its algorithms to determine whether a new pod should be spun up (the control action). This control action could be automated, or alternatively a human controller may intervene (using a runbook) and alter, say, the min/max pod count to effect change. A sketch of this decision logic follows the figure below.

Control loop example for container deployment
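Here is a minimal sketch of that scaling decision (purely illustrative; a real Kubernetes controller reconciles declaratively through the API server rather than computing replica counts like this):

```go
package main

import "fmt"

// PodMetrics is the feedback a controller receives about one pod.
type PodMetrics struct {
	Name    string
	Healthy bool
	CPUPct  float64 // CPU utilization, 0-100
}

// desiredReplicas is the control algorithm: given feedback and the
// configured min/max pod counts (which a human controller may alter
// via a runbook), decide how many pods we want.
func desiredReplicas(pods []PodMetrics, min, max int) int {
	healthy, busy := 0, 0
	for _, p := range pods {
		if p.Healthy {
			healthy++
			if p.CPUPct > 80 {
				busy++
			}
		}
	}
	want := healthy
	if busy == healthy { // all healthy pods saturated: scale up
		want = healthy + 1
	}
	if want < min {
		want = min
	}
	if want > max {
		want = max
	}
	return want
}

func main() {
	pods := []PodMetrics{
		{"app-1", true, 92},
		{"app-2", true, 88},
		{"app-3", false, 0}, // failed pod drops out of the healthy count
	}
	// Control action: reconcile from 2 healthy pods toward 3.
	fmt.Println("desired replicas:", desiredReplicas(pods, 2, 10))
}
```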

STAMP and STPA

STAMP is a framework for accident-causality analysis that treats unsafe systems as the result of a control problem. It encompasses software, humans, operations, and management, not just components and human errors. Multiple tools use this framework; we will look at STPA, a hazard analysis technique. It takes a top-down control approach and identifies the constraints and requirements whose violation makes systems and components unsafe. Incidents are caused by unsafe control actions from the various controllers in the hierarchy, and these fall into four primary types (adapted from J. Thomas; a minimal encoding of them appears after the list):

  1. Control action not given: a pod failure does not result in a new pod being created
  2. Unsafe control action given: an incorrect action, such as deleting a pod, is issued
  3. Potentially safe control action, but not timed correctly (applied too early or too late): a pod fails but creation of the new pod is delayed
  4. Control action of inadequate duration (stops too soon or is applied too long): a circuit breaker is opened but left open too long even though the underlying service has recovered
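A minimal, illustrative encoding of these four categories (the type names and example strings are my own, not from any STPA tooling) that a review checklist or analysis script might iterate over:

```go
package main

import "fmt"

// UnsafeActionType enumerates the four ways a control action can
// cause a hazard (adapted from J. Thomas).
type UnsafeActionType int

const (
	NotProvided    UnsafeActionType = iota // action needed but not given
	UnsafeProvided                         // wrong action given
	WrongTiming                            // right action, too early or too late
	WrongDuration                          // stops too soon or lasts too long
)

// examples pairs each type with the pod/circuit-breaker scenario
// from the list above.
var examples = map[UnsafeActionType]string{
	NotProvided:    "pod fails but no replacement pod is created",
	UnsafeProvided: "an incorrect action such as pod delete is issued",
	WrongTiming:    "pod fails but replacement creation is delayed",
	WrongDuration:  "circuit breaker stays open after the service recovers",
}

func main() {
	for t := NotProvided; t <= WrongDuration; t++ {
		fmt.Printf("type %d: %s\n", t, examples[t])
	}
}
```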

Note that timing is critical: there is a lag between an applied control action and the updated feedback from the controlled process, and the process model must account for this.

The Microservice Control Loop

Let us now apply all of the above concepts to a typical microservice, depicted in the following hierarchical control loop model. The goal is to create a general starting point, so the controllers chosen would typically apply across scenarios.

Hierarchical Microservices Control Loop
  1. The pods and the app are the “controlled processes.” They are the data plane and process customer requests.
  2. The controlled process provides metrics (feedback) that the controller(s) can use to control and enforce constraints to improve safety.
  3. There is a hierarchy of “controllers” that control these controlled processes. These are represented as automated and manual controllers.

Let us look at each:

  1. Deployment Controller: the deployment tool (such as ArgoCD). The feedback it receives is which images and versions are deployed in which cluster, plus the desired state from a Git configuration. Upon detecting drift, it issues control actions to the Kubernetes controller.
  2. Kubernetes Controller: we can think of this as receiving feedback from the kubelets on each node in the cluster regarding pod state, and applying control actions to bring the state to the desired configuration provided by the deployment controller.
  3. Feature Flag and Configuration Controller: based on environment configuration properties and feature flags, this applies control actions (generally to the App) to enforce the desired constraints.
  4. Circuit Breaker Controller (egress): this receives feedback from the various circuits in the App and applies control actions to open, close, or throttle them as necessary (a minimal sketch follows this list).
  5. API Gateway Controller (ingress): not to be confused with a Kubernetes ingress controller, this receives feedback from the App and applies control actions to throttle traffic to the App.
  6. SRE and Development Team Controller: the teams supporting the service. Individuals receive feedback via observability tools such as Prometheus, Grafana, Wavefront, CloudWatch, PagerDuty, custom tools and dashboards, logs, etc., and apply control actions via CLIs, tools, scripts, etc., based on runbooks and tribal knowledge.
  7. Management Controller: the individuals who approve change requests, make business decisions, or otherwise influence the development or operation of a web service.
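To ground item 4, here is a minimal, illustrative circuit-breaker controller (a sketch of the pattern, not any specific library's API). The feedback is the success or failure of each outbound call; the control actions are opening the circuit after repeated failures and allowing a probe after a cooldown, so the circuit does not stay open once the downstream service recovers (the “inadequate duration” hazard from the earlier list).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Breaker is a minimal egress circuit-breaker controller.
// Feedback: per-call success/failure. Control actions: open the
// circuit after maxFails, then allow a probe after cooldown so the
// circuit does not stay open after the downstream service recovers.
type Breaker struct {
	maxFails int
	cooldown time.Duration
	fails    int
	openedAt time.Time
	open     bool
}

var ErrOpen = errors.New("circuit open")

// Call routes a request through the breaker.
func (b *Breaker) Call(req func() error) error {
	if b.open {
		if time.Since(b.openedAt) < b.cooldown {
			return ErrOpen // fail fast, shed load from the App
		}
		b.open = false // half-open: let one probe through
	}
	if err := req(); err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.open = true
			b.openedAt = time.Now()
		}
		return err
	}
	b.fails = 0 // success: reset the failure count
	return nil
}

func main() {
	b := &Breaker{maxFails: 2, cooldown: 50 * time.Millisecond}
	flaky := func() error { return errors.New("downstream timeout") }

	fmt.Println(b.Call(flaky)) // failure 1
	fmt.Println(b.Call(flaky)) // failure 2: circuit opens
	fmt.Println(b.Call(flaky)) // rejected immediately: circuit open

	time.Sleep(60 * time.Millisecond) // cooldown elapses
	ok := func() error { return nil }
	fmt.Println(b.Call(ok)) // probe succeeds, circuit closes: <nil>
}
```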

Producer-Consumer Control Loop

For a more complete, real-world example, let us consider a producer-consumer problem. Based on certain customer actions, such as a password change, we need to notify the customer via email, push notification, or other means. The customer request produces an event and queues it. The consumer then processes the event and performs the necessary task(s).

Producer-Consumer control loop

The producer and consumer can be considered subsystems. A simple AWS queue serves as the buffer, and the controllers are shown as shared between producer and consumer for simplicity. In a real implementation the circuit breaker controllers would be independent, while others may be shared. A minimal sketch follows.
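In this sketch of the producer-consumer pair, a buffered Go channel stands in for the AWS queue and all names are illustrative: the producer enqueues an event on a password change, and the consumer drains the queue and performs the notification. The queue depth is the natural feedback signal the controllers above would monitor.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is produced on a customer action such as a password change.
type Event struct {
	Customer string
	Kind     string // e.g. "password-changed"
}

func main() {
	// Buffered channel as a stand-in for the AWS queue; its depth
	// (len(queue)) is the feedback a controller would monitor.
	queue := make(chan Event, 100)
	var wg sync.WaitGroup

	// Consumer subsystem: drain events and notify the customer.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for ev := range queue {
			// Control surface: a circuit breaker would wrap this
			// outbound call (email/push) in a real implementation.
			fmt.Printf("notify %s: %s\n", ev.Customer, ev.Kind)
		}
	}()

	// Producer subsystem: customer requests generate events.
	for _, c := range []string{"alice", "bob"} {
		queue <- Event{Customer: c, Kind: "password-changed"}
	}
	close(queue) // no more events; consumer drains and exits

	wg.Wait()
}
```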

Conclusion

We can model our web services based on systems theory. I have not seen the Google SRE team reference this, but the SRE principles align in many ways. The next step is to identify the unsafe control actions and loss scenarios and address them (see STPA Handbook Steps 3 and 4). Additionally, I am still unclear on how to model multi-region (multi-data-center) deployments of a web service. If these topics interest you, please reach out; I would appreciate any feedback. And if you would like to work on such problems, you will generally find open roles as well! Please refer to LinkedIn.
