Circuit Breaker Design Pattern for SRE


 


In the series of Design Patterns for SRE, the Circuit Breaker is one of the most widely used patterns for ensuring system reliability, reducing Mean Time to Detect (MTTD), and improving Mean Time to Recovery (MTTR). It is typically implemented at the service layer or within API gateways to prevent excessive retries against a failing service. The Circuit Breaker is a software design pattern that detects failures and encapsulates the logic of preventing a failure from constantly recurring, which helps maintain the stability and resilience of a system.
The following figures show how a service behaves during normal operation and during a failure. When the service does not respond to a request, the user is left waiting indefinitely, resulting in a poor user experience. Implementing the Circuit Breaker pattern improves this: instead of hanging, the client receives a prompt error (or fallback) response, which leads to a better user experience.


The Circuit Breaker pattern works much like an electrical circuit breaker. The following figure shows how the circuit breaker logic is introduced into the network path or application code.

      

The Circuit Breaker has three states in which the system operates (a minimal code sketch follows the list):

  • Closed State: The circuit is closed, and requests and responses flow normally while the service is healthy.
    • In this state, the circuit breaker allows all requests to pass through to the service.
    • It monitors the number of failures and their frequency.
    • If the failure rate exceeds a predefined threshold, the circuit breaker transitions (trips) to the open state.
  • Open State: When the service stops responding and requests start piling up, the circuit opens and further requests are blocked to prevent additional failures.
    • In the open state, the circuit breaker stops forwarding requests to the failing service.
    • It returns a predefined fallback response or error message, or forwards the request to another available backup system.
    • This state gives the failing service time to recover without additional load.
    • After a predefined cooling-off period, the circuit breaker moves into the half-open state.

  • Half-Open State: The circuit is in a trial state to check whether the underlying issue has been resolved. After the cooling-off period, the circuit breaker transitions to the half-open state.
    • It allows a limited number of test requests to pass through to the service.
    • If these requests succeed, the circuit breaker moves back to the closed state.
    • If they fail, it reverts to the open state.
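
To make these transitions concrete, here is a minimal, framework-free sketch of the three-state machine in Python. It is illustrative only: the class name, thresholds, and the call() wrapper are assumptions made for this example, not the API of any particular library.

    import time

    class CircuitBreaker:
        """Minimal three-state circuit breaker (illustrative sketch only)."""

        CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

        def __init__(self, failure_threshold=5, cooling_period=30.0):
            self.failure_threshold = failure_threshold  # failures before tripping
            self.cooling_period = cooling_period        # seconds to stay open
            self.state = self.CLOSED
            self.failure_count = 0
            self.opened_at = 0.0

        def call(self, func, *args, **kwargs):
            if self.state == self.OPEN:
                # After the cooling-off period, allow a trial request (half-open).
                if time.monotonic() - self.opened_at >= self.cooling_period:
                    self.state = self.HALF_OPEN
                else:
                    raise RuntimeError("circuit open: request rejected")
            try:
                result = func(*args, **kwargs)
            except Exception:
                self._on_failure()
                raise
            self._on_success()
            return result

        def _on_success(self):
            # A success (including a half-open trial) closes the circuit again.
            self.state = self.CLOSED
            self.failure_count = 0

        def _on_failure(self):
            self.failure_count += 1
            # Trip on a failed trial request or when the threshold is reached.
            if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
                self.state = self.OPEN
                self.opened_at = time.monotonic()

A caller would wrap each downstream request, for example breaker.call(fetch_inventory, item_id), and treat the raised error as the signal to return a fast failure or a fallback response instead of waiting on a dead service.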


This pattern is particularly useful in Site Reliability Engineering (SRE) for managing the reliability of services. Let's look at two case studies.

Case Studies

Case Study 1: Circuit Breaker Design Pattern for Microservices

  1. Monitoring and Metrics:

    • Monitor Metrics: Track error rate, response time, and timeouts; these metrics help determine when to trip the circuit breaker.
    • Monitor Service Calls: Track the success and failure rates of service calls.
    • Define Thresholds: Set the failure thresholds that will trigger the circuit breaker.
    • State Management: Implement logic to transition between the closed, open, and half-open states.

  2. Fallback Mechanism:
    • When the circuit breaker is open, it can provide a fallback mechanism (a sketch combining monitoring and fallback follows this list).
    • This could be a default response, cached data, or a message indicating the service is unavailable.
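
Putting the monitoring thresholds and the fallback mechanism together, the sketch below tracks the failure rate over a sliding window and serves a cached or default response while the circuit is open. It is a rough illustration: the window size, threshold, and the get_profile/fetch/cache names are assumptions for this example.

    import time
    from collections import deque

    class FailureRateBreaker:
        """Trips when the failure rate over a sliding window exceeds a threshold."""

        def __init__(self, window_size=20, failure_rate_threshold=0.5, cooling_period=30.0):
            self.window = deque(maxlen=window_size)   # recent outcomes: True = failure
            self.failure_rate_threshold = failure_rate_threshold
            self.cooling_period = cooling_period
            self.open_until = 0.0                     # circuit is open until this time

        def record(self, failed):
            self.window.append(failed)
            rate = sum(self.window) / len(self.window)
            if len(self.window) == self.window.maxlen and rate >= self.failure_rate_threshold:
                self.open_until = time.monotonic() + self.cooling_period

        def is_open(self):
            return time.monotonic() < self.open_until

    def get_profile(user_id, breaker, fetch, cache):
        """Return live data when possible, otherwise a cached or default fallback."""
        if breaker.is_open():
            return cache.get(user_id, {"status": "degraded", "profile": None})
        try:
            profile = fetch(user_id)          # remote call to the profile service
        except Exception:
            breaker.record(failed=True)
            return cache.get(user_id, {"status": "degraded", "profile": None})
        breaker.record(failed=False)
        cache[user_id] = profile              # keep a copy for future fallbacks
        return profile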

Sequence Diagram



Case Study 2

Use of a Load Balancer

Combining the Circuit Breaker design pattern with a load balancer can significantly enhance the resilience and fault tolerance of your microservices architecture. Here's how it works:

Integration of Circuit Breaker and Load Balancer

Load Balancer Role:

    • A load balancer distributes incoming requests across multiple instances of a service.
    • It ensures that no single instance is overwhelmed, improving the overall availability and reliability of the service.

Circuit Breaker Role:

    • The Circuit Breaker pattern monitors the health of service instances.
    • It detects failures and prevents requests from being sent to failing instances, allowing them time to recover.

 How They Work Together


Health Checks:

    • The load balancer performs regular health checks on service instances.
    • If an instance fails a health check, the load balancer stops sending traffic to it.

Failure Detection:

    • The circuit breaker monitors the success and failure rates of requests. 
    • If the failure rate exceeds a threshold, the circuit breaker trips, and the instance is marked as unhealthy.

Traffic Routing:

    • When the circuit breaker trips, it informs the load balancer to stop routing traffic to the failing instance.
    • The load balancer then redirects traffic to healthy instances, ensuring continuous service availability.

Recovery: 

    • After a predefined period, the circuit breaker allows a limited number of test requests to the failing instance.
    • If these requests succeed, the instance is marked as healthy, and the load balancer resumes routing traffic to it.
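
To illustrate how the two pieces can cooperate, here is a minimal sketch of a round-robin load balancer that keeps one circuit breaker per instance and skips instances whose circuit is open. The class names, the per-instance failure threshold, and the call(addr, request) signature are illustrative assumptions, not any specific load balancer's API.

    import itertools

    class InstanceBreaker:
        """Per-instance breaker: marks an instance unhealthy after repeated failures."""

        def __init__(self, failure_threshold=3):
            self.failure_threshold = failure_threshold
            self.failures = 0

        def healthy(self):
            return self.failures < self.failure_threshold

        def record(self, failed):
            self.failures = self.failures + 1 if failed else 0

    class LoadBalancer:
        """Round-robin load balancer that skips instances with an open circuit."""

        def __init__(self, instances):
            self.breakers = {addr: InstanceBreaker() for addr in instances}
            self.rotation = itertools.cycle(instances)

        def send(self, request, call):
            # Try each instance at most once per request before giving up.
            for _ in range(len(self.breakers)):
                addr = next(self.rotation)
                breaker = self.breakers[addr]
                if not breaker.healthy():
                    continue                      # circuit open: skip this instance
                try:
                    response = call(addr, request)
                except Exception:
                    breaker.record(failed=True)   # may trip this instance's breaker
                    continue
                breaker.record(failed=False)
                return response
            raise RuntimeError("no healthy instances available")

A production setup would also re-probe tripped instances after a cooling-off period (the half-open behaviour described earlier) before returning them to rotation.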

SRE Implementation Strategies

Strategic areas where circuit breakers fit into SRE practice:

SLIs, SLOs & Error Budgets

    • SLIs (Service Level Indicators) measure failure rates and latencies.
    • SLOs (Service Level Objectives) define acceptable failure thresholds.
    • Circuit Breakers help enforce SLOs by preventing widespread failures.
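
As a rough illustration of how an SLO can inform a breaker threshold, the snippet below derives a failure-rate trip point from an availability SLO; the SLO value and the safety factor are made-up numbers for this example.

    # Illustrative only: derive a breaker threshold from an availability SLO.
    slo_availability = 0.999                  # 99.9% of requests should succeed (SLO)
    error_budget = 1.0 - slo_availability     # 0.1% of requests may fail

    # Trip the breaker well before the whole budget is consumed, e.g. when the
    # short-term failure rate is 50x the budgeted rate (an arbitrary safety factor).
    safety_factor = 50
    failure_rate_threshold = min(1.0, error_budget * safety_factor)

    print(f"error budget: {error_budget:.4%}")                             # 0.1000%
    print(f"breaker trips at failure rate: {failure_rate_threshold:.2%}")  # 5.00%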

Monitoring & Alerting

    • Use SLIs/SLOs to define failure thresholds.
    • Integrate with observability tools (e.g., Prometheus, Grafana, Datadog).
    • Log circuit state changes for post-mortems and RCA.
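
As one way to wire circuit state into observability tooling, the sketch below uses the Python prometheus_client library to expose the breaker's state and trip count for scraping by Prometheus and graphing in Grafana. The metric names, the service label, and the state encoding are assumptions chosen for this example.

    from prometheus_client import Counter, Gauge, start_http_server

    # 0 = closed, 1 = half-open, 2 = open (encoding chosen for this example)
    CIRCUIT_STATE = Gauge("circuit_breaker_state",
                          "Current circuit breaker state", ["service"])
    CIRCUIT_TRIPS = Counter("circuit_breaker_trips_total",
                            "Number of times the circuit breaker has tripped",
                            ["service"])

    STATE_CODES = {"CLOSED": 0, "HALF_OPEN": 1, "OPEN": 2}

    def on_state_change(service, new_state):
        """Call this from the breaker whenever its state changes."""
        CIRCUIT_STATE.labels(service=service).set(STATE_CODES[new_state])
        if new_state == "OPEN":
            CIRCUIT_TRIPS.labels(service=service).inc()

    if __name__ == "__main__":
        start_http_server(9100)                # metrics exposed at :9100/metrics
        on_state_change("payments", "OPEN")    # example state change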

Dynamic Circuit Breakers

    • Adaptive thresholds using ML-based anomaly detection.
    • Autoscaling based on service health.

Fallback Strategies

    • Return cached responses (for read-heavy traffic).
    • Serve default values (graceful degradation).
    • Redirect to backup services.

Integration with Incident Management

    • Automatically create incidents when circuits trip frequently.
    • Tie circuit states to runbooks and escalation policies.
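
As a sketch of tying circuit state to incident management, the code below counts trips over a short window and posts to an incident webhook when they become frequent. The webhook URL, runbook link, payload shape, and thresholds are hypothetical placeholders for a real PagerDuty or Opsgenie integration.

    import json
    import time
    import urllib.request
    from collections import deque

    TRIP_TIMES = deque()                      # timestamps of recent circuit trips
    WINDOW_SECONDS = 300                      # look at the last 5 minutes
    TRIPS_BEFORE_INCIDENT = 3                 # hypothetical escalation threshold
    INCIDENT_WEBHOOK = "https://incidents.example.com/hooks/circuit-breaker"  # hypothetical

    def on_circuit_trip(service):
        """Record a trip and open an incident if trips are frequent."""
        now = time.time()
        TRIP_TIMES.append(now)
        while TRIP_TIMES and now - TRIP_TIMES[0] > WINDOW_SECONDS:
            TRIP_TIMES.popleft()              # drop trips outside the window
        if len(TRIP_TIMES) >= TRIPS_BEFORE_INCIDENT:
            payload = json.dumps({
                "title": f"Circuit breaker tripping repeatedly for {service}",
                "trips_in_window": len(TRIP_TIMES),
                "runbook": "https://runbooks.example.com/circuit-breaker",  # hypothetical
            }).encode("utf-8")
            req = urllib.request.Request(INCIDENT_WEBHOOK, data=payload,
                                         headers={"Content-Type": "application/json"})
            urllib.request.urlopen(req, timeout=5)   # fire the incident webhook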

Tools & Frameworks for Circuit Breaker

Circuit Breakers are typically implemented using various tools and frameworks that help monitor failures, control traffic, and prevent cascading system failures. Below is a categorized list of tools and frameworks used for Circuit Breaker implementation in microservices, cloud environments, and service mesh architectures.

The following are a few of the tools and frameworks that are available and widely used in industry.

Open-Source Circuit Breaker Libraries

These libraries are used within applications to implement the Circuit Breaker pattern at the service or API level. 

Resilience4j (Java)

    • Lightweight and modular circuit breaker library for Java.
    • Supports rate limiting, retry, bulkhead, and fallback mechanisms.
    • Integrates well with Spring Boot and Micronaut.
    • Provides metrics via Micrometer, Prometheus, Grafana.

Hystrix (Netflix OSS - Java) (Deprecated)

    • Legacy circuit breaker library developed by Netflix.
    • Provided fault tolerance in microservices.
    • Deprecated in favor of Resilience4j and service meshes like Istio.

Polly (.NET)

    • Circuit breaker for .NET applications.
    • Supports retry, timeout, fallback, and bulkhead isolation.
    • Integrates with ASP.NET Core.
Sentinel (Alibaba)

    • Advanced flow control and circuit breaker library.
    • Designed for high-performance systems (e.g., e-commerce, financial services).
    • Supports dashboard monitoring and adaptive traffic shaping.

Service Mesh-Based Circuit Breakers

These tools are used in Kubernetes and cloud-native architectures to provide traffic management, circuit breaking, and observability.

Istio

    • Sidecar-based service mesh that provides built-in circuit breaker support.
    • Uses Envoy Proxy for failure handling, rate limiting, and retries.
    • Deep integration with Prometheus, Grafana, and Kiali for observability.

Linkerd

    • Lightweight service mesh with built-in circuit breaking and retry policies.
    • Focuses on performance and simplicity compared to Istio.
    • Works well for smaller Kubernetes clusters.

Envoy Proxy

    • High-performance service proxy used in Istio, AWS App Mesh, and Gloo Edge.
    • Supports dynamic circuit breaking, request hedging, and rate limiting.
    • Can be used independently without a full service mesh.

Consul (by HashiCorp)

    • Provides service discovery, health checks, and circuit breakers.
    • Works with Envoy for traffic control and failover.
    • Used in multi-cloud and hybrid environments.

Cloud-Native Circuit Breakers

Managed solutions in AWS, Azure, and GCP that handle circuit breaking without needing custom libraries.

AWS API Gateway & AWS App Mesh

    • AWS API Gateway provides throttling, rate limiting, and request retries.
    • AWS App Mesh (Envoy-based) adds circuit breaking to microservices.
    • Works with CloudWatch & AWS X-Ray for monitoring.

Azure API Management

    • Managed API gateway with circuit breaking, throttling, and failover.
    • Supports policies to prevent failures from overloading backends.

GCP Cloud Endpoints & Traffic Director

    • GCP Cloud Endpoints offers rate limiting & API resilience features.
    • GCP Traffic Director (based on Envoy) provides dynamic circuit breaking for microservices.

Observability & Monitoring for Circuit Breakers

These tools help monitor circuit breaker states, detect failures, and automate recovery.

Prometheus & Grafana

    • Used to track circuit breaker events, failures, retries.
    • Works with Istio, Envoy, Resilience4j, AWS X-Ray.
    • Provides real-time dashboards for circuit health.

Datadog

    • Monitors circuit breaker states, latency, request failures.
    • Supports distributed tracing for microservices.

OpenTelemetry

    • Provides end-to-end tracing of circuit breaker failures.
    • Works with Jaeger, Zipkin, Grafana Tempo for distributed tracing.

Advanced Circuit Breaker Strategies for SRE

    • Adaptive Circuit Breakers: Use ML-based anomaly detection to adjust thresholds dynamically.
    • Ejection-Based Circuit Breakers: Used in Envoy to remove unhealthy nodes from load balancing.
    • Automated Remediation: Integrate with Incident Response Tools (PagerDuty, Opsgenie, VictorOps).

 Final Takeaways for SRE

    • Enhanced Resilience: By combining the Circuit Breaker pattern with a load balancer, you can prevent cascading failures and ensure that traffic is always routed to healthy instances.
    • Reduced Unnecessary Load and Latency: Prevents clients from calling non-responding services and ensures dependent services do not get stuck in retry loops.
    • Improved Fault Tolerance: The system can handle failures gracefully, providing fallback responses and maintaining service availability.
    • Enhanced Observability: Circuit breaker metrics provide failure insight, help SREs detect and act on trends before incidents escalate, can trigger automated alerts, and support Root Cause Analysis (RCA) with clear logs of failure rates and circuit states.
    • Efficient Resource Utilization: Load balancers distribute traffic evenly, preventing any single instance from becoming a bottleneck.
    • Better User Experience: Provides fallback responses, ensuring that users receive a response even if a service is down. 
    • Ensure Graceful Degradation: Instead of crashing, the system can provide fallback responses (e.g., cached data, static content, or default responses). Improves user experience by keeping the system partially available instead of completely failing.

"A well-placed Circuit Breaker is not about stopping failures; it's about containing them, preventing small cracks from becoming system-wide outages."

Thank you for reading this post and sharing your feedback. This is my first post in the series on SRE design patterns. Stay tuned for the next post on reliability patterns.
