Circuit Breaker with Fallback Improving Resiliency

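The Circuit Breaker pattern keeps problems in a called service, such as slow responses or exceptions (often the result of inadequate exception handling), from propagating to other microservices. In other words, it is a pattern for improving the stability and resilience of a system in an MSA (microservice architecture). When a service responds slowly, pending calls keep piling up and can eventually leave threads hanging.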

Fallback is the mechanism that, when the called service runs into a problem like the ones above, runs alternative logic and returns its result instead of handing an exception to the calling service.

Hystrix fallbacks prevent cascading failures.

Circuit Breaker Pattern States:

CLOSED
OPEN
HALF-OPEN

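The initial state is CLOSED, which means the service is healthy. When the number of failures exceeds the count specified in the proxy configuration, the state changes to OPEN and a timer starts. While the state is OPEN, the service is not called; the fallback logic runs and its result is returned. When the timer expires, the state changes to HALF-OPEN, and one more call to the service is allowed. If the service has recovered, the state changes to CLOSED and the failure counter is reset to 0. If the problem persists, the state goes back to OPEN.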
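The transitions described above can be sketched as a small state machine. This is a minimal, single-threaded illustration and not how any particular library implements it: the class name `SimpleCircuitBreaker`, its parameters, and the injectable `clock` are all hypothetical, and it assumes that any successful call resets the failure counter.

```kotlin
enum class State { CLOSED, OPEN, HALF_OPEN }

// Hypothetical sketch of the CLOSED -> OPEN -> HALF_OPEN transitions; not thread-safe.
class SimpleCircuitBreaker(
    private val failureThreshold: Int,      // failures tolerated before the circuit opens
    private val openDurationMillis: Long,   // how long the circuit stays OPEN
    private val clock: () -> Long = { System.currentTimeMillis() }
) {
    var state = State.CLOSED
        private set
    private var failureCount = 0
    private var openedAt = 0L

    fun <T> call(action: () -> T, fallback: (Throwable?) -> T): T {
        if (state == State.OPEN) {
            // While OPEN, skip the service entirely and serve the fallback.
            if (clock() - openedAt < openDurationMillis) return fallback(null)
            // Timer expired: allow one trial call.
            state = State.HALF_OPEN
        }
        return try {
            val result = action()
            // Success (including the trial call) closes the circuit and resets the counter.
            state = State.CLOSED
            failureCount = 0
            result
        } catch (ex: Exception) {
            failureCount++
            // A failed trial call, or too many failures, (re)opens the circuit.
            if (state == State.HALF_OPEN || failureCount > failureThreshold) {
                state = State.OPEN
                openedAt = clock()
            }
            fallback(ex)
        }
    }
}
```

Passing a fake `clock` makes the timer easy to exercise in a test without sleeping; a production breaker would also need synchronization, since several threads would call it at once.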

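// Alternative whose parameters are described below (these match Spring Retry's annotation):
// @CircuitBreaker(maxAttempts = 3, openTimeout = 5000L, resetTimeout = 20000L)
// Variant in use here (name and fallbackMethod match Resilience4j's annotation):
@CircuitBreaker(name = "my-service", fallbackMethod = "fallbackRun")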
fun run(): String {
    log.info("Calling external service...")
    if (Math.random() > 0.5) {
        throw RemoteAccessException("Something went wrong...")
    }
    log.info("Success calling external service")
    return "Success calling external service"
}

fun fallbackRun(ex: Throwable): String {
    log.error("Fallback for external service: ${ex.message}")
    return "Success on fallback"
}

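maxAttempts: the maximum number of attempts before the fallback is called
openTimeout: the time window within which the maximum number of failed attempts must occur for the circuit to open
resetTimeout: how long the circuit stays OPEN before it moves to HALF-OPEN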

When implementing the Circuit Breaker pattern, choosing the error threshold well seems to be the important part.

NetFlix - The tripping of circuits kicks in when a DependencyCommand has passed a certain threshold of error (such as 50% error rate in a 10 second period) and will then reject all requests until health checks succeed.
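A threshold such as "50% error rate in a 10 second period" can be checked with a sliding window of recent call outcomes. The sketch below is only an illustration of that arithmetic: the class `ErrorRateWindow` and all of its parameters are hypothetical, and `minimumCalls` is an added assumption so that one or two failed calls on a quiet service do not trip the circuit.

```kotlin
// Hypothetical helper: tracks call outcomes in a sliding time window; not thread-safe.
class ErrorRateWindow(
    private val windowMillis: Long = 10_000,       // the "10 second period"
    private val errorRateThreshold: Double = 0.5,  // the "50% error rate"
    private val minimumCalls: Int = 20             // assumption: ignore very small samples
) {
    private data class Outcome(val timestamp: Long, val failed: Boolean)

    private val outcomes = ArrayDeque<Outcome>()

    // Record one finished call and drop outcomes that have left the window.
    fun record(timestamp: Long, failed: Boolean) {
        outcomes.addLast(Outcome(timestamp, failed))
        evictOlderThan(timestamp - windowMillis)
    }

    // True when the error rate inside the window is at or above the threshold.
    fun shouldTrip(now: Long): Boolean {
        evictOlderThan(now - windowMillis)
        if (outcomes.size < minimumCalls) return false
        val failures = outcomes.count { it.failed }
        return failures.toDouble() / outcomes.size >= errorRateThreshold
    }

    private fun evictOlderThan(cutoff: Long) {
        while (outcomes.isNotEmpty() && outcomes.first().timestamp <= cutoff) {
            outcomes.removeFirst()
        }
    }
}
```

Tuning then comes down to three numbers: the window length, the error rate, and the minimum sample size. Too sensitive and the circuit flaps on ordinary noise; too lax and failures cascade before it opens.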

It’s very important to take into account that the complexity of the Circuit Breaker pattern’s implementation must answer our application’s real needs as well as the business requirements.

Any change in breaker state should be logged and breakers should reveal details of their state for deeper monitoring.

Making the Netflix API More Resilient

Making the Netflix API More Resilient: this post explains the principles behind Netflix's circuit breaker implementation.

Principles of Resiliency by NetFlix

A failure in a service dependency should not break the user experience for members
The API should automatically take corrective action when one of its service dependencies fails
The API should be able to show us what’s happening right now, in addition to what was happening 15–30 minutes ago, yesterday, last week, etc.

In Netflix's circuit breaker implementation, fallbacks can be triggered in a few ways:

A request to the remote service times out
The thread pool and bounded task queue used to interact with a service dependency are at 100% capacity
The client library used to interact with a service dependency throws an exception

These buckets of failures factor into a service’s overall error rate and when the error rate exceeds a defined threshold then we “trip” the circuit for that service and immediately serve fallbacks without even attempting to communicate with the remote service.
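The three fallback triggers listed above (timeout, a full thread pool and queue, and an exception from the client library) can be reproduced with a plain `ThreadPoolExecutor`. This is a rough sketch under stated assumptions rather than Netflix's implementation: `GuardedDependency` and the reason strings are hypothetical, and the pool relies on the executor's default rejection policy, which throws `RejectedExecutionException` once the pool and its bounded queue are both full.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.Callable
import java.util.concurrent.ExecutionException
import java.util.concurrent.RejectedExecutionException
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException

// Hypothetical wrapper around a single service dependency.
class GuardedDependency(
    poolSize: Int,
    queueSize: Int,
    private val timeoutMillis: Long
) {
    // Fixed-size pool with a bounded queue; extra work is rejected instead of piling up.
    private val executor = ThreadPoolExecutor(
        poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, ArrayBlockingQueue<Runnable>(queueSize)
    )

    fun <T> execute(remoteCall: () -> T, fallback: (reason: String) -> T): T {
        val future = try {
            executor.submit(Callable { remoteCall() })
        } catch (ex: RejectedExecutionException) {
            return fallback("REJECTED")   // thread pool and queue are at 100% capacity
        }
        return try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS)
        } catch (ex: TimeoutException) {
            future.cancel(true)           // interrupt the worker so it can be reused
            fallback("TIMEOUT")           // the remote call took too long
        } catch (ex: ExecutionException) {
            fallback("EXCEPTION")         // the call itself threw
        }
    }

    fun shutdown() {
        executor.shutdownNow()
    }
}
```

Each reason string stands in for one of the failure "buckets"; counting them per service is what would feed the error rate that trips the circuit.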

At Netflix, each service that’s wrapped by a circuit breaker implements a fallback using one of the following three approaches:

Custom fallback — in some cases a service’s client library provides a fallback method we can invoke, or in other cases we can use locally available data on an API server (eg, a cookie or local JVM cache) to generate a fallback response
Fail silent — in this case the fallback method simply returns a null value, which is useful if the data provided by the service being invoked is optional for the response that will be sent back to the requesting client
Fail fast — used in cases where the data is required or there’s no good fallback and results in a client getting a 5xx response. This can negatively affect the device UX, which is not ideal, but it keeps API servers healthy and allows the system to recover quickly when the failing service becomes available again.
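The three approaches above differ only in what the fallback returns. The sketch below shows one possible shape for each; `Recommendations`, `ServiceUnavailableException`, and the function names are hypothetical and exist only for this illustration.

```kotlin
// Hypothetical response type and error type, for illustration only.
data class Recommendations(val titles: List<String>)

class ServiceUnavailableException(message: String) : RuntimeException(message)

// Custom fallback: build a response from locally available data (here, an in-memory cache).
fun customFallback(localCache: Map<String, Recommendations>, userId: String): Recommendations =
    localCache[userId] ?: Recommendations(emptyList())

// Fail silent: the data is optional, so return null and let the caller leave it out.
fun failSilent(): Recommendations? = null

// Fail fast: the data is required and there is no good substitute, so raise an error
// that the API layer would turn into a 5xx response.
fun failFast(cause: Throwable): Nothing =
    throw ServiceUnavailableException("dependency unavailable: ${cause.message}")
```

Which one fits depends on whether the response is still useful without the data: optional data suits fail silent, cacheable data suits a custom fallback, and required data leaves only fail fast.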

Fault Tolerance in a High Volume, Distributed System

The Netflix blog post "Fault Tolerance in a High Volume, Distributed System" includes the following explanation.

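For dependency executions that involve a network call, a separate thread is still used, because the benefits of concurrency and parallelism outweigh the overhead of creating a new thread for each task. For dependency executions that make no network call, such as an in-memory cache lookup, the overhead of a separate thread can be too high, especially when the work has to finish quickly.

In that case, using a Semaphore to control access to the shared resource can be more efficient.

Implemented in code, a Semaphore looks like this.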

It is important to note that every thread uses the same semaphore instance.

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Semaphore

class Cache {
    // one semaphore instance shared by all callers: at most 10 threads access the cache at a time
    private val semaphore = Semaphore(10)
    // up to 10 threads can be inside at once, so the map itself must be thread-safe
    private val data = ConcurrentHashMap<String, String>()

    fun getValue(key: String): String? {
        // acquire a permit from the semaphore, blocking if necessary
        semaphore.acquire()
        try {
            return data[key]
        } finally {
            // always release the permit, even if the lookup throws
            semaphore.release()
        }
    }

    fun setValue(key: String, value: String) {
        // acquire a permit from the semaphore, blocking if necessary
        semaphore.acquire()
        try {
            data[key] = value
        } finally {
            // always release the permit, even if the write throws
            semaphore.release()
        }
    }
}

Acquire a permit:

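// java.util.concurrent.Semaphore with five permits available
Semaphore semaphoreWithFivePermits = new Semaphore(5);

// Acquire one permit, blocking until it is available
semaphoreWithFivePermits.acquire();

// Acquire four permits, blocking until all four are available
semaphoreWithFivePermits.acquire(4);

// Try to get a permit immediately; returns false if none is free, and ignores the fairness setting
semaphoreWithFivePermits.tryAcquire();

// Wait up to five seconds for a permit; returns false if none became available in time
semaphoreWithFivePermits.tryAcquire(5, TimeUnit.SECONDS);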

The fact that tryAcquire() is not fair is similar to Redisson's tryLock().
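Because `tryAcquire()` returns immediately instead of blocking, it pairs naturally with a fallback: when no permit is free, the caller can be served the fallback straight away rather than waiting. The sketch below is one way to express that idea; `SemaphoreGuard` and its method names are hypothetical.

```kotlin
import java.util.concurrent.Semaphore

// Hypothetical guard: limits concurrent executions and never blocks the caller.
class SemaphoreGuard(maxConcurrent: Int) {
    private val permits = Semaphore(maxConcurrent)

    fun <T> execute(action: () -> T, fallback: () -> T): T {
        // tryAcquire() returns false at once when no permit is free.
        if (!permits.tryAcquire()) return fallback()
        try {
            return action()
        } finally {
            // Always give the permit back, even if the action throws.
            permits.release()
        }
    }
}
```

The action runs on the calling thread, so there is no thread hand-off cost, but there is also no way to time out a slow action; that trade-off is why this style suits fast, in-memory work.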