Distributed Tracing

Background
- Definition
Properties
- Trace
- Span
Architecture
- Asynchronous processing with bounded buffer queue
OpenTracing
LogContext
Different Characteristics in Contexts
Links
References

Background

분산 추적(distributed tracing)은 MSA 환경에서 수 많은 마이크로서비스들 간의 트랜잭션 로그를 추적, 모니터링 하는 것을 의미한다.

Definition

Distributed tracing is a technique used to track the flow of a request or a transaction as it travels across multiple microservices or systems in a distributed architecture. The goal of distributed tracing is to provide a holistic view of how a request is handled and how different components of a system interact with each other, which can be useful for identifying performance bottlenecks, debugging issues, and improving overall system observability. Distributed tracing is typically implemented using a combination of instrumentation, correlation IDs, and a central trace storage and querying system.

Properties

Traces that are stitched together form special events called “spans” ; spans help you track a causal chain through a microservices ecosystem for a single transaction. To accomplish this, each service passes correlation identifiers, known as “trace context,” to each other; this trace context is used to add attributes on the span.

Trace

Span Id 는 분산 시스템에서 per-request 안에 존재하는 각각의 작업에 부여된 작업 Id 를 의미하는 반면에, Trace Id 는 하나의 요청이 시작되고 처리되기까지의 모든 과정을 포함하는 Id 이다.

<------------------------- Same Trace Id ------------------------->

––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–> time

 [Span A···················································]
   [Span B··············································]
      [Span D··········································]
    [Span C········································]
         [Span E·······]        [Span F··] [Span G··] [Span H··]

TraceId Generation Rule Examples:

The TraceId is typically generated by the first server that receives the request. The generation rule is: server IP + generated time + incremental sequence + current process ID
- e.g 0ad1348f1403169275002100356696
- 처음 8자리 0ad1348f는 TraceId를 생성하는 시스템의 IP.
- 다음 13자리 숫자 1403169275002는 TraceId를 생성하는 시간.
- 다음 4자리 1003은 1000에서 9000으로 증가하는 자동 증가 시퀀스.
- 마지막 5자리 56696이 현재 프로세스 ID 이다. TracerId의 역할은 단일 시스템의 여러 프로세스로 인해 발생하는 TraceId 충돌을 방지하는 것이다.

Calling chain to pass ThreadLocal objects across threads:

Thread 간 threadlocal 을 사용하여 TraceID/SpanID 전달할 수 있다.

Span

In distributed tracing, a span is a unit of work that is tracked as it propagates through a distributed system. A span typically includes information such as the start and end times of the work, the operation name, and any associated metadata. Spans can be nested to form a trace, which allows for the visualization and analysis of the flow of a request as it traverses different services and components in a distributed system.

Span 은 분산 추적에서 가장 기본이 되는 논리 단위(작업 단위) 를 의미한다. Span Id 는 분산 시스템에서 per-request 안에 존재하는 각각의 작업에 부여된 작업 Id 를 의미한다. 따라서, Span ID 는 동일한 Trace ID 내에서 발생한 모든 호출의 실행 순서를 결정하는 데 사용될 수 있다.

일반적으로 작업의 시작 시간과 종료 시간, 작업 이름 및 관련 메타데이터와 같은 정보가 포함된다. 스팬은 추적을 형성하기 위해 중첩될 수 있으며, 이를 통해 분산 시스템에서 서로 다른 서비스와 구성 요소를 통과할 때 요청 흐름을 시각화하고 분석할 수 있다.

For example, the following is an example Trace made up of 8 Spans:

Causal relationships between Spans in a single Trace


        [Span A]  ←←←(the root span)
            |
     +------+------+
     |             |
 [Span B]      [Span C] ←←←(Span C is a `ChildOf` Span A)
     |             |
 [Span D]      +---+-------+
               |           |
           [Span E]    [Span F] >>> [Span G] >>> [Span H]
                                       ↑
                                       ↑
                                       ↑
                         (Span G `FollowsFrom` Span F)

Sometimes it's easier to visualize Traces with a time axis as in the diagram below:

Temporal relationships between Spans in a single Trace


––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–> time

 [Span A···················································]
   [Span B··············································]
      [Span D··········································]
    [Span C········································]
         [Span E·······]        [Span F··] [Span G··] [Span H··]

위 처럼 시간의 흐름 형태로 구성하는 것이 모니터링에 더 도움이 된다. 아래는 Datadog 에서 제공하는 화면이다.

Span 에는 다음과 같은 정보들을 가지고 있다.

작업 이름
작업 시작시간, 종료시간
key, value 형태의 Tags, Logs
Span Contexts

Span Contexts:

Span Contexts 는 로그 추적을 더 수월하게 하기 위해 Span 에 남기는 부가 정보(additional information) 들을 가지고 있는 Context 라고 생각하면 된다.

// Create a new tracer
val tracer = Configuration.fromEnv().tracer

// Start a new span and set its operation name
val span = tracer.buildSpan("my-operation").start()

// Add some baggage to the span context
span.setBaggageItem("user-id", "12345")

// Propagate the span context in the request headers
val headers = tracer.inject(span.context(), Format.Builtin.HTTP_HEADERS, object : TextMap {
    override fun put(key: String, value: String) {
        // Add the headers to the request
        // ...
    }

    override fun iterator(): MutableIterator<MutableMap.MutableEntry<String, String>> {
        throw UnsupportedOperationException("Not implemented")
    }
})

// Use the request to call a service
// ...

// Extract the span context from the response headers
val extractedContext = tracer.extract(Format.Builtin.HTTP_HEADERS, object : TextMap {
    override fun put(key: String, value: String) {
        throw UnsupportedOperationException("Not implemented")
    }

    override fun iterator(): MutableIterator<MutableMap.MutableEntry<String, String>> {
        // Get the headers from the response
        // ...
    }
})

// Start a new span as a child of the extracted span
val childSpan = tracer.buildSpan("child-operation").asChildOf(extractedContext).start()

// ...

// Close the span
span.finish()

Architecture

Asynchronous processing with bounded buffer queue

No matter what approach the data collector adopts, the threads for sending out telemetry data must be separated from business threads. Call it using a background threads pool.
There should be a queue between business threads and background threads. And this queue should have bounded size to avoid out of memory issue.

┌─────────────────────────────────────────────────────────────────────────────────┐                                            
│                                   Application                                   │                                            
│                                                                                 │                                            
│                                                                                 │                                            
│   ┌───────────────────┐       ┌───────────────┐       ┌─────────────────────┐   │                                            
│   │                   │       │               │       │                     │   │                                            
│   │                   │       │               │       │                     │   │     ┌────────────┐      ┌─────────────────┐
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │  Business logic   │       │ Bounded size  │       │                     │   │     │            │      │Log/Trace/Metrics│
│   │      threads      │──────▶│queue to avoid │──────▶│ Background threads  │   │────▶│Kafka / UDP │─────▶│    Processor    │
│   │                   │       │ Out of Memory │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     └────────────┘      └─────────────────┘
│   │                   │       │               │       │                     │   │                                            
│   └───────────────────┘       └───────────────┘       └─────────────────────┘   │                                            
│                                                                                 │                                            
│                                                                                 │                                            
│                                                                                 │                                            
└─────────────────────────────────────────────────────────────────────────────────┘

Related Articles:

OpenTracing

Java 진영의 ORM 에서는 JPA 라는 표준 명세가 있고 표준을 구현하는 각 Framework 들이 존재한다.

OpenTracing 은 Cloud Native Computing Foundation 에서 구현한 비공식 분산 추적 표준 이다. 이를 구현한 Jaeger, Elastic APM, Zipkin 등의 여러가지 Tracer 들이 존재한다.

import io.opentracing.Span
import io.opentracing.Tracer
import io.opentracing.util.GlobalTracer
import io.undertow.Undertow
import io.undertow.server.HttpHandler
import io.undertow.server.HttpServerExchange

fun main() {
    // Initialize the tracer
    val tracer: Tracer = ...
    GlobalTracer.register(tracer)

    // Create the server
    val server = Undertow.builder()
        .addHttpListener(8080, "localhost")
        .setHandler(object: HttpHandler {
            override fun handleRequest(exchange: HttpServerExchange) {
                // Start a new span for the request
                val span: Span = tracer.buildSpan("handle-request").start()

                try {
                    // Do some work here...
                    exchange.responseSender.send("Hello, World!")
                } finally {
                    // Finish the span
                    span.finish()
                }
            }
        })
        .build()
    server.start()
}

This is a simple example, but in a real-world application, you would likely want to extract more information from the request and add tags and logs to the span to provide more context and make it easier to understand the trace. Also, you should be careful about the performance issue when adding tracing to your code.

Related Articles:

LogContext

LogContext with MDC

@Slf4j
@Component
public class CustomConsoleAppender extends AsyncAppenderBase<ILoggingEvent> implements ApplicationContextAware {

  private final ObjectMapper objectMapper;
    
  public CustomConsoleAppender() {
    objectMapper = new ObjectMapper();
    super.addAppender(new AsyncAppender());
  }

  @Override
  protected void append(ILoggingEvent iLoggingEvent) {
    MDC.put("span_id");
    MDC.put("trace_id");
    
    // Span Contexts
    Map<String, String> messages = new HashMap<>();
    messages.put("httpMethod", "POST");
    messages.put("appVersion", "1");
    String logMessages = objectMapper.writeValueAsString(messages);
    log.info("{}", logMessages);
    
    MDC.remove("span_id");
    MDC.remove("trace_id");
  }
}

Different Characteristics in Contexts

Span context and Log context are both ways to propagate contextual information through a system, but they serve different purposes and have different characteristics.

Span context is used in distributed tracing to track the flow of a request as it propagates through a distributed system. It includes information such as the trace and span IDs, which are used to uniquely identify the trace and the current span, respectively. It also includes other information such as baggage items and flags that can be used to propagate metadata or control the sampling of traces. Span context is typically propagated in the headers of a request or response, but can also be passed in other ways such as through a message queue or in-memory data structure.
Log context, on the other hand, is used to propagate additional information that can be used to enrich log entries. It typically includes information such as the user ID, request ID, or other information that can be used to correlate log entries with a specific request or user. Log context is usually propagated through thread-local variables or MDC (Mapped Diagnostic Context) and it is used to enrich log statements with additional information.

In summary, span context is used for distributed tracing and log context is used for logging. Both are used to propagate contextual information through a system, but they serve different purposes and have different characteristics.

References

마이크로서비스 인 액션 / 모건 브루스, 파울로 페레이라 저 / 위키북스