2024-09-27

OpenTelemetry Remote Tracing in Go

Earlier today, I tried using OpenTelemetry for some custom instrumentation. Specifically, I wanted to track the lifecycle of an event throughout my whole system.

Previously, I mainly relied on auto-instrumentation to generate traces, so this was the first time I worked with the OpenTelemetry tracing SDK directly. Here are my notes after a few hours of playing around with it.

The goal: every time I receive an event, I start a new trace with a root span. The event is then processed and sent through a few queues to be handled by different workers (or services). Along the way, each worker should add spans to the same trace so that I get a holistic view of what happens to each event.
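The snippets in this post refer to an event with an ID and a Metadata map. Here is a minimal sketch of what such an event could look like and how its metadata survives a trip through a queue. The Event type, its JSON field names, and the encodeEvent/decodeEvent helpers are my own assumptions for illustration, and JSON is just one possible wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical event shape (an assumption, not a real library type).
// Metadata is where tracing identifiers can travel alongside the payload.
type Event struct {
	ID       string            `json:"id"`
	Metadata map[string]string `json:"metadata,omitempty"`
}

// encodeEvent serializes an event before it is published to a queue.
func encodeEvent(e Event) ([]byte, error) {
	return json.Marshal(e)
}

// decodeEvent restores an event on the worker side. It always returns a
// non-nil Metadata map so callers can read and write keys safely.
func decodeEvent(b []byte) (Event, error) {
	var e Event
	if err := json.Unmarshal(b, &e); err != nil {
		return Event{}, err
	}
	if e.Metadata == nil {
		e.Metadata = map[string]string{}
	}
	return e, nil
}

func main() {
	in := Event{
		ID: "evt-1",
		Metadata: map[string]string{
			"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
			"span_id":  "00f067aa0ba902b7",
		},
	}

	payload, err := encodeEvent(in)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}

	out, err := decodeEvent(payload)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(out.ID, out.Metadata["trace_id"], out.Metadata["span_id"])
}
```

Whatever the real queue format is, the point is that the metadata has to be serialized together with the event, otherwise the workers have nothing to rebuild the trace from.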

First, when receiving a new event, this is how to start a new trace with a root span:

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// ...

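// Use the globally registered tracer provider and scope the tracer to this package.
traceProvider := otel.GetTracerProvider()
tracer := traceProvider.Tracer("project/internal/packagename")

// context.Background() carries no span, so this span becomes the root of a new trace.
parentCtx := context.Background()
traceCtx, span := tracer.Start(parentCtx, "event lifecycle")
defer span.End()
span.SetAttributes(attribute.String("event.id", event.ID))
// Pass traceCtx to in-process calls so their spans become children of this one.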

To break things down a bit, the TracerProvider is the entry point for tracing. This snippet uses the global tracer provider, but we could also accept a custom one for specific use cases. From the provider, we create a Tracer instance. It's recommended to give it a unique name (say, the package's import path) so that the tracer is scoped to that package only.

From there, we can start a new span whose parent is whatever span is stored in parentCtx. In this example, parentCtx is a brand new context with no span in it, so this span becomes the root span of a new trace.

Next, whenever we process the event, we want to add spans to this trace. Doing so is fairly straightforward if we have access to traceCtx. However, because the event processing happens in different services, we have to reconstruct the context ourselves from the span context's trace ID and span ID.

event.Metadata["trace_id"] = span.SpanContext().TraceID().String()
event.Metadata["span_id"] = span.SpanContext().SpanID().String()

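// ... in event processing flow

// spanContextFromEvent returns two values, so it cannot be passed inline.
spanCtx, err := spanContextFromEvent(event)
if err != nil {
	// Assumes the enclosing handler returns an error; adapt as needed.
	return err
}
parentCtx := trace.ContextWithRemoteSpanContext(context.Background(), spanCtx)
_, span := h.tracer.Start(parentCtx, "event processing")
defer span.End()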

// ...

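// spanContextFromEvent rebuilds the remote parent span context from the
// trace ID and span ID stored in the event metadata.
func spanContextFromEvent(event Event) (trace.SpanContext, error) {
	traceID, err := trace.TraceIDFromHex(event.Metadata["trace_id"])
	if err != nil {
		return trace.SpanContext{}, err
	}
	spanID, err := trace.SpanIDFromHex(event.Metadata["span_id"])
	if err != nil {
		return trace.SpanContext{}, err
	}
	return trace.NewSpanContext(trace.SpanContextConfig{
		TraceID:    traceID,
		SpanID:     spanID,
		TraceFlags: trace.FlagsSampled, // mark the parent as sampled (same value as 01)
		Remote:     true,               // the parent span lives in another process
	}), nil
}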

In this example, the trace ID and span ID travel with the event in its metadata. When processing the event, we reconstruct the parent context from these IDs, as shown above. The idea is to construct a "remote" span context, where remote means the parent span was created in a different process from the one currently running.

With this snippet, "event processing" shows up as a child span of the root "event lifecycle" span.
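As a variation on the two separate metadata keys, the same information can be packed into a single string in the W3C Trace Context "traceparent" format (version, trace ID, span ID, flags, joined by dashes), which is the format OpenTelemetry's TraceContext propagator uses for headers. The sketch below uses only the standard library and is not the OpenTelemetry implementation. The helper names formatTraceparent and parseTraceparent are my own, and it only handles version 00.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// isLowerHex reports whether s is exactly n lowercase hexadecimal characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// isAllZero reports whether s consists only of '0' characters.
func isAllZero(s string) bool {
	return strings.Trim(s, "0") == ""
}

// validateIDs checks the length, alphabet, and non-zero rules for both IDs.
func validateIDs(traceID, spanID string) error {
	if !isLowerHex(traceID, 32) || isAllZero(traceID) {
		return errors.New("invalid trace ID")
	}
	if !isLowerHex(spanID, 16) || isAllZero(spanID) {
		return errors.New("invalid span ID")
	}
	return nil
}

// formatTraceparent builds a version-00 traceparent value.
func formatTraceparent(traceID, spanID string, sampled bool) (string, error) {
	if err := validateIDs(traceID, spanID); err != nil {
		return "", err
	}
	flags := "00"
	if sampled {
		flags = "01"
	}
	return "00-" + traceID + "-" + spanID + "-" + flags, nil
}

// parseTraceparent splits a version-00 traceparent value back into its parts.
func parseTraceparent(s string) (string, string, bool, error) {
	parts := strings.Split(s, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", false, errors.New("unsupported traceparent value")
	}
	if err := validateIDs(parts[1], parts[2]); err != nil {
		return "", "", false, err
	}
	if !isLowerHex(parts[3], 2) {
		return "", "", false, errors.New("invalid trace flags")
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return "", "", false, err
	}
	// The lowest bit of the flags byte is the "sampled" flag.
	return parts[1], parts[2], flags&1 == 1, nil
}

func main() {
	tp, err := formatTraceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	if err != nil {
		fmt.Println("format error:", err)
		return
	}
	fmt.Println(tp)

	traceID, spanID, sampled, err := parseTraceparent(tp)
	fmt.Println(traceID, spanID, sampled, err)
}
```

The resulting string could be stored under a single metadata key (say, a hypothetical event.Metadata["traceparent"]), and the parsed IDs can then be fed to trace.TraceIDFromHex and trace.SpanIDFromHex exactly as before. One small benefit is that the sampled flag travels with the event instead of being hard-coded on the receiving side.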

I think this is a fairly straightforward flow, but since I'm quite new to OpenTelemetry, it still took me a bit of time to figure out. I hope this has been helpful to you, if you've made it this far. Please feel free to contact me if you have any questions or if you'd like a small demo.