feat: surface OpenTelemetry export errors and config #178

Draft
GregTheGreek wants to merge 1 commit into main from feat/otel-export-visibility
@GregTheGreek

Summary

Three small additions to app.Run that turn silent OTLP failures into observable startup signals.

// Route SDK-level errors (including export failures) through zerolog.
otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
    log.Error().Err(err).Msg("OpenTelemetry SDK error")
}))

var mp metric.MeterProvider
if url := configuration.RelayerConfig.OpenTelemetryCollectorURL; url == "" {
    // Empty URL: treat telemetry as disabled instead of exporting to the default endpoint.
    log.Warn().Msg("OpenTelemetry disabled: collector URL not configured")
    mp = noop.NewMeterProvider()
} else {
    // Log the target so container logs show where metrics are being sent.
    log.Info().Str("url", url).Msg("Initializing OpenTelemetry metric provider")
    sdkMp, err := observability.InitMetricProvider(context.Background(), url)
    panicOnError(err)
    defer func() { /* shutdown */ }()
    mp = sdkMp
}

Why

Investigating "metrics not appearing in Grafana" today required guessing because the relayer prints nothing about telemetry at startup. Three sharp edges combined into one black box:

1. The OTel SDK silently swallows export errors

The metric SDK uses a global error handler. When none is registered, export failures (connection refused, TLS handshake, 404 on wrong path, context deadline exceeded) are dropped. Registering a handler that routes errors through zerolog makes container logs the source of truth.

2. The configured URL is invisible

Container logs say nothing about the OTel target. There is no way from logs to tell whether OpenTelemetryCollectorURL is empty, malformed, or pointing at an unreachable host. A single INFO log on startup fixes this.

3. An empty URL silently exports to localhost

InitMetricProvider calls url.Parse(agentURL), which returns no error for "". otlpmetrichttp.WithEndpoint("") then falls back to the OTLP HTTP default of localhost:4318. The relayer starts up clean and ships metrics into a void. Treating the empty case as "telemetry disabled" with a noop.MeterProvider and a WARN log makes misconfiguration loud while keeping local-dev simple (no collector required).
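
The url.Parse behavior described above can be checked in isolation with the standard library alone. This is a minimal sketch: parsedHost is a hypothetical helper written for illustration, not a function from the relayer, and it does not reproduce what InitMetricProvider does with the result.

```go
package main

import (
	"fmt"
	"net/url"
)

// parsedHost is a hypothetical helper (not part of the relayer) that
// returns the host component net/url extracts from raw, plus any parse error.
func parsedHost(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return u.Host, nil
}

func main() {
	// An empty string parses successfully and yields an empty host,
	// so a nil error from url.Parse says nothing about whether a
	// collector URL was actually configured.
	host, err := parsedHost("")
	fmt.Printf("host=%q err=%v\n", host, err)
}
```

Because the parse step cannot distinguish "not configured" from "configured", the explicit url == "" check has to happen before the provider is initialized.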

Behavior change

  • OTel collector URL set correctly: no behavior change. Same provider, same exporter, same metrics. New: one INFO log at startup; the WARN never fires, and the error handler stays silent unless the SDK reports a problem.
  • OTel collector URL empty: previously exported to localhost:4318. Now the relayer uses noop.NewMeterProvider(), which makes no export attempts, and logs a single WARN at startup.
  • OTel collector URL malformed or unreachable: previously silent. Now every failed export attempt logs an ERROR with the underlying SDK error.
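
The startup branching for the first two cases can be sketched as a pure function, which would make it testable without an OTel dependency. The collectorDecision type and decideCollector function are hypothetical names introduced here; the PR itself inlines this logic in app.Run. The example collector address is also made up.

```go
package main

import "fmt"

// collectorDecision is a hypothetical value type describing what the
// relayer would do at startup for a given collector URL.
type collectorDecision struct {
	Enabled bool   // false means a no-op meter provider is used
	Level   string // log level of the startup message
	Message string // startup log message, copied from the PR snippet
}

// decideCollector mirrors the url == "" check from the PR: only the
// exact empty string disables telemetry.
func decideCollector(collectorURL string) collectorDecision {
	if collectorURL == "" {
		return collectorDecision{
			Enabled: false,
			Level:   "warn",
			Message: "OpenTelemetry disabled: collector URL not configured",
		}
	}
	return collectorDecision{
		Enabled: true,
		Level:   "info",
		Message: "Initializing OpenTelemetry metric provider",
	}
}

func main() {
	for _, u := range []string{"", "http://collector.example:4318"} {
		d := decideCollector(u)
		fmt.Printf("%q -> enabled=%t level=%s\n", u, d.Enabled, d.Level)
	}
}
```

The malformed or unreachable case is deliberately absent: it only surfaces at export time, through the registered error handler.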

Test plan

  • go build ./... clean
  • go vet ./app/... clean
  • Deploy to staging, confirm an INFO log on startup shows the configured collector URL
  • If the URL is wrong, confirm SDK errors now appear in container logs
  • Local dev without SYG_RELAYER_OPENTELEMETRYCOLLECTORURL: confirm WARN logs and the relayer runs cleanly with no metric export attempts

Related

Pairs with #177 (declare seconds unit on duration histograms) which fixes the bucket-view mismatch for the same set of histograms. Together they make the staging metrics pipeline both observable from logs and meaningful in Grafana.

Three small additions to the metric provider init in app.Run that turn
silent OTLP failures into observable startup signals:

1. Register a global otel.SetErrorHandler that routes SDK errors
   (export failures, malformed URLs, TLS handshake errors, context
   deadlines) to zerolog. Without this the SDK swallows export errors
   and the relayer prints nothing.

2. Log the configured collector URL at INFO before initializing the
   provider, so container logs show the target the relayer is
   actually using.

3. Treat an empty OpenTelemetryCollectorURL as "telemetry disabled"
   and substitute a noop.MeterProvider with a WARN log, instead of
   the current behavior where url.Parse("") succeeds and the OTLP
   HTTP exporter silently falls back to localhost:4318.

No behavior change when the collector URL is set correctly.

Co-Authored-By: Claude
@github-actions

Go Test coverage is 36.9% ✨ ✨ ✨