OpenTelemetry as APM replacement: one SDK, any backend

Published: 2026-06-11

APM tools — Elastic APM, Datadog, Dynatrace, New Relic — sell the same product: traces, metrics, and logs from your app, visualized in their UI. The cost is vendor lock-in: proprietary agents, proprietary protocols, and pricing that scales with traffic. OpenTelemetry breaks that model. One SDK, one exporter config, and you can route signals to any backend — or several at once. Migrating away from a vendor stops being a two-week refactoring job and becomes a config change in the OTel Collector.


Why vendor APM creates lock-in

A Datadog agent instruments your Go service with dd-trace-go. New Relic uses its go-agent package. Elastic APM uses apm-agent-go (imported as go.elastic.co/apm). These are different packages with different APIs. When you switch vendors, you rewrite instrumentation code.

The deeper lock-in is the protocol. Datadog agents speak the Datadog API. New Relic agents speak the New Relic API. Even when a team would prefer another vendor's UI or pricing, the cost of re-instrumenting 50 services keeps it on the original vendor for years.

OpenTelemetry is a CNCF project and the de facto standard for observability signals. The SDK is vendor-neutral. The protocol (OTLP) is vendor-neutral. Instrumentation code written against OTel works with every backend that speaks OTLP or that the Collector has an exporter for: Jaeger, Grafana Tempo, Zipkin, Honeycomb, Datadog (it accepts OTLP too), Elastic, and dozens more.


What OpenTelemetry gives you

Three signal types, one SDK:

  • Traces — spans representing work done by your service, linked into a trace tree across services
  • Metrics — counters, histograms, gauges; same data Prometheus scrapes but with distributed context
  • Logs — structured log entries correlated to traces via trace_id and span_id
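
The trace_id and span_id that tie these signals together travel between services in the W3C traceparent header, which OTel SDKs propagate by default. The sketch below parses that header with the Go standard library only, to show what the identifiers look like. The type and function names are illustrative and are not part of the OTel SDK, which handles this for you.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// SpanContext holds the identifiers a log record needs for correlation.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool   // lowest bit of the trace-flags field
}

// isLowerHex reports whether s consists only of the characters 0-9 and a-f.
func isLowerHex(s string) bool {
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceparent parses a version-00 traceparent value of the form
// version-traceid-parentid-flags.
func ParseTraceparent(h string) (SpanContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return SpanContext{}, errors.New("unsupported traceparent format")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if len(traceID) != 32 || len(spanID) != 16 || len(flags) != 2 {
		return SpanContext{}, errors.New("wrong field length")
	}
	if !isLowerHex(traceID) || !isLowerHex(spanID) || !isLowerHex(flags) {
		return SpanContext{}, errors.New("fields must be lowercase hex")
	}
	if strings.Trim(traceID, "0") == "" || strings.Trim(spanID, "0") == "" {
		return SpanContext{}, errors.New("all-zero IDs are invalid")
	}
	// The sampled flag is the lowest bit of the flags byte, so it is set
	// whenever the second hex digit is odd.
	sampled := strings.IndexByte("13579bdf", flags[1]) >= 0
	return SpanContext{TraceID: traceID, SpanID: spanID, Sampled: sampled}, nil
}

func main() {
	sc, err := ParseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(sc.TraceID, sc.SpanID, sc.Sampled, err)
}
```

The same 32-character trace ID is what ends up as trace_id on log records emitted inside that request.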

The OTel Collector is the routing layer. Apps send all signals to it over OTLP. The Collector fans out to whichever backends you run. You can send traces to Tempo and Jaeger simultaneously during a migration. You can filter, transform, and batch signals before they reach backends.
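
As a mental model for the fan-out, here is a minimal Go sketch. It is not the Collector's implementation (the real one adds queues and retries per exporter); the Exporter interface, memExporter, and FanOut are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Span is a stand-in for an exported span.
type Span struct{ TraceID, Name string }

// Exporter is a hypothetical minimal backend interface.
type Exporter interface {
	Name() string
	Export(batch []Span) error
}

// memExporter records what it receives; fail simulates an unreachable backend.
type memExporter struct {
	name string
	fail bool
	got  []Span
}

func (e *memExporter) Name() string { return e.name }

func (e *memExporter) Export(batch []Span) error {
	if e.fail {
		return errors.New(e.name + ": backend unavailable")
	}
	e.got = append(e.got, batch...)
	return nil
}

// FanOut delivers the same batch to every exporter. A failing exporter does
// not stop delivery to the others; its error is returned keyed by name.
func FanOut(batch []Span, exporters ...Exporter) map[string]error {
	failed := map[string]error{}
	for _, e := range exporters {
		if err := e.Export(batch); err != nil {
			failed[e.Name()] = err
		}
	}
	return failed
}

func main() {
	tempo := &memExporter{name: "otlp/tempo"}
	jaeger := &memExporter{name: "otlp/jaeger", fail: true}
	failed := FanOut([]Span{{TraceID: "t1", Name: "GET /orders"}}, tempo, jaeger)
	fmt.Println("tempo received:", len(tempo.got), "failed exporters:", len(failed))
}
```

The point of the design is that the application exports once; which backends receive the data is decided entirely on the Collector side.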


Architecture: replacing Elastic APM with Grafana stack

Before:

App (elastic-apm-agent) → APM Server → Elasticsearch → Kibana APM

After:

App (OTel SDK) → otel-collector → Grafana Tempo     (traces)
                               → Prometheus          (metrics)
                               → Loki                (logs)

Grafana becomes the single UI for all three signals, with correlation built in: click a trace span, see the logs that fired during that span, pivot to the latency histogram for that service. The APM Server and its Elasticsearch index overhead go away. Tempo stores traces efficiently in object storage. Loki stores logs cheaply.


OTel Collector: fan-out pipeline

A single Collector deployment in the observability namespace handles the whole cluster:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: otel-collector
  namespace: observability
spec:
  interval: 10m  # required by Flux; 10m is an assumed value
  chart:
    spec:
      chart: opentelemetry-collector
      version: "0.x"
      sourceRef:
        kind: HelmRepository
        name: open-telemetry
        namespace: flux-system
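  values:
    mode: deployment
    replicaCount: 2
    image:
      # the prometheusremotewrite and loki exporters ship in the contrib distribution
      repository: otel/opentelemetry-collector-contrib
    config: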
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318

      processors:
        memory_limiter:
          check_interval: 1s  # how often memory usage is sampled
          limit_mib: 512
        batch:
          timeout: 5s
          send_batch_size: 1000
        resource:
          attributes:
            - key: k8s.cluster.name
              value: "prod"
              action: upsert

      exporters:
        otlp/tempo:
          endpoint: "tempo.observability.svc.cluster.local:4317"
          tls:
            insecure: true
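        prometheusremotewrite:
          # any Prometheus remote-write endpoint works; this one is a VictoriaMetrics vmsingle service
          endpoint: "http://vmsingle.observability.svc.cluster.local:8428/api/v1/write"
        # Recent collector-contrib releases deprecate the loki exporter in favor of Loki's
        # native OTLP endpoint (otlphttp exporter); check which one your versions support.
        loki: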
          endpoint: "http://loki.observability.svc.cluster.local:3100/loki/api/v1/push"
          default_labels_enabled:
            exporter: false
            job: true
          labels:
            resource:
              service.name: "service_name"
              k8s.namespace.name: "namespace"

      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [memory_limiter, resource, batch]
            exporters: [otlp/tempo]
          metrics:
            receivers: [otlp]
            processors: [memory_limiter, resource, batch]
            exporters: [prometheusremotewrite]
          logs:
            receivers: [otlp]
            processors: [memory_limiter, resource, batch]
            exporters: [loki]

The resource processor stamps every signal with k8s.cluster.name. Useful when multiple clusters feed the same backend — you can filter by cluster in Grafana without changing app code.

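The batch processor is not optional in practice. Without it, the Collector forwards every incoming export request as its own outbound call, so many small requests from many pods turn into many small writes to the backend. At 1000 requests/second a service generates thousands of spans/second; grouping them into batches of 1000 sharply cuts the number of outbound calls and the per-call overhead. Note the order: memory_limiter must come before batch. If it comes after, batches are already assembled in memory when the limit is hit.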
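
To make the effect of send_batch_size and timeout concrete, here is a simplified Go sketch of size-or-timeout batching. It is not the Collector's batch processor: the Batcher type and CountCalls helper are hypothetical, time is passed in as plain milliseconds to keep the example deterministic, and spans are assumed to arrive one at a time.

```go
package main

import "fmt"

// Batcher buffers spans and releases them when either the batch is full
// or the oldest buffered span has waited for the timeout.
type Batcher struct {
	size      int
	timeoutMs int64
	buf       []string
	startedMs int64 // arrival time of the first span in the current buffer
}

func NewBatcher(size int, timeoutMs int64) *Batcher {
	return &Batcher{size: size, timeoutMs: timeoutMs}
}

// Add buffers one span. It returns a full batch once size is reached, else nil.
func (b *Batcher) Add(span string, nowMs int64) []string {
	if len(b.buf) == 0 {
		b.startedMs = nowMs
	}
	b.buf = append(b.buf, span)
	if len(b.buf) >= b.size {
		return b.flush()
	}
	return nil
}

// Tick releases a partial batch if the timeout has elapsed, else nil.
func (b *Batcher) Tick(nowMs int64) []string {
	if len(b.buf) > 0 && nowMs-b.startedMs >= b.timeoutMs {
		return b.flush()
	}
	return nil
}

func (b *Batcher) flush() []string {
	out := b.buf
	b.buf = nil
	return out
}

// CountCalls feeds n spans, one per millisecond, through a batcher and
// returns how many outbound calls (flushed batches) result.
func CountCalls(n, size int, timeoutMs int64) int {
	b := NewBatcher(size, timeoutMs)
	calls := 0
	var now int64
	for i := 0; i < n; i++ {
		now = int64(i)
		if b.Add(fmt.Sprintf("span-%d", i), now) != nil {
			calls++
		}
	}
	if b.Tick(now+timeoutMs) != nil { // the timeout drains the remainder
		calls++
	}
	return calls
}

func main() {
	// 2500 spans with send_batch_size=1000 and timeout=5s
	fmt.Println("outbound calls:", CountCalls(2500, 1000, 5000)) // outbound calls: 3
}
```

With these settings, 2500 spans leave as two full batches plus one timeout-triggered remainder.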


Grafana Tempo for traces

Tempo replaces the Kibana APM trace view. For on-prem clusters, use local storage or MinIO:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: tempo
  namespace: observability
spec:
  interval: 10m  # required by Flux; 10m is an assumed value
  chart:
    spec:
      chart: tempo
      version: "1.x"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    tempo:
      storage:
        trace:
          backend: local
          local:
            path: /var/tempo/traces
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
    # persistence is a top-level chart value, not a key under tempo
    persistence:
      enabled: true
      size: 20Gi

For production with S3 or MinIO:

tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces
        endpoint: minio.observability.svc.cluster.local:9000
        # ${...} is only expanded when Tempo runs with -config.expand-env=true
        access_key: "${MINIO_ACCESS_KEY}"
        secret_key: "${MINIO_SECRET_KEY}"
        insecure: true

Tempo 2.x adds TraceQL, a query language for trace search that Grafana 10+ supports natively. Pin the tempo chart to 1.7.x or later to get a Tempo 2.x image.


Trace-to-log correlation in Grafana

The datasource provisioning config links trace spans to Loki log queries:

apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo.observability.svc.cluster.local:3100
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        filterByTraceID: true
        customQuery: true
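        # $$ escapes Grafana's environment-variable expansion in provisioning files;
        # attribute names containing dots need the bracket form
        query: '{service_name="$${__span.tags["service.name"]}"} | trace_id = "$${__trace.traceId}"'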
      serviceMap:
        datasourceUid: prometheus

When you click a span in Grafana, it executes the Loki query for that trace_id automatically. This is the correlation that APM vendors charged premium licensing for — here it is just config.
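
To see what Grafana sends after substituting the span's values, the following Go sketch builds the same LogQL string and a Loki range-query URL. The function names and the limit of 20 are assumptions made for illustration; Grafana does this internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// TraceLogQuery renders the LogQL selector and trace_id filter for one span.
// %q produces the double-quoted string syntax that LogQL expects.
func TraceLogQuery(serviceName, traceID string) string {
	return fmt.Sprintf(`{service_name=%q} | trace_id = %q`, serviceName, traceID)
}

// QueryRangeURL builds a Loki range-query URL for the given LogQL expression.
func QueryRangeURL(lokiBase, logQL string) string {
	v := url.Values{}
	v.Set("query", logQL)
	v.Set("limit", "20") // assumed page size
	return lokiBase + "/loki/api/v1/query_range?" + v.Encode()
}

func main() {
	q := TraceLogQuery("checkout", "4bf92f3577b34da6a3ce929d0e0e4736")
	fmt.Println(q)
	fmt.Println(QueryRangeURL("http://loki.observability.svc.cluster.local:3100", q))
}
```

Running the rendered query by hand against Loki is a quick way to tell a datasource misconfiguration apart from logs that simply lack trace_id.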

For it to work, your app's logs must carry trace_id. In .NET:

builder.Logging.AddOpenTelemetry(logging =>
{
    logging.IncludeScopes = true;
    logging.AddOtlpExporter(opts =>
        opts.Endpoint = new Uri("http://otel-collector.observability.svc.cluster.local:4317"));
});

In Go with the slog bridge:

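import (
    "context"
    "log/slog"
    "go.opentelemetry.io/contrib/bridges/otelslog"
    "go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
    "go.opentelemetry.io/otel/sdk/log"
)

// Inside main(): create the OTLP log exporter that the batch processor feeds.
otlpExporter, err := otlploggrpc.New(context.Background(),
    otlploggrpc.WithEndpoint("otel-collector.observability.svc.cluster.local:4317"),
    otlploggrpc.WithInsecure())
if err != nil {
    panic(err)
}
loggerProvider := log.NewLoggerProvider(log.WithProcessor(log.NewBatchProcessor(otlpExporter)))
slog.SetDefault(otelslog.NewLogger("my-service", otelslog.WithLoggerProvider(loggerProvider)))
// Log with slog.InfoContext(ctx, ...) so records pick up the active span's trace_id.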
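
Why the context matters can be shown with the standard library alone. The sketch below is a stand-in for what a logging bridge does, not the otelslog implementation: a hypothetical traceHandler copies IDs from the context onto each record, so a log call made without the request context carries no trace_id. SpanIDs, WithSpan, and Emit are illustrative names.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

type spanKey struct{}

// SpanIDs is a stand-in for the active span's identifiers.
type SpanIDs struct{ TraceID, SpanID string }

// WithSpan stores span identifiers in the context, as tracing middleware would.
func WithSpan(ctx context.Context, ids SpanIDs) context.Context {
	return context.WithValue(ctx, spanKey{}, ids)
}

// traceHandler wraps another slog.Handler and adds trace_id and span_id
// to every record whose context carries SpanIDs.
type traceHandler struct{ inner slog.Handler }

func (h traceHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if ids, ok := ctx.Value(spanKey{}).(SpanIDs); ok {
		r.AddAttrs(slog.String("trace_id", ids.TraceID), slog.String("span_id", ids.SpanID))
	}
	return h.inner.Handle(ctx, r)
}

func (h traceHandler) WithAttrs(a []slog.Attr) slog.Handler {
	return traceHandler{h.inner.WithAttrs(a)}
}

func (h traceHandler) WithGroup(n string) slog.Handler {
	return traceHandler{h.inner.WithGroup(n)}
}

// Emit logs one JSON line, with or without a span in the context, and
// returns the decoded fields of that line.
func Emit(inSpan bool) map[string]any {
	var buf bytes.Buffer
	logger := slog.New(traceHandler{slog.NewJSONHandler(&buf, nil)})
	ctx := context.Background()
	if inSpan {
		ctx = WithSpan(ctx, SpanIDs{TraceID: "4bf92f3577b34da6a3ce929d0e0e4736", SpanID: "00f067aa0ba902b7"})
	}
	logger.InfoContext(ctx, "order placed")
	fields := map[string]any{}
	_ = json.Unmarshal(buf.Bytes(), &fields)
	return fields
}

func main() {
	fmt.Println(Emit(true)["trace_id"])  // the trace ID from the context
	fmt.Println(Emit(false)["trace_id"]) // <nil>: no span, no correlation
}
```

In application code this translates to one habit: pass the request context to every log call that should be correlated.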


Migration path from Elastic APM

The migration is incremental. The OTel Collector can forward to both Elastic APM Server and Grafana Tempo simultaneously:

exporters:
  otlp/tempo:
    endpoint: "tempo.observability.svc.cluster.local:4317"
    tls:
      insecure: true
  otlp/elastic:
    endpoint: "apm-server.observability.svc.cluster.local:8200"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, otlp/elastic]  # dual-write during migration

Run both in parallel. Validate that Tempo shows the same traces as Kibana APM. Then remove otlp/elastic from the pipeline and decommission APM Server.
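
One way to make the validation step concrete is to compare the trace IDs each backend returns for the same time window. The Go sketch below does only the comparison; how the ID lists are obtained (for example, exported from Kibana and from a Tempo search) is left out, and MissingFrom is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// MissingFrom returns the IDs present in reference but absent from
// candidate, sorted for stable output.
func MissingFrom(reference, candidate []string) []string {
	seen := make(map[string]struct{}, len(candidate))
	for _, id := range candidate {
		seen[id] = struct{}{}
	}
	var missing []string
	for _, id := range reference {
		if _, ok := seen[id]; !ok {
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	kibana := []string{"t1", "t2", "t3"} // sample IDs, not real data
	tempo := []string{"t3", "t1"}
	fmt.Println("missing in Tempo:", MissingFrom(kibana, tempo))
}
```

An empty result over a representative window is a reasonable signal that the otlp/elastic exporter can be removed.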

Apps still using Elastic APM agents (not OTel SDK) are the harder part. Options:

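  1. Enable the agent's OpenTelemetry bridge: supported Elastic APM Java/Node.js agent versions accept ELASTIC_APM_OPENTELEMETRY_BRIDGE_ENABLED=true. The bridge lets code written against the OpenTelemetry API run on top of the Elastic agent. The agent still reports to APM Server rather than emitting OTLP, so this is a stepping stone: port manual instrumentation to the OTel API now, swap the agent for the OTel SDK later
  2. Keep APM Server temporarily: agent-instrumented services keep reporting to APM Server and Kibana while OTel-instrumented services go through the Collector. APM Server ingests OTLP but does not export it, so these traces stay in Elastic until the agent is replaced
  3. Replace the agent: rewrite instrumentation using OTel SDK; usually a day's work per service with auto-instrumentation

Option 1 is the lowest effort for supported agent versions, but only option 3 gets those services into Tempo.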


What can go wrong

Traces appear in Collector but not in Tempo

Check the exporter:

# if the Service does not expose port 8888, port-forward to a Collector pod instead
kubectl port-forward -n observability svc/otel-collector 8888:8888
curl -s http://localhost:8888/metrics | grep otelcol_exporter
# otelcol_exporter_sent_spans should be > 0
# otelcol_exporter_send_failed_spans should be 0
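
The same check can be scripted. The Go sketch below sums one metric across all label sets in a Prometheus text-format body, such as the output of the curl command above. SumMetric is a hypothetical helper; pass whichever metric name your Collector version exposes, since the exact names can differ between releases.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SumMetric adds up every sample of the named metric in a Prometheus
// text exposition body, across all label combinations.
func SumMetric(body, name string) float64 {
	var total float64
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		var metric, rest string
		if i := strings.IndexByte(line, '{'); i >= 0 {
			j := strings.LastIndexByte(line, '}')
			if j < i {
				continue // malformed label set
			}
			metric, rest = line[:i], line[j+1:]
		} else {
			parts := strings.SplitN(line, " ", 2)
			if len(parts) != 2 {
				continue
			}
			metric, rest = parts[0], parts[1]
		}
		if metric != name {
			continue
		}
		// The value is the first field after the labels; an optional
		// timestamp may follow it.
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			total += v
		}
	}
	return total
}

func main() {
	body := "# TYPE otelcol_exporter_sent_spans counter\n" +
		"otelcol_exporter_sent_spans{exporter=\"otlp/tempo\"} 1500\n" +
		"otelcol_exporter_send_failed_spans{exporter=\"otlp/tempo\"} 0\n"
	fmt.Println("sent:", SumMetric(body, "otelcol_exporter_sent_spans"))
	fmt.Println("failed:", SumMetric(body, "otelcol_exporter_send_failed_spans"))
}
```

A non-zero failed count with a zero sent count points at the backend or its endpoint rather than at the pipeline wiring.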

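If sent_spans is 0, nothing is reaching the exporter: either the apps are not sending to the Collector or the traces pipeline is miswired. The Collector does not read OTEL_LOG_LEVEL; raise its own log level in the Collector config instead:

service:
  telemetry:
    logs:
      level: debug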

Log correlation not working in Grafana

Verify trace_id is present in log entries reaching Loki:

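# run the port-forward in a separate terminal
kubectl port-forward -n observability svc/loki 3100:3100
# log queries go to the range endpoint; it defaults to the last hour
curl -s -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_name="my-service"} | json | trace_id != ""' \
  --data-urlencode 'limit=1' \
  | jq '.data.result[0].values[0]'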

If trace_id is missing, the app is either not exporting logs through OTel (only traces) or it is logging outside an active span. Logs and traces must come from the same OTel SDK setup.
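
For a broader check than a single log line, the Go sketch below counts how many lines in a Loki query response carry a trace_id, either as a stream label or as a field inside a JSON log line. It assumes the standard streams response shape of the Loki query API; CountTraceIDs and the sample payload are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lokiResponse models the parts of a Loki "streams" query response used here.
type lokiResponse struct {
	Data struct {
		Result []struct {
			Stream map[string]string `json:"stream"`
			Values [][2]string       `json:"values"` // [timestamp, line]
		} `json:"result"`
	} `json:"data"`
}

// CountTraceIDs returns how many log lines carry a non-empty trace_id and
// how many lines were returned in total.
func CountTraceIDs(body string) (withID, total int, err error) {
	var resp lokiResponse
	if err = json.Unmarshal([]byte(body), &resp); err != nil {
		return 0, 0, err
	}
	for _, s := range resp.Data.Result {
		for _, v := range s.Values {
			total++
			if s.Stream["trace_id"] != "" {
				withID++ // trace_id is a label on the whole stream
				continue
			}
			var fields map[string]any
			if json.Unmarshal([]byte(v[1]), &fields) == nil {
				if id, ok := fields["trace_id"].(string); ok && id != "" {
					withID++ // trace_id is a field inside the JSON line
				}
			}
		}
	}
	return withID, total, nil
}

func main() {
	sample := `{"status":"success","data":{"resultType":"streams","result":[
	  {"stream":{"service_name":"my-service"},
	   "values":[["1718000000000000000","{\"msg\":\"ok\",\"trace_id\":\"abc\"}"],
	             ["1718000000000000001","plain text line"]]}]}}`
	withID, total, err := CountTraceIDs(sample)
	fmt.Println(withID, total, err)
}
```

A low ratio of correlated lines usually means some code paths log without the request context.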

Collector OOMKilled

kubectl top pod -n observability -l app.kubernetes.io/name=opentelemetry-collector

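The memory_limiter processor only protects the Collector if it is placed before batch in the chain. If the order is reversed, batch accumulates spans in memory before memory_limiter can refuse them. Also keep limit_mib below the container's memory limit, otherwise the pod is killed before the limiter reacts. If the pod is still OOMKilled, raise the container limit together with limit_mib, or lower send_batch_size.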

Service map not populating

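Grafana's service map for Tempo is drawn from service graph metrics. These are generated from spans (by Tempo's metrics-generator or an equivalent Collector component) and read from the Prometheus datasource configured under serviceMap. Edges are keyed by service.name, so the attribute must be set consistently across all services: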

env:
  - name: OTEL_SERVICE_NAME
    value: "my-service"  # must be unique and stable per service

If two services share a name, their spans merge in the service map.


Summary

  • OTel replaces vendor APM agents: one SDK, one protocol (OTLP), any backend
  • OTel Collector is the routing layer — change backends without touching application code
  • Grafana stack (Tempo + Prometheus + Loki) replaces Elastic APM + Kibana at zero licensing cost
  • Trace-to-log correlation is datasource config in Grafana, not a vendor feature
  • Run dual exporters during migration to validate Tempo before cutting over
  • memory_limiter before batch in the processor chain — order matters