OpenTelemetry as APM replacement: one SDK, any backend
Published: 2026-06-11
APM tools — Elastic APM, Datadog, Dynatrace, New Relic — sell the same product: traces, metrics, and logs from your app, visualized in their UI. The cost is vendor lock-in: proprietary agents, proprietary protocols, and pricing that scales with traffic. OpenTelemetry breaks that model. One SDK, one exporter config, and you can route signals to any backend — or several at once. Migrating away from a vendor stops being a two-week refactoring job and becomes a config change in the OTel Collector.
Why vendor APM creates lock-in
A Datadog agent instruments your Go service with dd-trace-go. New Relic uses go-agent (github.com/newrelic/go-agent). Elastic APM uses apm-agent-go (go.elastic.co/apm). These are different packages with different APIs. When you switch vendors, you rewrite instrumentation code.
The deeper lock-in is the protocol. Datadog agents speak the Datadog API. New Relic agents speak the New Relic API. Even when the current vendor's UI is worse than the alternatives, the cost of re-instrumenting 50 services keeps teams on that vendor for years.
OpenTelemetry is the CNCF-graduated standard for observability signals. The SDK is vendor-neutral. The protocol (OTLP) is vendor-neutral. Instrumentation code written against OTel works with every backend that speaks OTLP — Jaeger, Grafana Tempo, Zipkin, Honeycomb, Datadog (it accepts OTLP too), Elastic, and dozens more.
What OpenTelemetry gives you
Three signal types, one SDK:
- Traces — spans representing work done by your service, linked into a trace tree across services
- Metrics — counters, histograms, gauges; same data Prometheus scrapes but with distributed context
- Logs — structured log entries correlated to traces via trace_id and span_id
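To make the log correlation concrete, here is a minimal Go sketch that uses only the standard library. It assumes the trace context arrives as a W3C `traceparent` header; `parseTraceparent` and `logLine` are hypothetical helper names, and a real service would let the OTel SDK and its log bridge do this work.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
	"strings"
)

// parseTraceparent extracts the trace and span IDs from a W3C traceparent
// header of the form "version-traceid-spanid-flags". It checks the shape
// only, not that the IDs are valid hex.
func parseTraceparent(h string) (traceID, spanID string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", errors.New("malformed traceparent header")
	}
	return parts[1], parts[2], nil
}

// logLine renders one JSON log line that carries trace_id and span_id,
// the two fields a backend needs to join a log entry to a span.
func logLine(header, msg string) (string, error) {
	traceID, spanID, err := parseTraceparent(header)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logger.Info(msg, "trace_id", traceID, "span_id", spanID)
	return buf.String(), nil
}

func main() {
	// Example header value; the IDs are illustrative.
	header := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	line, err := logLine(header, "order created")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(line)
}
```

Every log line written this way can be looked up by trace ID, which is the property the Grafana correlation later in this article relies on.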
The OTel Collector is the routing layer. Apps send all signals to it over OTLP. The Collector fans out to whichever backends you run. You can send traces to Tempo and Jaeger simultaneously during a migration. You can filter, transform, and batch signals before they reach backends.
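The fan-out idea can be sketched in a few lines of Go. This is a toy model, not Collector code: `Exporter`, `memExporter`, and `fanOut` are hypothetical names, and spans are plain strings. It shows the one property that matters during a migration: a failing backend does not block delivery to the others.

```go
package main

import (
	"errors"
	"fmt"
)

// Exporter is a hypothetical stand-in for a Collector exporter.
type Exporter interface {
	Export(batch []string) error
}

// memExporter records what it receives; when fail is set it rejects
// every batch, simulating an unavailable backend.
type memExporter struct {
	name     string
	fail     bool
	received []string
}

func (m *memExporter) Export(batch []string) error {
	if m.fail {
		return fmt.Errorf("%s: backend unavailable", m.name)
	}
	m.received = append(m.received, batch...)
	return nil
}

// fanOut delivers the same batch to every exporter. One failing backend
// does not stop delivery to the others; all errors are joined and returned.
func fanOut(batch []string, exporters ...Exporter) error {
	var errs []error
	for _, e := range exporters {
		if err := e.Export(batch); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func main() {
	tempo := &memExporter{name: "tempo"}
	jaeger := &memExporter{name: "jaeger", fail: true}
	err := fanOut([]string{"span-a", "span-b"}, tempo, jaeger)
	fmt.Println("tempo received:", len(tempo.received))
	fmt.Println("error:", err)
}
```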
Architecture: replacing Elastic APM with Grafana stack
Before:
App (elastic-apm-agent) → APM Server → Elasticsearch → Kibana APM
After:
App (OTel SDK) → otel-collector → Grafana Tempo (traces)
                                → Prometheus (metrics)
                                → Loki (logs)
Grafana becomes the single UI for all three signals, with correlation built in: click a trace span, see the logs that fired during that span, pivot to the latency histogram for that service. The APM Server and its Elasticsearch index overhead go away. Tempo stores traces efficiently in object storage. Loki stores logs cheaply.
OTel Collector: fan-out pipeline
A single Collector deployment in the observability namespace handles the whole cluster:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: otel-collector
  namespace: observability
spec:
  chart:
    spec:
      chart: opentelemetry-collector
      version: "0.x"
      sourceRef:
        kind: HelmRepository
        name: open-telemetry
        namespace: flux-system
  values:
    mode: deployment
    replicaCount: 2
    config:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      processors:
        memory_limiter:
          check_interval: 1s  # the processor requires a non-zero check interval
          limit_mib: 512
        batch:
          timeout: 5s
          send_batch_size: 1000
        resource:
          attributes:
            - key: k8s.cluster.name
              value: "prod"
              action: upsert
      exporters:
        otlp/tempo:
          endpoint: "tempo.observability.svc.cluster.local:4317"
          tls:
            insecure: true
        prometheusremotewrite:
          endpoint: "http://vmsingle.observability.svc.cluster.local:8428/api/v1/write"
        loki:
          endpoint: "http://loki.observability.svc.cluster.local:3100/loki/api/v1/push"
          default_labels_enabled:
            exporter: false
            job: true
          labels:
            resource:
              service.name: "service_name"
              k8s.namespace.name: "namespace"
      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [memory_limiter, batch, resource]
            exporters: [otlp/tempo]
          metrics:
            receivers: [otlp]
            # resource is included so metrics also carry k8s.cluster.name
            processors: [memory_limiter, batch, resource]
            exporters: [prometheusremotewrite]
          logs:
            receivers: [otlp]
            processors: [memory_limiter, batch, resource]
            exporters: [loki]
```
The resource processor stamps every signal with k8s.cluster.name. Useful when multiple clusters feed the same backend — you can filter by cluster in Grafana without changing app code.
The batch processor is not optional in practice. Without it, the Collector forwards every incoming payload as its own export request, and a service handling 1000 requests/second easily produces thousands of spans per second. Grouping them into batches of 1000 cuts the number of outgoing requests, and the CPU spent on them, substantially. Note the order: memory_limiter must come before batch, so data is refused before it is buffered. If it comes after, batches are already assembled in memory when the limit is hit.
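The effect of batching on the number of export calls can be illustrated with a toy model in Go. The `batcher` type and `countExports` helper are hypothetical and only count calls. The real processor also flushes on a timeout, which this sketch replaces with an explicit final `Flush`.

```go
package main

import "fmt"

// batcher is a toy model of the batch processor: it buffers spans and
// calls export once per full batch instead of once per span.
type batcher struct {
	size   int
	buf    []string
	export func(batch []string)
}

// Add buffers one span and flushes when the batch is full.
func (b *batcher) Add(span string) {
	b.buf = append(b.buf, span)
	if len(b.buf) >= b.size {
		b.Flush()
	}
}

// Flush exports whatever is buffered. The real processor also does this
// when its timeout elapses.
func (b *batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.export(b.buf)
	b.buf = nil
}

// countExports pushes n spans through a batcher and reports how many
// export calls were made.
func countExports(n, batchSize int) int {
	calls := 0
	b := &batcher{size: batchSize, export: func([]string) { calls++ }}
	for i := 0; i < n; i++ {
		b.Add(fmt.Sprintf("span-%d", i))
	}
	b.Flush()
	return calls
}

func main() {
	fmt.Println(countExports(2500, 1))    // 2500: one export per span
	fmt.Println(countExports(2500, 1000)) // 3: two full batches plus the remainder
}
```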
Grafana Tempo for traces
Tempo replaces the Kibana APM trace view. For on-prem clusters, use local storage or MinIO:
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: tempo
  namespace: observability
spec:
  chart:
    spec:
      chart: tempo
      version: "1.x"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    tempo:
      storage:
        trace:
          backend: local
          local:
            path: /var/tempo/traces
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
    persistence:
      enabled: true
      size: 20Gi
```
For production with S3 or MinIO:
```yaml
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces
        endpoint: minio.observability.svc.cluster.local:9000
        access_key: "${MINIO_ACCESS_KEY}"
        secret_key: "${MINIO_SECRET_KEY}"
        insecure: true
```
Tempo 2.x adds TraceQL, a query language for trace search that Grafana 10+ uses natively. Pin the tempo Helm chart to 1.7.x or later to get Tempo 2.x.
Trace-to-log correlation in Grafana
The datasource provisioning config links trace spans to Loki log queries:
```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo.observability.svc.cluster.local:3100
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        filterByTraceID: true
        customQuery: true
        # "$$" escapes "$" so provisioning does not expand it as an environment
        # variable; tag names that contain dots need the bracket syntax.
        query: '{service_name="$${__span.tags["service.name"]}"} | trace_id = "$${__trace.traceId}"'
      serviceMap:
        datasourceUid: prometheus
```
When you click a span in Grafana, it executes the Loki query for that trace_id automatically. This is the correlation that APM vendors charged premium licensing for — here it is just config.
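What Grafana derives from a clicked span can be sketched as follows. This is an illustration of the datasource settings above, not Grafana's implementation: `logQuery`, `buildLogQuery`, and `exampleQuery` are hypothetical names, and the service name and trace ID are made-up values.

```go
package main

import (
	"fmt"
	"time"
)

// logQuery describes the Loki request derived from one span.
type logQuery struct {
	Expr  string
	Start time.Time
	End   time.Time
}

// buildLogQuery mirrors the datasource settings: the LogQL expression
// filters by service and trace ID, and the time range is the span's own
// range widened by the start and end shifts.
func buildLogQuery(service, traceID string, spanStart, spanEnd time.Time,
	startShift, endShift time.Duration) logQuery {
	return logQuery{
		Expr:  fmt.Sprintf(`{service_name=%q} | trace_id = %q`, service, traceID),
		Start: spanStart.Add(startShift),
		End:   spanEnd.Add(endShift),
	}
}

// exampleQuery builds the query for a hypothetical one-minute span using
// the "-1m" and "1m" shifts from the provisioning config.
func exampleQuery() logQuery {
	start := time.Date(2026, 6, 11, 12, 0, 0, 0, time.UTC)
	end := start.Add(time.Minute)
	return buildLogQuery("checkout", "4bf92f3577b34da6a3ce929d0e0e4736",
		start, end, -time.Minute, time.Minute)
}

func main() {
	q := exampleQuery()
	fmt.Println(q.Expr)
	fmt.Println(q.Start.Format(time.RFC3339), "to", q.End.Format(time.RFC3339))
}
```

The shifts exist because log timestamps and span timestamps rarely line up exactly; widening the window by a minute on each side catches entries written just before the span started or just after it ended.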
For it to work, your app's logs must carry trace_id. In .NET:
```csharp
builder.Logging.AddOpenTelemetry(logging =>
{
    logging.IncludeScopes = true;
    logging.AddOtlpExporter(opts =>
        opts.Endpoint = new Uri("http://otel-collector.observability.svc.cluster.local:4317"));
});
```
In Go with the slog bridge:
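```go
import (
	"log/slog"

	"go.opentelemetry.io/contrib/bridges/otelslog"
	"go.opentelemetry.io/otel/sdk/log"
)

// otlpExporter is an OTLP log exporter created elsewhere in your setup code.
loggerProvider := log.NewLoggerProvider(
	log.WithProcessor(log.NewBatchProcessor(otlpExporter)),
)
// Log with the context-aware slog methods (for example slog.InfoContext)
// so the bridge can read the active span from the context.
slog.SetDefault(otelslog.NewLogger("my-service", otelslog.WithLoggerProvider(loggerProvider)))
```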
Migration path from Elastic APM
The migration is incremental. The OTel Collector can forward to both Elastic APM Server and Grafana Tempo simultaneously:
```yaml
exporters:
  otlp/tempo:
    endpoint: "tempo.observability.svc.cluster.local:4317"
    tls:
      insecure: true
  otlp/elastic:
    endpoint: "apm-server.observability.svc.cluster.local:8200"
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, otlp/elastic]  # dual-write during migration
```
Run both in parallel. Validate that Tempo shows the same traces as Kibana APM. Then remove otlp/elastic from the pipeline and decommission APM Server.
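One way to make the validation step less manual is to compare trace IDs sampled from both backends over the same time window. The sketch below assumes you have already collected the two ID lists by whatever means you prefer; `missingFrom` is a hypothetical helper and the IDs are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// missingFrom returns the trace IDs present in reference but absent from
// candidate, sorted for stable output.
func missingFrom(reference, candidate []string) []string {
	seen := make(map[string]struct{}, len(candidate))
	for _, id := range candidate {
		seen[id] = struct{}{}
	}
	var missing []string
	for _, id := range reference {
		if _, ok := seen[id]; !ok {
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	kibana := []string{"t1", "t2", "t3"} // IDs sampled from Kibana APM (placeholder values)
	tempo := []string{"t3", "t1"}        // IDs sampled from Tempo (placeholder values)
	fmt.Println("missing in Tempo:", missingFrom(kibana, tempo))
}
```

An empty result over a representative window is a reasonable signal that the Tempo pipeline is receiving what APM Server receives. Sampling differences between the two backends would show up here as missing IDs.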
Apps still using Elastic APM agents (not the OTel SDK) are the harder part. Options:
- Redirect the agent to send OTLP: Elastic APM Java/Node agents 1.40+ support OTLP output mode (ELASTIC_APM_OPENTELEMETRY_BRIDGE_ENABLED=true). Confirm this against the agent documentation for your version, since the bridge setting is documented as routing OTel API calls through the Elastic agent.
- Keep APM Server temporarily: Elastic APM agents → APM Server → OTel Collector (APM Server has an OTLP exporter)
- Replace the agent: rewrite instrumentation using the OTel SDK; usually a day's work per service with auto-instrumentation
Option 1 is the lowest effort for supported agent versions.
What can go wrong
Traces appear in Collector but not in Tempo
Check the exporter:
```bash
kubectl port-forward -n observability svc/otel-collector 8888:8888
curl http://localhost:8888/metrics | grep otelcol_exporter
# otelcol_exporter_sent_spans should be > 0
# otelcol_exporter_send_failed_spans should be 0
```
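If sent_spans is 0, spans are not reaching the exporter: check that apps send to the right endpoint and that the receiver and exporter are both wired into the traces pipeline. The Collector's own log level is set in its config under service.telemetry, not through an environment variable. Enable debug logging in the chart values:
```yaml
config:
  service:
    telemetry:
      logs:
        level: debug
```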
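If you want to script the exporter check instead of reading grep output, a small parser over the Prometheus text format is enough. The sketch below is a minimal Go version; `sumMetric` and `exporterHealthy` are hypothetical helpers, and `sampleMetrics` is an invented excerpt rather than real Collector output. Some Collector versions append a `_total` suffix to these counter names, so adjust the names to what your `/metrics` endpoint actually returns.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleMetrics is an invented excerpt of the Collector's /metrics output.
const sampleMetrics = `# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="otlp/tempo"} 1200
otelcol_exporter_sent_spans{exporter="otlp/elastic"} 300
# TYPE otelcol_exporter_send_failed_spans counter
otelcol_exporter_send_failed_spans{exporter="otlp/tempo"} 0
otelcol_exporter_send_failed_spans{exporter="otlp/elastic"} 25
`

// sumMetric adds up every sample of the named metric across all label sets.
func sumMetric(body, name string) float64 {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (no labels).
		end := strings.IndexAny(line, "{ ")
		if end < 0 || line[:end] != name {
			continue
		}
		// The sample value is the first field after the label set, if any.
		rest := line[end:]
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			total += v
		}
	}
	return total
}

// exporterHealthy applies the two checks from the text: spans were sent
// and none failed.
func exporterHealthy(body string) bool {
	return sumMetric(body, "otelcol_exporter_sent_spans") > 0 &&
		sumMetric(body, "otelcol_exporter_send_failed_spans") == 0
}

func main() {
	fmt.Println(sumMetric(sampleMetrics, "otelcol_exporter_sent_spans"))        // 1500
	fmt.Println(sumMetric(sampleMetrics, "otelcol_exporter_send_failed_spans")) // 25
	fmt.Println(exporterHealthy(sampleMetrics))                                 // false
}
```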
Log correlation not working in Grafana
Verify trace_id is present in log entries reaching Loki:
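```bash
kubectl port-forward -n observability svc/loki 3100:3100
# query_range is used because Loki serves log queries as range queries
curl -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_name="my-service"} | json | trace_id != ""' \
  --data-urlencode 'limit=1' \
  | jq '.data.result[0].values[0]'
```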
If trace_id is missing, the app is not exporting logs through OTel — it is only exporting traces. Both must use the same OTel SDK setup.
Collector OOMKilled
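```bash
kubectl top pod -n observability -l app.kubernetes.io/name=opentelemetry-collector
```
The memory_limiter processor only protects the Collector if it is placed before batch in the chain. If the order is reversed, batch accumulates spans in memory before memory_limiter can refuse them. Keep limit_mib below the container's memory limit so the limiter acts before the kernel does. If the pod is still OOMKilled, raise the container limit together with limit_mib, or lower send_batch_size.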
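Why the order matters can be shown with a toy pipeline in Go. This is a simplification under stated assumptions: the `pipeline` type is hypothetical and counts payload bytes per span, whereas the real memory_limiter checks the process's memory usage at its check interval. The point it illustrates is the same: data must be refused before it is buffered.

```go
package main

import (
	"errors"
	"fmt"
)

var errMemoryLimit = errors.New("memory limit reached, data refused")

// pipeline is a toy model: a limiter in front of a batch buffer.
type pipeline struct {
	limitBytes int      // plays the role of limit_mib
	usedBytes  int      // bytes currently held by the batch buffer
	buffered   [][]byte // spans waiting to be exported
}

// Consume applies the limiter first, then hands the span to the batcher.
// Refusing before buffering is what keeps memory bounded; the sender
// sees the error and can retry later.
func (p *pipeline) Consume(span []byte) error {
	if p.usedBytes+len(span) > p.limitBytes {
		return errMemoryLimit
	}
	p.buffered = append(p.buffered, span)
	p.usedBytes += len(span)
	return nil
}

// Flush models the batch being exported, which frees the buffer.
// It returns the number of spans exported.
func (p *pipeline) Flush() int {
	n := len(p.buffered)
	p.buffered = nil
	p.usedBytes = 0
	return n
}

func main() {
	p := &pipeline{limitBytes: 10}
	for i := 0; i < 4; i++ {
		err := p.Consume([]byte("abcd")) // 4 bytes per span
		fmt.Println("span", i, "error:", err)
	}
	fmt.Println("flushed:", p.Flush())
}
```

With a 10-byte limit and 4-byte spans, the first two spans are buffered and the rest are refused until the buffer is flushed. If the check ran after buffering instead, all four would already be in memory by the time the limit was noticed.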
Service map not populating
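Grafana's service map is drawn from service graph metrics that Tempo's metrics-generator derives from spans and writes to the Prometheus datasource configured under serviceMap, so the generator must be enabled. The edges are keyed by service.name, which must be set consistently across all services: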
```yaml
env:
  - name: OTEL_SERVICE_NAME
    value: "my-service"  # must be unique and stable per service
```
If two services share a name, their spans merge in the service map.
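A quick way to catch name collisions before they reach the service map is to check the configured names up front. The sketch below assumes you can produce a map of workload name to its OTEL_SERVICE_NAME value (for example from your manifests); `duplicateServiceNames` and the workload names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// duplicateServiceNames takes a map of workload name to the
// OTEL_SERVICE_NAME it sets and returns every service name that is
// claimed by more than one workload.
func duplicateServiceNames(workloads map[string]string) map[string][]string {
	byName := make(map[string][]string)
	for workload, name := range workloads {
		byName[name] = append(byName[name], workload)
	}
	dups := make(map[string][]string)
	for name, owners := range byName {
		if len(owners) > 1 {
			sort.Strings(owners) // map iteration order is random; sort for stable output
			dups[name] = owners
		}
	}
	return dups
}

func main() {
	// Hypothetical workloads and the service names they export.
	workloads := map[string]string{
		"checkout-api":    "checkout",
		"checkout-worker": "checkout",
		"cart-api":        "cart",
	}
	fmt.Println(duplicateServiceNames(workloads))
}
```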
Summary
- OTel replaces vendor APM agents: one SDK, one protocol (OTLP), any backend
- OTel Collector is the routing layer — change backends without touching application code
- Grafana stack (Tempo + Prometheus + Loki) replaces Elastic APM + Kibana at zero licensing cost
- Trace-to-log correlation is datasource config in Grafana, not a vendor feature
- Run dual exporters during migration to validate Tempo before cutting over
- memory_limiter before batch in the processor chain — order matters