Full observability on one VPS: metrics, logs, alerts, dashboards — 452 MiB requested

Published: 2026-06-13

Over the last week this k0s cluster grew a complete observability stack: VictoriaMetrics for metrics, VictoriaLogs for logs, Promtail shipping Traefik access logs, vmalert firing into Telegram, and Grafana drawing it all. Each piece got its own post; this one is the map — how the parts connect, what the whole thing costs in RAM and CPU on a 2-core VPS that also runs nine websites, a mail server, and six proxies, and which decisions repeat across every component.


The whole picture

                    ┌──────────────────────────────────────────────┐
                    │                 Grafana :3000                │
                    │  datasources: VictoriaMetrics + VictoriaLogs │
                    └──────────┬─────────────────────┬─────────────┘
                               │ PromQL/MetricsQL    │ LogsQL
                               ▼                     ▼
        ┌─────────────────────────────┐   ┌──────────────────────┐
        │   vmsingle-server :8428     │   │  victoria-logs :9428 │
        │   scrape + store, 90d       │   │  store, 7d           │
        └──┬──────────────────────┬───┘   └──────────▲───────────┘
           │ scrapes              │ rules            │ loki push
           ▼                      ▼                  │
  node-exporter            ┌──────────┐       ┌──────┴─────┐
  kube-state-metrics       │ vmalert  │       │  promtail  │
  traefik :9101            └────┬─────┘       │ DaemonSet  │
  blackbox-exporter             │             └──────▲─────┘
  kubelet cadvisor              ▼                    │ tails
  proxy exporters        ┌──────────────┐    /var/log/pods/
                         │ alertmanager │──▶ Telegram   traefik_traefik-*
                         └──────────────┘

Two storage systems, one for each signal type. Both are VictoriaMetrics-family single binaries, both store on hostPath PVs, both are queried by the same Grafana. Everything below them is stateless plumbing.

The resource bill

The numbers that justify the whole architecture, straight from the manifests:

Component            CPU req   RAM req   CPU limit   RAM limit
vmsingle             50m       128Mi     500m        512Mi
victoria-logs        10m       32Mi      200m        128Mi
promtail             10m       32Mi      100m        64Mi
vmalert              10m       32Mi      200m        128Mi
alertmanager         10m       32Mi      100m        64Mi
node-exporter        10m       20Mi      100m        64Mi
kube-state-metrics   10m       32Mi      200m        128Mi
blackbox-exporter    10m       16Mi      100m        64Mi
grafana              50m       128Mi     500m        512Mi
Total                170m      452Mi     2000m       1664Mi

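452 MiB of requests for nine components. A default kube-prometheus-stack install requests more than that for Prometheus alone. The limits are overcommitted: the CPU limits alone add up to 2000m, the node's full two cores, before any of the actual workloads get a share. That's fine: limits are ceilings, the components never spike together, and requests are what the scheduler reasons about.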
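
As a sketch, the vmsingle row of the table maps to a container resources block like the one below. The numbers are the ones from the table; where the block sits in a chart's values differs per chart, so treat the placement as an assumption.

```yaml
# Container-level resources for vmsingle, values taken from the table.
# The scheduler places the pod using requests; limits are only ceilings.
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
```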

Metrics path

vmsingle scrapes everything itself — its embedded scraper takes standard Prometheus scrape_configs, so there is no Prometheus and no operator. Targets: node-exporter, kube-state-metrics, Traefik, kubelet cAdvisor (with a small ClusterRole for the kubelet API), two proxy exporters, itself, vmalert, and three blackbox jobs probing nine sites over HTTP, four proxy ports over TCP, and external DNS for connectivity. Retention 90 days on a 20 GiB hostPath volume.
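
A minimal sketch of what such a scrape file can look like, assuming the usual blackbox-exporter relabeling pattern. The Service names, the http_2xx module name, and the example.com target are placeholders, not values from the real manifests; only the ports 9100, 9101, and 9115 are the exporters' usual ones.

```yaml
scrape_configs:
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]        # hypothetical Service name

  - job_name: traefik
    static_configs:
      - targets: ["traefik:9101"]              # hypothetical Service name

  # One of the three blackbox jobs: HTTP probes of the public sites.
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]                       # assumed module name
    static_configs:
      - targets:
          - https://example.com                # placeholder for a real site
    relabel_configs:
      # Pass the listed URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself, not the site.
      - target_label: __address__
        replacement: blackbox-exporter:9115    # hypothetical Service name
```

The TCP and DNS jobs would follow the same shape with a different module and target list.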

vmalert evaluates the rules — site down, slow response, SSL expiring in under 14 days, proxy port down, external connectivity degraded, RAM/disk/CPU pressure, disk-will-fill-in-4h prediction, pods crash-looping, replica mismatch — and pushes firing alerts to alertmanager, which delivers to Telegram with a 🔴/✅ template and an inhibit rule so a critical mutes its own warning.
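
Two of those rules and the inhibit rule might look roughly like this. It is a sketch, not the actual rule files: the alert names, the for: durations, the filesystem filter, and the severity label values are assumptions, while probe_ssl_earliest_cert_expiry and node_filesystem_avail_bytes are the standard blackbox-exporter and node-exporter metrics.

```yaml
# vmalert rule file (sketch)
groups:
  - name: sketch
    rules:
      - alert: SSLCertExpiringSoon             # hypothetical name
        # Seconds until expiry, compared against 14 days.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in under 14 days"

      - alert: DiskWillFillIn4h                # hypothetical name
        # Extrapolate the last hour of free space 4 hours ahead.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: critical
---
# alertmanager.yml fragment (sketch)
# Assumes the warning and critical variants of a rule share an alertname.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [alertname, instance]
```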

Logs path

Traefik writes JSON access logs to stdout (logs.access.format: json in the chart values). The kubelet lands them in /var/log/pods/traefik_traefik-*/. Promtail tails that glob — no Kubernetes service discovery, a static path — parses the JSON, promotes RequestMethod and DownstreamStatus (and ClientHost, which only VictoriaLogs' cardinality tolerance makes safe) to labels, drops a known noisy error line, and pushes to VictoriaLogs over the Loki protocol. Retention 7 days on a 5 GiB volume: logs answer "what just happened", metrics keep the history.
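
The Promtail side of that path, sketched under a few assumptions: the Service name victoria-logs, the /*/*.log suffix on the glob (the usual kubelet layout of container directory plus rotated file), the cri stage (containerd writes CRI-formatted lines), and the drop expression, which is a placeholder because the real noisy line is not quoted here.

```yaml
clients:
  # VictoriaLogs accepts the Loki push protocol on this path.
  - url: http://victoria-logs:9428/insert/loki/api/v1/push

scrape_configs:
  - job_name: traefik-access
    static_configs:
      - targets: [localhost]
        labels:
          job: traefik-access
          __path__: /var/log/pods/traefik_traefik-*/*/*.log
    pipeline_stages:
      # Strip the CRI prefix (timestamp, stream, flags) from each line.
      - cri: {}
      # Drop the known noisy line before doing any parsing.
      - drop:
          expression: ".*PLACEHOLDER noisy error text.*"
      # Extract fields from the Traefik JSON access log entry.
      - json:
          expressions:
            RequestMethod: RequestMethod
            DownstreamStatus: DownstreamStatus
            ClientHost: ClientHost
      # Promote the extracted fields to labels.
      - labels:
          RequestMethod:
          DownstreamStatus:
          ClientHost:
```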

The split matters: nothing long-lived is derived from logs. Request rates, error percentages, and latency histograms come from Traefik's own Prometheus metrics endpoint. Those series are cheaper to store and query than the same figures computed from access logs would be.

Dashboards

Grafana is provisioned, not clicked together:

  • Datasources come from the Helm values — VictoriaMetrics (Prometheus-type, default) and VictoriaLogs via the victoriametrics-logs-datasource plugin.
  • Dashboards are ConfigMaps with the label grafana_dashboard: "1", picked up by the sidecar. Five of them: node, Kubernetes, Traefik, proxies, VPN.
  • Anonymous access is viewer-only; editing happens in git.
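
In Helm values terms, those three bullets come out roughly as below. This is a sketch against the upstream Grafana chart's value keys; the datasource names and uid values are hypothetical, and the URLs assume the Service names used elsewhere in this post.

```yaml
plugins:
  - victoriametrics-logs-datasource

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: VictoriaMetrics
        uid: victoriametrics                   # hypothetical, but pinned
        type: prometheus
        url: http://vmsingle-stable:8428
        isDefault: true
      - name: VictoriaLogs
        uid: victorialogs                      # hypothetical, but pinned
        type: victoriametrics-logs-datasource
        url: http://victoria-logs:9428

sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard                   # ConfigMaps carrying this label...
    labelValue: "1"                            # ...with this value are loaded

grafana.ini:
  auth.anonymous:
    enabled: true
    org_role: Viewer
```

Pinning uid in the provisioning file is what lets dashboard JSON in git refer to a datasource without depending on a generated identifier.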

The Traefik dashboard is where both signals meet: per-service RPS, latency, error rate, and bandwidth panels from metrics — the service label de-hashed at scrape time with a relabel rule — and two logs panels at the bottom fed by LogsQL:

{job="traefik-access"}                                    # live access log
{job="traefik-access"} DownstreamStatus:~"[45][0-9][0-9]" # errors only

Clicking from "error rate went up" to "here are the actual failing requests" without leaving the dashboard is the payoff of the whole week.

Decisions that repeat

Looking back at the four posts, the same few choices show up in every component:

Single-binary over distributed. vmsingle instead of Prometheus+operator, victoria-logs single node instead of a cluster, promtail instead of an agent pipeline. On one node, every coordination layer is pure overhead.

Static config over discovery. Scrape targets are static_configs, the log path is a glob, dashboards and datasources are files in git. Discovery mechanisms earn their complexity when things come and go; here nothing does.

hostPath PV + Retain + explicit volumeName binding. No storage class, no provisioner, data survives helm uninstall, and storageClassName: "" on the PVC keeps it from waiting for a provisioner that doesn't exist.
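
A sketch of that pattern for the 20 GiB metrics volume. The object names, the namespace, and the host path are hypothetical; the parts that carry the pattern are persistentVolumeReclaimPolicy, storageClassName: "", and volumeName.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmsingle-data                  # hypothetical
spec:
  capacity:
    storage: 20Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain  # keep the data when the claim goes away
  storageClassName: ""
  hostPath:
    path: /var/lib/vmsingle            # hypothetical path on the node
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vmsingle-data                  # hypothetical
  namespace: monitoring                # hypothetical
spec:
  storageClassName: ""                 # opt out of dynamic provisioning
  volumeName: vmsingle-data            # bind to exactly this PV
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
```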

Bounded memory everywhere. Every component has limits; vmsingle additionally caps its caches with memory.allowedPercent: 20. On a shared node, an unbounded observability stack is the first thing to OOM the workloads it's supposed to watch.
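
The cache cap is a single vmsingle flag, -memory.allowedPercent. The fragment below assumes a chart that exposes an extraArgs map under server, as the victoria-metrics-single chart does; with a plain Deployment the same setting would be the container argument -memory.allowedPercent=20.

```yaml
server:
  extraArgs:
    # Let caches use at most 20% of the memory visible to the process.
    memory.allowedPercent: "20"
```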

Stable ClusterIP names. Grafana and vmalert talk to vmsingle-stable, a plain ClusterIP in front of the chart's headless Service — headless DNS hands out pod IPs that go stale across CoreDNS restarts.
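
The extra Service is only a few lines. In this sketch the selector label is an assumption and has to match whatever labels the chart actually puts on the vmsingle pod.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vmsingle-stable
spec:
  type: ClusterIP                      # one virtual IP that stays put
  selector:
    app.kubernetes.io/name: victoria-metrics-single   # assumed pod label
  ports:
    - name: http
      port: 8428
      targetPort: 8428
```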

What can go wrong

The stack monitors everything except itself dying. If the node goes down, vmalert goes down with it and no Telegram alert fires. The external check is Beszel's agent-disconnect alert plus blackbox probes of the public sites — but blackbox also runs on the same node. True dead-node detection needs one prober outside the box; an uptime-checker hitting status.antonnovikov.com covers it.

Grafana shows metrics but logs panels are empty. Three-step diagnosis: does VMUI at victoria-logs:9428/select/vmui return data (storage ok)? Does Promtail's /targets page show the file as active (shipping ok)? Does the dashboard's datasource UID match the provisioned one (wiring ok)? In my case it was the third — pin the UID.

Everything restarts at once after a node reboot. Nine components racing to start on two cores means liveness probes time out and pods restart in waves. Generous initialDelaySeconds on the stateful pods (vmsingle, victoria-logs, grafana) breaks the loop; the stateless ones can thrash harmlessly.
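
For vmsingle, a forgiving liveness probe could look like this. The timings are illustrative, not the values from my manifests; /health is vmsingle's built-in health endpoint.

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8428
  initialDelaySeconds: 60   # illustrative: let startup finish on a busy node
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 5       # tolerate a few slow answers before restarting
```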

Summary

  • Two single-binary databases — VictoriaMetrics for metrics (90d), VictoriaLogs for logs (7d) — cover both signals for 160 MiB of requested RAM between them
  • The full nine-component stack requests 452 MiB / 170m CPU and coexists with real workloads on a 2-core VPS
  • Promtail bridges the two worlds: Traefik JSON access logs → Loki protocol → VictoriaLogs, queryable next to the metrics in one Grafana
  • Alerts flow vmalert → alertmanager → Telegram, with SSL expiry and disk-fill prediction the two that have actually paid for themselves
  • Same patterns everywhere: single binary, static config, hostPath+Retain, bounded memory, stable Service names
  • The remaining gap is self-monitoring — a prober outside the node