Full observability on one VPS: metrics, logs, alerts, dashboards — 452 MiB requested
Published: 2026-06-13
Over the last week this k0s cluster grew a complete observability stack: VictoriaMetrics for metrics, VictoriaLogs for logs, Promtail shipping Traefik access logs, vmalert firing into Telegram, and Grafana drawing it all. Each piece got its own post; this one is the map — how the parts connect, what the whole thing costs in RAM and CPU on a 2-core VPS that also runs nine websites, a mail server, and six proxies, and which decisions repeat across every component.
The whole picture
          ┌──────────────────────────────────────────────┐
          │                Grafana :3000                 │
          │ datasources: VictoriaMetrics + VictoriaLogs  │
          └──────────┬─────────────────────┬─────────────┘
                     │ PromQL/MetricsQL    │ LogsQL
                     ▼                     ▼
┌─────────────────────────────┐          ┌──────────────────────┐
│ vmsingle-server :8428       │          │ victoria-logs :9428  │
│ scrape + store, 90d         │          │ store, 7d            │
└──┬──────────────────────┬───┘          └──────────▲───────────┘
   │ scrapes              │ rules                   │ loki push
   ▼                      ▼                         │
node-exporter        ┌──────────┐            ┌──────┴─────┐
kube-state-metrics   │ vmalert  │            │ promtail   │
traefik :9101        └────┬─────┘            │ DaemonSet  │
blackbox-exporter         │                  └──────▲─────┘
kubelet cadvisor          ▼                         │ tails
proxy exporters    ┌──────────────┐              /var/log/pods/
                   │ alertmanager │──▶ Telegram  traefik_traefik-*
                   └──────────────┘
Two storage systems, one for each signal type. Both are VictoriaMetrics-family single binaries, both store on hostPath PVs, both are queried by the same Grafana. Everything below them is stateless plumbing.
The resource bill
The numbers that justify the whole architecture, straight from the manifests:
| Component | CPU req | RAM req | CPU limit | RAM limit |
|---|---|---|---|---|
| vmsingle | 50m | 128Mi | 500m | 512Mi |
| victoria-logs | 10m | 32Mi | 200m | 128Mi |
| promtail | 10m | 32Mi | 100m | 64Mi |
| vmalert | 10m | 32Mi | 200m | 128Mi |
| alertmanager | 10m | 32Mi | 100m | 64Mi |
| node-exporter | 10m | 20Mi | 100m | 64Mi |
| kube-state-metrics | 10m | 32Mi | 200m | 128Mi |
| blackbox-exporter | 10m | 16Mi | 100m | 64Mi |
| grafana | 50m | 128Mi | 500m | 512Mi |
| Total | 170m | 452Mi | 2000m | 1664Mi |
452 MiB of requests for nine components. A default kube-prometheus-stack install requests more than that for Prometheus alone. The limits are overcommitted: the CPU limits alone add up to both of the node's cores, on a box that also runs everything else. That's fine in practice, since the components don't spike together, and requests are what the scheduler reasons about.
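For reference, this is roughly what one row of that table looks like in a manifest. The numbers are the vmsingle row; where the stanza sits (a Helm values key or a raw container spec) depends on the chart, so treat the placement as an assumption.

```yaml
# Container resources for vmsingle, copied from the table row above.
# requests: what the scheduler reserves on the node.
# limits:   the ceiling the container is throttled (CPU) or OOM-killed (RAM) at.
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
```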
Metrics path
vmsingle scrapes everything itself — its embedded scraper takes standard Prometheus scrape_configs, so there is no Prometheus and no operator. Targets: node-exporter, kube-state-metrics, Traefik, kubelet cAdvisor (with a small ClusterRole for the kubelet API), two proxy exporters, itself, vmalert, and three blackbox jobs probing nine sites over HTTP, four proxy ports over TCP, and external DNS for connectivity. Retention 90 days on a 20 GiB hostPath volume.
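A minimal sketch of what such a scrape config can look like, using only standard Prometheus scrape_configs syntax. The Service hostnames, the monitoring namespace, and the probed URL are placeholders I made up for illustration, not the real targets.

```yaml
scrape_configs:
  # Plain static target: no service discovery involved.
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter.monitoring.svc:9100"]  # hypothetical Service name

  # Blackbox HTTP probe: the site URL is passed as ?target=...,
  # and the scrape itself goes to the blackbox exporter.
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["https://example.com"]  # placeholder for one of the nine sites
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.monitoring.svc:9115  # hypothetical Service name
```

The TCP and DNS jobs follow the same relabeling pattern with a different blackbox module.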
vmalert evaluates the rules — site down, slow response, SSL expiring in under 14 days, proxy port down, external connectivity degraded, RAM/disk/CPU pressure, disk-will-fill-in-4h prediction, pods crash-looping, replica mismatch — and pushes firing alerts to alertmanager, which delivers to Telegram with a 🔴/✅ template and an inhibit rule so a critical mutes its own warning.
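Two of those rules, sketched in the Prometheus rule format that vmalert reads. The alert names, the for durations, the severity labels, and the filesystem filter are my assumptions here; only the 14-day and 4-hour thresholds come from the description above.

```yaml
groups:
  - name: example-rules  # hypothetical group name
    rules:
      # probe_ssl_earliest_cert_expiry is exported by blackbox-exporter
      # as a Unix timestamp, so subtracting time() gives seconds left.
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires in under 14 days"

      # Linear extrapolation of the last hour of free space, 4 hours ahead.
      - alert: DiskWillFillIn4h
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4 * 3600) < 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is on track to fill within 4 hours"
```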
Logs path
Traefik writes JSON access logs to stdout (logs.access.format: json in the chart values). The kubelet lands them in /var/log/pods/traefik_traefik-*/. Promtail tails that glob — no Kubernetes service discovery, a static path — parses the JSON, promotes RequestMethod and DownstreamStatus (and ClientHost, which only VictoriaLogs' cardinality tolerance makes safe) to labels, drops a known noisy error line, and pushes to VictoriaLogs over the Loki protocol. Retention 7 days on a 5 GiB volume: logs answer "what just happened", metrics keep the history.
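Put together, the Promtail side might look like the sketch below. The exact glob depth, the cri stage (assumed because k0s runs containerd, which wraps every line in CRI format), and the drop pattern are assumptions; the drop expression is deliberately a placeholder, since the real noisy line is not shown here.

```yaml
clients:
  # VictoriaLogs accepts the Loki push protocol on this path.
  - url: http://victoria-logs:9428/insert/loki/api/v1/push

scrape_configs:
  - job_name: traefik-access
    static_configs:
      - targets: [localhost]
        labels:
          job: traefik-access
          # Static glob, no Kubernetes service discovery.
          __path__: /var/log/pods/traefik_traefik-*/*/*.log
    pipeline_stages:
      - cri: {}            # unwrap the containerd log line first
      - json:              # then parse Traefik's JSON access log
          expressions:
            RequestMethod: RequestMethod
            DownstreamStatus: DownstreamStatus
            ClientHost: ClientHost
      - labels:            # promote the extracted fields to labels
          RequestMethod:
          DownstreamStatus:
          ClientHost:
      - drop:
          expression: "REPLACE_WITH_NOISY_LINE_REGEX"  # placeholder, not the real pattern
```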
The split matters: nothing long-lived is derived from logs. Request rates, error percentages, and latency histograms come from Traefik's own Prometheus metrics endpoint, which is cheaper to store and query than computing them from access logs ever would be.
Dashboards
Grafana is provisioned, not clicked together:
- Datasources come from the Helm values: VictoriaMetrics (Prometheus-type, default) and VictoriaLogs via the victoriametrics-logs-datasource plugin.
- Dashboards are ConfigMaps with the label grafana_dashboard: "1", picked up by the sidecar. Five of them: node, Kubernetes, Traefik, proxies, VPN.
- Anonymous access is viewer-only; editing happens in git.
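In Grafana Helm chart values, those three points come out roughly as follows. The uid values are hypothetical (the point is only that they are pinned), and the datasource URLs assume the Service names used elsewhere in this post.

```yaml
plugins:
  - victoriametrics-logs-datasource

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: VictoriaMetrics
        type: prometheus
        uid: victoriametrics        # hypothetical, pinned so dashboards can reference it
        access: proxy
        url: http://vmsingle-stable:8428
        isDefault: true
      - name: VictoriaLogs
        type: victoriametrics-logs-datasource
        uid: victorialogs           # hypothetical, pinned
        access: proxy
        url: http://victoria-logs:9428

sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard        # ConfigMaps carrying this label get loaded

grafana.ini:
  auth.anonymous:
    enabled: true
    org_role: Viewer                # read-only for anonymous visitors
```

Pinning uid matters because dashboard JSON refers to datasources by UID, not by name.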
The Traefik dashboard is where both signals meet: per-service RPS, latency, error rate, and bandwidth panels from metrics — the service label de-hashed at scrape time with a relabel rule — and two logs panels at the bottom fed by LogsQL:
{job="traefik-access"} # live access log
{job="traefik-access"} DownstreamStatus:~"[45][0-9][0-9]" # errors only
Clicking from "error rate went up" to "here are the actual failing requests" without leaving the dashboard is the payoff of the whole week.
Decisions that repeat
Looking back at the four posts, the same few choices show up in every component:
Single-binary over distributed. vmsingle instead of Prometheus+operator, victoria-logs single node instead of a cluster, promtail instead of an agent pipeline. On one node, every coordination layer is pure overhead.
Static config over discovery. Scrape targets are static_configs, the log path is a glob, dashboards and datasources are files in git. Discovery mechanisms earn their complexity when things come and go; here nothing does.
hostPath PV + Retain + explicit volumeName binding. No storage class, no provisioner, data survives helm uninstall, and storageClassName: "" on the PVC keeps it from waiting for a provisioner that doesn't exist.
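A sketch of that pairing for the 20 GiB metrics volume. The object names, the namespace, and the host path are placeholders; the fields that carry the pattern are persistentVolumeReclaimPolicy, storageClassName, and volumeName.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmsingle-data                # hypothetical name
spec:
  capacity:
    storage: 20Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain   # data outlives the claim
  storageClassName: ""                    # no class, no provisioner
  hostPath:
    path: /var/lib/vmsingle               # hypothetical path on the node
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vmsingle-data                # hypothetical name
  namespace: monitoring              # assumed namespace
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: ""               # empty string: never wait for dynamic provisioning
  volumeName: vmsingle-data          # bind to exactly this PV
  resources:
    requests:
      storage: 20Gi
```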
Bounded memory everywhere. Every component has limits; vmsingle additionally caps its caches with memory.allowedPercent: 20. On a shared node, an unbounded observability stack is the first thing to OOM the workloads it's supposed to watch.
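Expressed as raw container arguments, the vmsingle settings look something like this. In a Helm chart they would normally be set through the chart's values rather than written as args, so read this as the flags that end up on the command line, not as the chart syntax.

```yaml
# Fragment of the vmsingle container spec.
args:
  - -retentionPeriod=90d          # 90 days of metrics
  - -memory.allowedPercent=20     # caches may use at most 20% of available memory
resources:
  limits:
    memory: 512Mi                 # hard ceiling from the resource table
```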
Stable ClusterIP names. Grafana and vmalert talk to vmsingle-stable, a plain ClusterIP in front of the chart's headless Service — headless DNS hands out pod IPs that go stale across CoreDNS restarts.
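The extra Service is only a few lines. The selector below is an assumption and has to match whatever labels the chart actually puts on the vmsingle pod; the namespace is likewise a placeholder.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vmsingle-stable
  namespace: monitoring            # assumed namespace
spec:
  type: ClusterIP                  # one virtual IP that never changes
  selector:
    app.kubernetes.io/name: victoria-metrics-single   # assumption: check the pod's real labels
  ports:
    - name: http
      port: 8428
      targetPort: 8428
```

Clients then resolve vmsingle-stable to the Service's virtual IP, and kube-proxy follows the pod wherever it is rescheduled.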
What can go wrong
The stack monitors everything except itself dying. If the node goes down, vmalert goes down with it and no Telegram alert fires. The external check is Beszel's agent-disconnect alert plus blackbox probes of the public sites — but blackbox also runs on the same node. True dead-node detection needs one prober outside the box; an uptime-checker hitting status.antonnovikov.com covers it.
Grafana shows metrics but logs panels are empty. Three-step diagnosis: does VMUI at victoria-logs:9428/select/vmui return data (storage ok)? Does Promtail's /targets page show the file as active (shipping ok)? Does the dashboard's datasource UID match the provisioned one (wiring ok)? In my case it was the third — pin the UID.
Everything restarts at once after a node reboot. Nine components racing to start on two cores means liveness probes time out and pods restart in waves. Generous initialDelaySeconds on the stateful pods (vmsingle, victoria-logs, grafana) breaks the loop; the stateless ones can thrash harmlessly.
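A probe along these lines is what I mean by generous. The timings are illustrative, not the values from my manifests; /health is the health endpoint vmsingle serves on its HTTP port.

```yaml
# Fragment of the vmsingle container spec.
livenessProbe:
  httpGet:
    path: /health
    port: 8428
  initialDelaySeconds: 60   # illustrative: give the pod time to start on a busy node
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 5       # tolerate a few slow answers before restarting
```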
Summary
- Two single-binary databases — VictoriaMetrics for metrics (90d), VictoriaLogs for logs (7d) — cover both signals for 160 MiB of requested RAM between them
- The full nine-component stack requests 452 MiB / 170m CPU and coexists with real workloads on a 2-core VPS
- Promtail bridges the two worlds: Traefik JSON access logs → Loki protocol → VictoriaLogs, queryable next to the metrics in one Grafana
- Alerts flow vmalert → alertmanager → Telegram, with SSL expiry and disk-fill prediction the two that have actually paid for themselves
- Same patterns everywhere: single binary, static config, hostPath+Retain, bounded memory, stable Service names
- The remaining gap is self-monitoring — a prober outside the node