k3s on Two VMs: Cilium, APISIX, and Longhorn for a Test Cluster
Published: 2026-06-21
Single-node k3s is a convenient starting point: one server, the whole cluster, no networking compromises. But the moment you need to verify workload behavior during node failure, storage replication semantics, or ingress behavior during pod drift — single-node stops being an honest test. Moving to two nodes enables exactly those scenarios at minimal cost.
This post covers the key decisions when doing this migration, using the same stack already running on production: Cilium as CNI and load balancer (no MetalLB), APISIX as ingress, Longhorn as distributed storage.
Why Single-Node Becomes Insufficient
Single-node k3s is fine for:
- Development and local configuration validation
- CI/CD pipelines where state doesn't matter
- Deploying stateless applications
It breaks the test as soon as you need to:
- Verify Deployment behavior when a node is killed
- Test ReadWriteMany storage (a PVC mounted on two pods simultaneously)
- Confirm that Cilium L2 correctly switches ARP when a pod migrates
- Test PodDisruptionBudgets under real conditions
Two nodes cover all of these scenarios while keeping the same stack as production.
Cluster Topology
For a test cluster, the optimal setup is 1 server + 1 agent:
```text
┌─────────────────────────────────────────────────────────┐
│  VM1 (server)                     VM2 (agent)           │
│  192.168.1.10                     192.168.1.11          │
│                                                         │
│  ┌─────────────────┐              ┌─────────────────┐   │
│  │ kube-apiserver  │              │                 │   │
│  │ kube-scheduler  │◄────────────►│ kubelet         │   │
│  │ kube-controller │   6443/TCP   │ Cilium agent    │   │
│  │ SQLite (kine)   │              │ (kube-proxy     │   │
│  │ kubelet         │              │  replacement)   │   │
│  │ Cilium agent    │              │                 │   │
│  └─────────────────┘              └─────────────────┘   │
│                                                         │
│  ← - - - - - - - - L2 network / same subnet - - - - →   │
└─────────────────────────────────────────────────────────┘
```
The server node acts as both control plane and worker. The agent is worker-only. For a test cluster this is fine — on production you'd separate them with taints.
Network requirements:
- Both nodes on the same L2 network (required for Cilium L2 Announcements)
- Port 6443/TCP open from agent to server
- Port 4240/TCP (Cilium health check) open between nodes
- Port 8472/UDP (VXLAN, Cilium's default tunnel mode) open between nodes
- For Longhorn: port 9500/TCP between nodes
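The TCP requirements can be spot-checked from either node with bash's built-in /dev/tcp redirection. This is only a rough sketch: check_port is a hypothetical helper name, and it reports whether something accepted the connection, so 4240 and 9500 will read as closed until Cilium and Longhorn are actually listening.

```bash
# check_port HOST PORT: print "HOST:PORT open" or "HOST:PORT closed".
# Hypothetical helper; relies on bash's /dev/tcp and coreutils' timeout.
check_port() {
  local host="$1" port="$2"
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
    return 1
  fi
}

# Example, run on VM2 once the server is installed:
#   check_port 192.168.1.10 6443
```

A closed result before the listening component exists is expected; the check is most useful for 6443 right before joining the agent.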
Installing k3s
Node Preparation
On both nodes before installation:
```bash
# Disable swap: mandatory for kubelet
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab

# Load required kernel modules
modprobe overlay
modprobe br_netfilter
cat <<EOF > /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

# sysctl for Kubernetes networking
cat <<EOF > /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sysctl --system
```
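rp_filter=0 is required for Cilium L2 Announcements. In strict mode (1), the kernel drops an incoming packet when the route back to its source address would not use the interface the packet arrived on, which can silently discard load balancer traffic. The kernel enforces the higher of conf.all and the per-interface value, and conf.default only applies to interfaces created later, so an existing interface that already has rp_filter=1 keeps it until you reset it explicitly.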
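Because the kernel uses the larger of the "all" value and the per-interface value, reading a single sysctl can be misleading. A small sketch of how the effective setting could be computed follows; effective_rp_filter is a hypothetical helper, and the optional directory argument exists only so it can be pointed at a test tree.

```bash
# effective_rp_filter IFACE [CONF_DIR]
# Prints the value the kernel enforces for IFACE: the larger of
# conf/all/rp_filter and conf/IFACE/rp_filter.
# CONF_DIR defaults to the real sysctl tree (assumption of this sketch).
effective_rp_filter() {
  local iface="$1" dir="${2:-/proc/sys/net/ipv4/conf}"
  local all_val if_val
  all_val="$(cat "${dir}/all/rp_filter")"
  if_val="$(cat "${dir}/${iface}/rp_filter")"
  if [ "$all_val" -gt "$if_val" ]; then
    echo "$all_val"
  else
    echo "$if_val"
  fi
}

# Example (eth0 is the interface name used in this post):
#   effective_rp_filter eth0
```

Anything other than 0 here means reverse-path filtering is still active on that interface.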
Installing the Server Node
k3s starts without Flannel and without kube-proxy — both are replaced by Cilium:
```bash
curl -sfL https://get.k3s.io | sh -s - server \
  --flannel-backend=none \
  --disable-kube-proxy \
  --disable=traefik \
  --disable=servicelb \
  --node-ip=192.168.1.10 \
  --advertise-address=192.168.1.10
```
Flags:
- --flannel-backend=none: don't install Flannel CNI; Cilium takes this role
- --disable-kube-proxy: Cilium replaces kube-proxy via eBPF
- --disable=traefik: remove the default ingress; APISIX is installed instead
- --disable=servicelb: remove the built-in Klipper LB; Cilium announces LoadBalancer IPs
- --node-ip and --advertise-address: explicitly set the IP, otherwise k3s may pick the wrong interface
Use config.yaml instead of CLI flags — easier to update:
```yaml
# /etc/rancher/k3s/config.yaml (on server node)
flannel-backend: "none"
disable-kube-proxy: true
disable:
  - traefik
  - servicelb
node-ip: "192.168.1.10"
advertise-address: "192.168.1.10"
```
After starting, the node will be in NotReady — expected until Cilium is installed:
```bash
k3s kubectl get nodes
# NAME   STATUS     ROLES                  AGE   VERSION
# vm1    NotReady   control-plane,master   30s   v1.32.x+k3s1
```
Do not wait for Ready before installing Cilium — Cilium will transition the node to Ready after it starts.
Get Token and kubeconfig
```bash
# Token for joining the agent node
cat /var/lib/rancher/k3s/server/node-token

# kubeconfig for managing from a local machine
cat /etc/rancher/k3s/k3s.yaml | sed 's/127.0.0.1/192.168.1.10/' > ~/k3s-test.yaml
export KUBECONFIG=~/k3s-test.yaml
```
Install Cilium (Before Joining Agent)
Cilium must be installed before joining the agent node — otherwise the agent will hang in NotReady:
```bash
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=192.168.1.10 \
  --set k8sServicePort=6443 \
  --set ipam.mode=kubernetes \
  --set operator.replicas=1 \
  --set socketLB.enabled=true \
  --set socketLB.hostNamespaceOnly=true \
  --set nodePort.enabled=true \
  --set hostPort.enabled=true \
  --set l2announcements.enabled=true \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true
```
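k8sServiceHost must be the real IP of the server node, not 127.0.0.1. With kube-proxy disabled, nothing translates the in-cluster kubernetes Service address until Cilium itself is running, so Cilium needs an address it can reach directly. 127.0.0.1:6443 only reaches the API server on VM1; on the agent node it points at the agent itself.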
operator.replicas=1 — for a two-node cluster, so the Cilium operator doesn't wait for a second node for HA.
After installation, the server node transitions to Ready:
```bash
kubectl get nodes
# NAME   STATUS   ROLES                  AGE   VERSION
# vm1    Ready    control-plane,master   3m    v1.32.x+k3s1

cilium status
# KubeProxyReplacement: True
# Cilium: 1/1 agents running (only VM1 exists at this point)
```
Joining the Agent Node
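```bash
# On VM2. Paste the token printed on VM1: the node-token file exists only on the server.
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.1.10:6443 \
  K3S_TOKEN="<contents of /var/lib/rancher/k3s/server/node-token from VM1>" \
  sh -s - agent \
  --node-ip=192.168.1.11
```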
Or via config.yaml:
```yaml
# /etc/rancher/k3s/config.yaml (on agent node)
server: "https://192.168.1.10:6443"
token: "K107abc...::server:abc123token..."
node-ip: "192.168.1.11"
```
After joining, Cilium automatically deploys an agent on VM2:
```bash
kubectl get nodes
# NAME   STATUS   ROLES                  AGE   VERSION
# vm1    Ready    control-plane,master   5m    v1.32.x+k3s1
# vm2    Ready    <none>                 30s   v1.32.x+k3s1

# The agent pods carry the k8s-app=cilium label; -o wide adds the NODE column
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
# NAME           READY   STATUS    RESTARTS   ...   NODE
# cilium-xxxxx   1/1     Running   0          ...   vm1
# cilium-yyyyy   1/1     Running   0          ...   vm2
```
Load Balancer: Cilium L2 Announcements
Since CNI is already Cilium, no separate MetalLB is needed. Cilium announces LoadBalancer IPs via ARP on the L2 network.
IP Address Pool
```yaml
# cilium-lb.yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-pool
spec:
  blocks:
    # An explicit range: 192.168.1.100/28 would not start at .100,
    # because a /28 is an aligned block (.96 to .111)
    - start: "192.168.1.101"   # 14 addresses: .101 to .114
      stop: "192.168.1.114"
---
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: default-policy
spec:
  interfaces:
    - ^eth[0-9]+   # all eth interfaces
  loadBalancerIPs: true
  externalIPs: true
```

```bash
kubectl apply -f cilium-lb.yaml
```
The pool addresses must be in the same subnet as the node interfaces. Otherwise LAN clients never send ARP requests for them and forward the traffic to their default gateway instead. Choose a range outside your router's DHCP pool.
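If you give the pool a CIDR rather than an explicit range, it helps to check which addresses the prefix really covers: under standard CIDR arithmetic a /28 is an aligned block of 16 addresses, so 192.168.1.100/28 is the block containing .100, not a range that starts there. The sketch below does that arithmetic in plain bash; cidr_range is a hypothetical helper, and how Cilium treats the first and last address of a block depends on the version and pool settings.

```bash
# cidr_range CIDR: print "first-last" of the aligned block containing the address.
cidr_range() {
  local ip="${1%/*}" prefix="${1#*/}"
  local a b c d
  IFS=. read -r a b c d <<< "$ip"
  local addr=$(( (a << 24) | (b << 16) | (c << 8) | d ))
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  local first=$(( addr & mask ))
  local last=$(( first | (~mask & 0xFFFFFFFF) ))
  printf '%d.%d.%d.%d-%d.%d.%d.%d\n' \
    $(( first >> 24 & 255 )) $(( first >> 16 & 255 )) $(( first >> 8 & 255 )) $(( first & 255 )) \
    $(( last >> 24 & 255 )) $(( last >> 16 & 255 )) $(( last >> 8 & 255 )) $(( last & 255 ))
}

cidr_range 192.168.1.100/28   # prints 192.168.1.96-192.168.1.111
```

Any address already used by a node, the router, or DHCP inside that printed range is a potential conflict.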
Pin a Specific IP to a Service
```yaml
metadata:
  annotations:
    "lbipam.cilium.io/ips": "192.168.1.101"
```
Verify
```bash
# Test service
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer
kubectl get svc nginx
# NAME    TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)
# nginx   LoadBalancer   10.43.12.5   192.168.1.101   80:30080/TCP

# Cilium announces the IP via ARP
cilium l2announce list

# Reachable from any machine on LAN
curl http://192.168.1.101/
```
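Cilium elects one leader node to respond to ARP for each IP. If that node goes down, another node takes over once the leader's lease expires. The timing is set by the l2announcements.leaseDuration, leaseRenewDeadline, and leaseRetryPeriod Helm values (not by the policy object), so failover takes roughly one lease duration.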
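One way to see which node currently answers ARP for a Service is to read the Kubernetes lease used for that election. The sketch below assumes Cilium's lease naming of cilium-l2announce-<namespace>-<service> in kube-system, which is worth verifying against your Cilium version; l2_leader is a hypothetical helper name.

```bash
# l2_leader NAMESPACE SERVICE: print the node holding the announcement
# lease for a LoadBalancer Service.
# Assumes leases named "cilium-l2announce-<namespace>-<service>" in kube-system.
l2_leader() {
  local ns="$1" svc="$2"
  kubectl -n kube-system get lease "cilium-l2announce-${ns}-${svc}" \
    -o jsonpath='{.spec.holderIdentity}'
}

# Example: l2_leader default nginx
```

Running it before and after shutting down the leader node shows when the announcement has moved.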
Why Not MetalLB
MetalLB works with any CNI and is a solid standalone component. But when CNI is already Cilium — MetalLB duplicates functionality, adds another Helm chart, and another set of CRDs. Cilium L2 Announcements solves the same problem within the already-installed component, with Hubble providing traffic observability on top.
Ingress: APISIX
k3s runs with --disable=traefik. APISIX Ingress Controller replaces it — the same one running in production.
```bash
helm repo add apisix https://charts.apiseven.com
helm install apisix apisix/apisix \
  --namespace apisix \
  --create-namespace \
  --set service.type=LoadBalancer \
  --set ingress-controller.enabled=true \
  --set ingress-controller.config.apisix.serviceNamespace=apisix
```
service.type=LoadBalancer — APISIX immediately requests an external IP from the Cilium pool.
After installation:
```bash
kubectl -n apisix get svc apisix-gateway
# NAME             TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)
# apisix-gateway   LoadBalancer   10.43.5.10   192.168.1.102   80:30080/TCP,443:30443/TCP
```
Example ApisixRoute:
```yaml
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: my-app
  namespace: default
spec:
  http:
    - name: main
      match:
        hosts:
          - myapp.example.com
        paths:
          - "/*"
      backends:
        - serviceName: my-app
          servicePort: 8080
```
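A route like this can be exercised without touching DNS by sending the Host header straight to the gateway's LoadBalancer IP. A minimal sketch, assuming the gateway IP shown in the service output above; route_status is a hypothetical helper name.

```bash
# route_status GATEWAY_IP HOSTNAME [PATH]: print the HTTP status code the
# gateway returns for the given Host header.
route_status() {
  local gw="$1" host="$2" path="${3:-/}"
  curl -s -o /dev/null -w '%{http_code}' -H "Host: ${host}" "http://${gw}${path}"
}

# Example:
#   route_status 192.168.1.102 myapp.example.com
```

A 404 from the gateway usually means no route matched the host, while a 5xx points at the backend Service.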
StorageClass: Longhorn
The Problem with local-path in Multi-Node
k3s installs local-path StorageClass by default. On single-node — it works. On two nodes — it creates an invisible trap.
local-path creates PVs on the node where the pod was scheduled. If the pod moves to another node — the PV stays on the first one, and the pod hangs:
```text
Events:
  Warning  FailedScheduling  pod/my-app  0/2 nodes are available:
    1 node(s) had volume node affinity conflict.
```
This never surfaces on single-node. On two nodes, it appears on any kubectl drain or automatic reschedule.
Keep local-path only for truly ephemeral data that can be recreated from scratch.
Longhorn: Distributed Block Storage
Longhorn stores data on the local disks of both nodes and replicates between them:
```text
┌──────────────────────┐     ┌──────────────────────┐
│  VM1 (server)        │     │  VM2 (agent)         │
│                      │     │                      │
│  /var/lib/longhorn   │◄───►│  /var/lib/longhorn   │
│  Replica A           │     │  Replica B           │
│                      │     │                      │
└──────────────────────┘     └──────────────────────┘
```
If VM2 goes down, the data remains on VM1 and a pod running on VM1 keeps working against the single remaining replica; a pod that was running on VM2 has to be rescheduled first. When VM2 recovers, Longhorn automatically rebuilds the second replica.
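A gentle way to rehearse this is to drain a node rather than power it off. The following is only an illustrative sketch (failover_drill is a hypothetical name, and a drain is a softer event than a real crash): it drains the node, lists the Longhorn volumes so their state can be inspected, then returns the node to service.

```bash
# failover_drill NODE: drain NODE, show Longhorn volume state, then uncordon.
# Function name and flow are illustrative, not a Longhorn-provided procedure.
failover_drill() {
  local node="$1"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s
  # Inspect volume state and robustness while the node is drained
  kubectl -n longhorn-system get volumes.longhorn.io
  kubectl uncordon "$node"
}

# Example: failover_drill vm2
```

For a harsher test, stop the VM itself and watch the same volume listing from the surviving node.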
Prepare Nodes
```bash
# On both nodes
apt install open-iscsi util-linux nfs-common
modprobe iscsi_tcp
echo 'iscsi_tcp' >> /etc/modules-load.d/iscsi.conf
systemctl enable --now iscsid
```
Install
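```bash
helm repo add longhorn https://charts.longhorn.io
# persistence.defaultClassReplicaCount sets numberOfReplicas on the "longhorn"
# StorageClass; the chart default of 3 cannot be satisfied on two nodes
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --create-namespace \
  --set persistence.defaultClassReplicaCount=2 \
  --set defaultSettings.defaultReplicaCount=2 \
  --set defaultSettings.storageMinimalAvailablePercentage=10
```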
Verify:
```bash
kubectl -n longhorn-system get node.longhorn.io
# NAME   READY   ALLOWSCHEDULING   SCHEDULABLE   AGE
# vm1    True    True              True          5m
# vm2    True    True              True          4m

kubectl get storageclass
# NAME                 PROVISIONER             ...
# longhorn (default)   driver.longhorn.io      ...
# local-path           rancher.io/local-path   ...
```
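Volumes provisioned through a PVC take their replica count from the StorageClass parameters, so on a two-node cluster it is worth reading that value back. A small sketch, assuming the class is named longhorn as in the listing above; longhorn_sc_replicas is a hypothetical helper name.

```bash
# longhorn_sc_replicas [STORAGECLASS]: print numberOfReplicas from the
# StorageClass parameters (defaults to the class named "longhorn").
longhorn_sc_replicas() {
  local sc="${1:-longhorn}"
  kubectl get storageclass "$sc" -o jsonpath='{.parameters.numberOfReplicas}'
}

# Example: longhorn_sc_replicas
```

A value higher than the number of nodes leaves every new volume permanently degraded.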
Usage
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
```
Longhorn UI via port-forward or ApisixRoute:
```bash
kubectl -n longhorn-system port-forward svc/longhorn-frontend 8080:80
# http://localhost:8080 shows volumes, replicas, node state
```
If You Need ReadWriteMany
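Longhorn volumes are ReadWriteOnce by default. Longhorn can also serve ReadWriteMany by exporting a volume over NFS from a share-manager pod, which is what the nfs-common package installed earlier is for. The alternative used here is an NFS Provisioner on top of a plain NFS server: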
```bash
# NFS server on VM1
apt install nfs-kernel-server
mkdir -p /srv/nfs/k8s
echo "/srv/nfs/k8s 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)" >> /etc/exports
exportfs -ra
systemctl enable --now nfs-kernel-server

# NFS client on VM2
apt install nfs-common

# Provisioner in cluster
helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace nfs-provisioner \
  --create-namespace \
  --set nfs.server=192.168.1.10 \
  --set nfs.path=/srv/nfs/k8s \
  --set storageClass.name=nfs \
  --set storageClass.defaultClass=false
```
NFS doesn't replicate data — it's not a replacement for Longhorn for stateful workloads. Use NFS only where RWX is genuinely required: shared uploads, config shared across replicas.
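A claim against the nfs class looks the same as the Longhorn one except for the access mode and class name. The sketch below prints such a manifest from a small shell function; rwx_pvc, the claim name, and the size are placeholders chosen for illustration.

```bash
# rwx_pvc NAME SIZE: print a PVC manifest requesting ReadWriteMany from the
# "nfs" StorageClass. NAME and SIZE are placeholders.
rwx_pvc() {
  local name="$1" size="$2"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
spec:
  accessModes: [ReadWriteMany]
  storageClassName: nfs
  resources:
    requests:
      storage: ${size}
EOF
}

# Example: rwx_pvc shared-uploads 5Gi | kubectl apply -f -
```

Pods on both nodes can then mount the same claim, which is exactly the case single-node testing cannot cover.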
Final Stack
| Component | Choice |
|---|---|
| k3s | server + agent, --flannel-backend=none --disable-kube-proxy |
| CNI | Cilium (kube-proxy replacement) |
| Load Balancer | Cilium L2 Announcements + CiliumLoadBalancerIPPool |
| Ingress | APISIX Ingress Controller |
| StorageClass / block | Longhorn (2 replicas) |
| StorageClass / shared | NFS Provisioner (if RWX is needed) |
| Network observability | Hubble (built into Cilium) |
Deployment order:
```bash
# 1. Prepare both nodes (sysctl, swap, kernel modules, open-iscsi)
# 2. Install k3s server (with config.yaml)
# 3. Install Cilium via Helm
# 4. Join k3s agent
# 5. Create CiliumLoadBalancerIPPool and CiliumL2AnnouncementPolicy
# 6. Install Longhorn
# 7. Install APISIX
# 8. Verify: kubectl get nodes, cilium status, kubectl get storageclass
```
What Can Go Wrong
Node Stays NotReady After Agent Joins
Cilium didn't come up on the agent node. Check:
```bash
# Cilium agent pods are labelled k8s-app=cilium
kubectl -n kube-system logs -l k8s-app=cilium --tail=50
kubectl describe node vm2 | grep -A10 Conditions
```
Common cause: Cilium on the server node wasn't fully Ready when the agent joined. Wait for Cilium to be fully running on the server before running the k3s-agent install.
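Waiting can be made explicit with kubectl rollout status instead of watching pod lists by eye. This is a sketch under the assumption that the Helm release created a DaemonSet named cilium and a Deployment named cilium-operator in kube-system; wait_for_cilium is a hypothetical helper name.

```bash
# wait_for_cilium [TIMEOUT]: block until the Cilium DaemonSet and operator
# report a completed rollout; returns non-zero if either times out.
# Assumes objects named "cilium" and "cilium-operator" in kube-system.
wait_for_cilium() {
  local timeout="${1:-5m}"
  kubectl -n kube-system rollout status daemonset/cilium --timeout="$timeout" &&
    kubectl -n kube-system rollout status deployment/cilium-operator --timeout="$timeout"
}

# Example, run wherever KUBECONFIG points at the cluster:
#   wait_for_cilium 5m && echo "safe to join the agent"
```

Chaining the agent install behind this check removes the race between the two steps.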
Cilium L2: IP Assigned but Unreachable from LAN
```bash
# Check that L2 policy is applied
kubectl get ciliuml2announcementpolicies

# Check active announcements
cilium l2announce list

# Check rp_filter: must be 0
sysctl net.ipv4.conf.eth0.rp_filter
```
If rp_filter=1 — response packets are dropped. Fix:
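```bash
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.default.rp_filter=0
# The kernel enforces the higher of "all" and the per-interface value,
# so reset the interface itself as well
sysctl -w net.ipv4.conf.eth0.rp_filter=0
```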
Persist in /etc/sysctl.d/99-kubernetes.conf so it survives reboots.
APISIX Not Getting IP from Cilium
```bash
kubectl -n apisix get svc apisix-gateway
# EXTERNAL-IP: <pending>

kubectl -n apisix describe svc apisix-gateway

# Check if pool is exhausted
kubectl get ciliumloadbalancerippool default-pool -o yaml | grep -A5 status
```
Common cause: pool exhausted or the pool CIDR is not in the same subnet as the nodes.
Longhorn Replica Not Creating on VM2
```bash
kubectl -n longhorn-system get volume
# STATE: degraded

kubectl -n longhorn-system get node.longhorn.io vm2 -o yaml | grep -A5 conditions
systemctl status iscsid
df -h /var/lib/longhorn
```
Causes: iscsid not running, insufficient disk space, or node marked allowScheduling: false in Longhorn.
Pod Hangs When Moving to Another Node (local-path)
Expected behavior: the PV is bound to the original node via node affinity. The storage class of an existing PVC cannot be changed, so create a new PVC with storageClassName: longhorn and copy the data over.
```bash
# Inspect the node affinity on an existing PV
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'
```
Summary
- k3s server starts with --flannel-backend=none --disable-kube-proxy --disable=traefik,servicelb
- Cilium must be installed before joining the agent, otherwise the agent hangs in NotReady
- k8sServiceHost in Cilium must be the real node IP, not 127.0.0.1
- rp_filter=0 is mandatory for L2 Announcements; without it, LB response packets are silently dropped
- Cilium L2 Announcements replaces MetalLB when Cilium is the CNI, so no extra Helm chart is needed
- APISIX gets a LoadBalancer IP from the Cilium pool automatically
- local-path in multi-node breaks pods when they move to another node; use Longhorn instead
- Longhorn with 2 replicas survives single-node loss without data loss