k3s with Cilium kube-proxy replacement and L2 LoadBalancer
Published: 2026-02-06
Running k3s without Flannel and without kube-proxy sounds drastic, but Cilium replaces both with eBPF programs that outperform their iptables equivalents. Combined with Cilium L2 announcements, you get a complete networking stack with no MetalLB and no external load balancer.
This post covers the full setup: k3s flags, Cilium Helm values, LBIPPool configuration, L2 announcement policies, BGP as an alternative, debugging tools, and common pitfalls.
Why replace kube-proxy
kube-proxy writes iptables rules for every Service and Endpoint in the cluster. At scale (100+ services, 1000+ endpoints), the chains become enormous, and the first packet of every new connection is matched against them rule by rule. Cilium's eBPF kube-proxy replacement does O(1) lookups via hash maps.
Benefits we see:
- Service routing doesn't degrade as the cluster grows
- conntrack table pressure is lower
- Fewer dropped connections under traffic spikes
- conntrack timeouts don't affect established eBPF-tracked connections the same way
The performance difference becomes noticeable above ~200 services. Below that, both approaches work fine. The main reason to switch early is to avoid the migration later when the cluster is large and critical.
k3s flags required
```bash
k3s server \
  --flannel-backend=none \
  --disable-kube-proxy \
  --disable servicelb \
  --disable traefik
```
--flannel-backend=none tells k3s not to install any CNI — Cilium fills that role.
--disable-kube-proxy tells k3s not to start the kube-proxy process.
--disable servicelb removes k3s's built-in LoadBalancer controller (Klipper).
--disable traefik removes the default ingress; we use APISIX deployed via Helm.
Order of operations matters: k3s starts, but the node stays NotReady until Cilium is installed. This is expected — don't wait for Ready before installing Cilium.
Cilium Helm values for k3s
```yaml
kubeProxyReplacement: true
k8sServiceHost: 192.168.1.10  # control-plane IP, not 127.0.0.1
k8sServicePort: 6443
ipam:
  mode: kubernetes
operator:
  replicas: 1
socketLB:
  enabled: true
  hostNamespaceOnly: true
nodePort:
  enabled: true
hostPort:
  enabled: true
# L2 announcements (MetalLB replacement)
l2announcements:
  enabled: true
# Hubble observability
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
```
k8sServiceHost must point to the real node IP, not 127.0.0.1. If Cilium starts without kube-proxy and tries to reach the API server via localhost, it loops. Use ansible_default_ipv4.address from your Ansible inventory.
socketLB.hostNamespaceOnly: true limits socket-level load balancing to the host namespace. This is important on nodes that run other services (like monitoring agents) that shouldn't have their traffic intercepted by Cilium's socket LB.
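The values above can be applied with Helm as soon as the k3s API server answers, while the node is still NotReady. A minimal sketch, assuming the values are saved as cilium-values.yaml (a hypothetical file name) and that KUBECONFIG points at /etc/rancher/k3s/k3s.yaml. The run helper and the DRY_RUN switch are illustrative additions so the steps can be previewed without a cluster:

```shell
# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# install_cilium: install the chart, then wait for nodes to turn Ready.
# cilium-values.yaml is an assumed file name holding the values shown above.
install_cilium() {
  run helm repo add cilium https://helm.cilium.io/
  run helm repo update
  run helm install cilium cilium/cilium --namespace kube-system --values cilium-values.yaml
  run kubectl wait --for=condition=Ready node --all --timeout=5m
}

# Preview the commands without touching the cluster:
# DRY_RUN=1 install_cilium
```

Waiting on node readiness only after the Helm install mirrors the order of operations described earlier: the node cannot become Ready before a CNI is present.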
Multi-node cluster values
For multi-node clusters, adjust:
```yaml
operator:
  replicas: 2  # HA for the Cilium operator
routingMode: native  # native routing instead of VXLAN (replaces the deprecated `tunnel: disabled`)
autoDirectNodeRoutes: true  # nodes install routes to each other's pod CIDRs
# Per-node IPAM
ipam:
  mode: kubernetes
# Enable bandwidth management (optional, requires kernel 5.1+)
bandwidthManager:
  enabled: true
```
Native routing requires that nodes can reach each other at the pod CIDR level — usually true on on-prem flat networks and on cloud providers with VPC routing.
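Native routing also needs to know which destinations can be reached without masquerading. A sketch of the extra Helm value, assuming the k3s default cluster CIDR of 10.42.0.0/16 (adjust it if the cluster was started with a different --cluster-cidr):

```yaml
# Pod traffic to this CIDR is routed natively and not masqueraded.
# 10.42.0.0/16 is the k3s default pod CIDR, an assumption for this example.
ipv4NativeRoutingCIDR: 10.42.0.0/16
```

Note that autoDirectNodeRoutes only helps when the nodes share a layer 2 segment; across routed networks the pod CIDR routes have to come from the network itself.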
LBIPPool: assigning external IPs
Cilium L2 announcements require a pool of IPs to allocate from. Create a CiliumLoadBalancerIPPool:
```yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-pool
spec:
  blocks:
    - cidr: "192.168.1.96/28"  # 16 addresses (.96 - .111) for LoadBalancer Services
```
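The CIDR must be on the same subnet as the node interfaces. Clients and routers only send ARP requests for addresses they consider on-link, so an IP from a different subnet is never asked for on the wire and Cilium's announcements never get a chance to answer; traffic for it goes to the default gateway instead.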
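A /28 always starts on a multiple of 16, so the address written in front of the slash is not necessarily the first address in the block. A small sketch in plain shell arithmetic for checking what a CIDR actually covers before putting it in a pool. The cidr_range helper is hypothetical and not part of any Cilium tooling:

```shell
# cidr_range CIDR: print the first and last IPv4 address covered by CIDR.
# The body runs in a subshell (parentheses) so helper variables stay local.
cidr_range() (
  ip=${1%/*}      # part before the slash
  bits=${1#*/}    # prefix length
  IFS=.
  set -- $ip      # split the dotted quad into $1..$4
  addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  first=$(( addr & mask ))
  last=$(( first | (4294967295 ^ mask) ))
  printf '%d.%d.%d.%d %d.%d.%d.%d\n' \
    $(( (first >> 24) & 255 )) $(( (first >> 16) & 255 )) \
    $(( (first >> 8) & 255 )) $(( first & 255 )) \
    $(( (last >> 24) & 255 )) $(( (last >> 16) & 255 )) \
    $(( (last >> 8) & 255 )) $(( last & 255 ))
)

cidr_range 192.168.1.100/28   # prints: 192.168.1.96 192.168.1.111
cidr_range 192.168.1.200/30   # prints: 192.168.1.200 192.168.1.203
```

If the pool has to start at an address that is not on a prefix boundary, an explicit start/stop range in the pool's blocks is the alternative to a CIDR.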
A pool can hold multiple blocks with different CIDRs:
```yaml
spec:
  blocks:
    - cidr: "192.168.1.96/28"   # .96 - .111; 14 usable (.97 - .110) if first/last are reserved
    - cidr: "192.168.1.200/30"  # .200 - .203; 2 usable (.201, .202), for critical services
```
To pin a specific Service to a specific IP from the pool:
```yaml
metadata:
  annotations:
    "lbipam.cilium.io/ips": "192.168.1.101"
```
L2AnnouncementPolicy: which interfaces to announce on
```yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: default-policy
spec:
  interfaces:
    - ^eth[0-9]+
  externalIPs: true
  loadBalancerIPs: true
```
A regex on interfaces means the policy applies to any eth0, eth1, etc. Without this policy no announcement happens and external IPs stay unreachable.
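Many current distributions name interfaces ens18 or enp3s0 rather than eth0, in which case ^eth[0-9]+ matches nothing and no announcement is made. A quick sketch for checking a pattern against a node's interface names before applying the policy. match_ifaces is a hypothetical helper, and grep -E only approximates the Go regular expression syntax the policy uses, which should be close enough for simple patterns like these:

```shell
# match_ifaces REGEX: read interface names on stdin, print those matching REGEX.
match_ifaces() {
  grep -E -- "$1" || true   # no match is a valid result, not an error
}

# On a node: list the real interfaces and test the policy's pattern.
# ls /sys/class/net | match_ifaces '^eth[0-9]+'
```

An empty result means the policy would select no interface on that node.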
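Without a nodeSelector, the policy applies cluster-wide and every node is a candidate announcer. For each Service, Cilium elects one leader node through a Kubernetes lease, and only that node answers ARP requests for the Service's IP. If the leader goes down, another node takes over within seconds; the failover time is governed by the l2announcements.leaseDuration Helm value rather than by a field in the policy.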
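Failover speed is a trade-off against API server load, since every lease is renewed periodically. A sketch of the Helm values involved; the numbers shown are believed to be the chart defaults, so treat them as a starting point and check them against your Cilium version:

```yaml
l2announcements:
  enabled: true
  leaseDuration: 15s       # a lease not renewed within this time is considered expired
  leaseRenewDeadline: 5s   # how often the leader renews its lease
  leaseRetryPeriod: 2s     # wait before retrying a failed renewal
```

Shorter values give faster takeover but generate more lease traffic per announced Service.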
For more control, scope the policy to specific node labels:
```yaml
spec:
  nodeSelector:
    matchLabels:
      role: edge-node
  interfaces:
    - ^eth0
  loadBalancerIPs: true
```
This restricts L2 announcements to nodes labeled role: edge-node, useful if only certain nodes are on the external network.
Verifying it works
After deploying a Service of type: LoadBalancer:
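```bash
# Check IP was assigned
kubectl get svc my-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# Verify Cilium is announcing it: each announced Service holds a lease
kubectl -n kube-system get lease | grep cilium-l2announce
# Check which node is the current leader (lease name assumes my-service is in the default namespace)
kubectl -n kube-system get lease cilium-l2announce-default-my-service -o jsonpath='{.spec.holderIdentity}'
# Confirm the IP responds from another machine on the LAN
ping 192.168.1.100
curl http://192.168.1.100/healthz
```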
If the IP is assigned but not reachable:
```bash
# Check Cilium agent logs for L2 announcement errors
# (--tail=-1 because kubectl only returns the last 10 lines per pod when a selector is used)
kubectl -n kube-system logs -l k8s-app=cilium --tail=-1 | grep -iE "l2[-_]?announce|arp"
# Check the policy is applied
kubectl get ciliuml2announcementpolicies
# Run arping from another machine on the same L2 segment (e.g. the gateway)
arping -I eth0 192.168.1.100
```
Common pitfall: rp_filter
If external traffic hits the load balancer IP but response packets are dropped, check reverse path filtering:
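```bash
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter
```
It must be 0 (off) or 2 (loose). In strict mode (1) the kernel drops a packet when the route back to its source does not leave through the interface the packet arrived on, which breaks traffic that takes an asymmetric path. The kernel enforces the higher of the all and per-interface values, so a leftover 1 on the interface overrides all=0. Set the values via sysctl in your Ansible playbook so they persist across reboots:
```yaml
- { key: net.ipv4.conf.all.rp_filter, value: "0" }
- { key: net.ipv4.conf.default.rp_filter, value: "0" }
- { key: net.ipv4.conf.eth0.rp_filter, value: "0" }  # existing interfaces do not pick up "default"
```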
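Because the kernel uses the higher of conf.all and the per-interface setting, reading a single key can be misleading. A small sketch that reports the value actually enforced for an interface; both helper names are hypothetical:

```shell
# effective_rp_filter ALL IFACE: print the value the kernel enforces,
# which is the higher of conf.all and the per-interface setting.
effective_rp_filter() {
  if [ "$1" -gt "$2" ]; then
    echo "$1"
  else
    echo "$2"
  fi
}

# check_rp_filter IFACE: read both values from /proc and report the result.
check_rp_filter() {
  all=$(cat /proc/sys/net/ipv4/conf/all/rp_filter)
  iface=$(cat "/proc/sys/net/ipv4/conf/$1/rp_filter")
  echo "$1: effective rp_filter=$(effective_rp_filter "$all" "$iface")"
}

# Usage on a node (eth0 is the interface name used in this post):
# check_rp_filter eth0
```

A result of 1 means strict mode is still in force for that interface, whichever key it came from.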
Debugging with the Cilium CLI
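Two CLIs share the debugging work: the agent's built-in CLI, run via kubectl exec inside a cilium pod (shipped as cilium-dbg in recent Cilium releases), and the standalone cilium CLI on your workstation, which provides commands such as connectivity test. Depending on your version, the agent-side commands below may need the cilium-dbg name: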
```bash
# Check Cilium status and health
cilium status --verbose
# Monitor live traffic flows (like tcpdump for Cilium)
cilium monitor --type drop  # only show dropped packets
cilium monitor --type l7    # L7 HTTP flows
# Check eBPF map for a Service
cilium service list
cilium service get <id>
# Endpoint connectivity test
cilium connectivity test
# Check BPF programs loaded
bpftool prog list | grep cilium
```
Hubble is the higher-level observability tool:
```bash
# Real-time flow view
hubble observe --follow
# Filter by namespace
hubble observe --namespace default --follow
# Show dropped flows only
hubble observe --verdict DROPPED
# HTTP flows for a specific pod
hubble observe --pod frontend --protocol HTTP
```
kube-proxy vs Cilium eBPF
| | kube-proxy | Cilium eBPF |
|---|---|---|
| Service routing | iptables DNAT chains | eBPF sockmap / hash |
| Scalability | O(n) rules | O(1) lookup |
| LoadBalancer IPs | needs MetalLB/cloud | Cilium L2 announcements |
| Observability | none | Hubble flows |
| Node port | iptables | eBPF NodePort |
| Connection tracking | kernel conntrack | eBPF CT (bypasses conntrack) |
| Network policies | not handled (left to the CNI, typically iptables) | eBPF (more expressive) |
The operational tradeoff: when something goes wrong, you debug with cilium CLI instead of iptables -L, which is more pleasant. The eBPF stack has fewer moving parts and the tooling is better.
The main reason not to switch: if your team is deeply familiar with iptables debugging and has existing tooling around it, the learning curve for eBPF-based debugging is real. Budget time for it.