k3s with Cilium kube-proxy replacement and L2 LoadBalancer

Published: 2026-02-06

Running k3s without Flannel and without kube-proxy sounds drastic, but Cilium replaces both: it acts as the CNI and handles Service load balancing in eBPF, which scales better than kube-proxy's iptables rules. Combined with Cilium L2 announcements, you get a complete networking stack with no MetalLB and no external load balancer.

This post covers the full setup: k3s flags, Cilium Helm values, LBIPPool configuration, L2 announcement policies, BGP as an alternative, debugging tools, and common pitfalls.


Why replace kube-proxy

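kube-proxy in its default iptables mode writes rules for every Service and endpoint in the cluster. At scale (100+ services, 1000+ endpoints) the chains become very long, and the first packet of each new connection is matched against them sequentially, so lookup cost grows with the number of Services. Rule updates also get slower as the ruleset grows. Cilium's eBPF kube-proxy replacement looks Services up in hash maps, which is O(1) on average.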

Benefits we see:

  • Service routing doesn't degrade as the cluster grows
  • conntrack table pressure is lower
  • Fewer dropped connections under traffic spikes
  • conntrack timeouts don't affect established eBPF-tracked connections the same way

The performance difference becomes noticeable above ~200 services. Below that, both approaches work fine. The main reason to switch early is to avoid the migration later when the cluster is large and critical.


k3s flags required

```bash
k3s server \
  --flannel-backend=none \
  --disable-kube-proxy \
  --disable servicelb \
  --disable traefik
```

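--flannel-backend=none tells k3s not to install its bundled CNI (Flannel); Cilium fills that role. --disable-kube-proxy tells k3s not to run kube-proxy. --disable servicelb removes k3s's built-in LoadBalancer controller (Klipper). --disable traefik removes the default ingress controller; we use APISIX deployed via Helm. Cilium's own k3s instructions also pass --disable-network-policy, which turns off k3s's embedded network policy controller so that only Cilium enforces NetworkPolicy; consider adding it.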

Order of operations matters: k3s starts, but the node stays NotReady until Cilium is installed. This is expected — don't wait for Ready before installing Cilium.
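
The ordering above can be sketched as a small bootstrap script: install Cilium first, and only then wait for the node to become Ready. This is a minimal sketch, not the exact procedure used here. The function names `run` and `bootstrap_cilium` and the file name `cilium-values.yaml` are assumptions for illustration; the kubeconfig path is the k3s default.

```bash
#!/usr/bin/env bash
# Sketch: install Cilium right after k3s starts, then wait for Ready.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

bootstrap_cilium() {
  # Hypothetical default file name; pass your own values file as $1.
  local values_file=${1:-cilium-values.yaml}
  export KUBECONFIG=${KUBECONFIG:-/etc/rancher/k3s/k3s.yaml}

  run helm repo add cilium https://helm.cilium.io/ || return 1
  run helm repo update || return 1
  run helm upgrade --install cilium cilium/cilium \
    --namespace kube-system --values "$values_file" || return 1

  # Waiting for Ready only makes sense once the CNI is installed.
  run kubectl wait --for=condition=Ready node --all --timeout=300s
}
```

Calling `DRY_RUN=1 bootstrap_cilium my-values.yaml` prints the four commands in order without touching the cluster, which is a cheap way to review the sequence before running it for real.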


Cilium Helm values for k3s

```yaml
kubeProxyReplacement: true
k8sServiceHost: 192.168.1.10   # control-plane IP, not 127.0.0.1
k8sServicePort: 6443

ipam:
  mode: kubernetes

operator:
  replicas: 1

socketLB:
  enabled: true
  hostNamespaceOnly: true

nodePort:
  enabled: true

hostPort:
  enabled: true

# L2 announcements (MetalLB replacement)
l2announcements:
  enabled: true

# Hubble observability
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
```

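k8sServiceHost must point to an address where the API server is reachable before any Service routing exists, which in practice means the real control-plane node IP, not 127.0.0.1. With kube-proxy gone, nothing translates the kubernetes Service ClusterIP until Cilium itself is running, and on a node that does not run the API server a loopback address leads nowhere, so the agent cannot bootstrap and keeps restarting. If you template the values with Ansible, the ansible_default_ipv4.address fact of the control-plane host is a convenient source for this IP.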
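
Outside Ansible, the same address can be derived from the kernel's routing decision and checked before it goes into the Helm values. The sketch below is illustrative only: `default_ipv4_from_route` and `is_valid_service_host` are hypothetical helper names, and the first one simply parses the `src` field from `ip -4 route get` output, which is roughly what the Ansible fact reports.

```bash
#!/usr/bin/env bash
# Sketch: find the node's primary IPv4 address and reject loopback values.

# Reads `ip -4 route get <dst>` output on stdin and prints the "src" address.
default_ipv4_from_route() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Returns non-zero for values that must not be used as k8sServiceHost.
is_valid_service_host() {
  case "$1" in
    ""|localhost|127.*|::1) return 1 ;;
    *) return 0 ;;
  esac
}
```

On the control-plane node, `host=$(ip -4 route get 1.1.1.1 | default_ipv4_from_route)` followed by `is_valid_service_host "$host"` gives a value that can be passed to Helm with `--set k8sServiceHost="$host"`.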

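socketLB.hostNamespaceOnly: true restricts socket-level load balancing to the host network namespace; pod traffic is then load-balanced per packet at the pod's network interface instead. This matters when something inside a pod needs to see the original Service address before translation, for example a service mesh sidecar, because socket LB would otherwise rewrite the destination at connect() time.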

Multi-node cluster values

For multi-node clusters, adjust:

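```yaml
operator:
  replicas: 2          # HA for the Cilium operator

# Use native routing instead of VXLAN for better performance.
# Older releases spelled this `tunnel: disabled`; current ones use routingMode.
routingMode: native
ipv4NativeRoutingCIDR: 10.42.0.0/16   # assumes the k3s default pod CIDR; adjust to yours
autoDirectNodeRoutes: true            # Nodes route to each other directly (same L2 segment only)

# Per-node IPAM
ipam:
  mode: kubernetes

# Enable bandwidth management (optional, requires kernel 5.1+)
bandwidthManager:
  enabled: true
```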

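Native routing requires that the network can deliver packets addressed to pod IPs between nodes. autoDirectNodeRoutes covers the case where all nodes share one L2 segment, which is usually true on on-prem flat networks. On routed networks or cloud VPCs, the pod CIDR routes have to come from the network itself (VPC route tables, BGP, or static routes).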


LBIPPool: assigning external IPs

Cilium L2 announcements require a pool of IPs to allocate from. Create a CiliumLoadBalancerIPPool:

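```yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-pool
spec:
  blocks:
    # 16 addresses for LoadBalancer Services. An explicit range is used because
    # 192.168.1.100 is not a /28 boundary (the enclosing /28 is .96 - .111).
    - start: "192.168.1.100"
      stop: "192.168.1.115"
```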

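The pool addresses must be in the same subnet as the node interfaces. Clients only send ARP requests for addresses they consider on-link; for an address in any other subnet they send traffic to their gateway instead, so nobody ever asks for Cilium's ARP replies and the IP stays unreachable unless the router has a route for it.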
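
A CIDR block always starts on its prefix boundary, so it is worth checking which addresses a block really covers before putting it in a pool. The helper below is a minimal pure-bash sketch (the function names are made up for this post) that prints the first and last address of an IPv4 block.

```bash
#!/usr/bin/env bash
# Sketch: print the first and last address covered by an IPv4 CIDR block.

ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

int_to_ip() {
  local n=$1
  echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

cidr_range() {
  local addr=${1%/*} prefix=${1#*/}
  local ip mask first last
  ip=$(ip_to_int "$addr")
  # Network mask for the prefix length, kept to 32 bits.
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  first=$(( ip & mask ))
  last=$(( first | (~mask & 0xFFFFFFFF) ))
  echo "$(int_to_ip "$first") $(int_to_ip "$last")"
}
```

For example, `cidr_range 192.168.1.100/28` prints `192.168.1.96 192.168.1.111`: the block is aligned down to .96, so it does not run from .100 to .115. `cidr_range 192.168.1.200/30` prints `192.168.1.200 192.168.1.203`.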

A pool can list several blocks, and you can also create several pools:

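```yaml
spec:
  blocks:
    - start: "192.168.1.100"     # 16 addresses: .100 - .115
      stop: "192.168.1.115"
    # .200 - .203 (for critical services); whether the first and last address
    # are handed out depends on the pool's allowFirstLastIPs setting
    - cidr: "192.168.1.200/30"
```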

To pin a specific Service to a specific IP from the pool:

```yaml
metadata:
  annotations:
    "lbipam.cilium.io/ips": "192.168.1.101"
```

L2AnnouncementPolicy: which interfaces to announce on

```yaml
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: default-policy
spec:
  interfaces:
    - ^eth[0-9]+
  externalIPs: true
  loadBalancerIPs: true
```

A regex on interfaces means the policy applies to any eth0, eth1, etc. Without this policy no announcement happens and external IPs stay unreachable.
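
A regex that matches no interface has the same effect as having no policy, and many distributions name NICs ens18 or enp3s0 rather than eth0. The sketch below checks a pattern against a list of interface names before you commit it to the policy. `match_ifaces` is a hypothetical helper, and it uses bash's regex engine rather than the Go engine Cilium uses, so treat it as a sanity check for simple patterns only.

```bash
#!/usr/bin/env bash
# Sketch: list which interface names a policy regex would match.

match_ifaces() {
  local re=$1 name out=""
  shift
  for name in "$@"; do
    if [[ $name =~ $re ]]; then
      out="${out:+$out }$name"
    fi
  done
  echo "$out"
}
```

On a node, `match_ifaces '^eth[0-9]+' $(ls /sys/class/net)` prints the matching interfaces; an empty result means the policy would never announce on that node.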

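The policy above has no nodeSelector, so every node is a candidate to announce. For each Service, Cilium elects one leader node through a Kubernetes lease, and only that node answers ARP requests for the Service's IPs. If the leader goes down, another node takes over once the lease expires. The timing is set by the l2announcements.leaseDuration, leaseRenewDeadline, and leaseRetryPeriod Helm values (15s, 5s, and 2s by default), not by the policy itself.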

For more control, scope the policy to specific node labels:

```yaml
spec:
  nodeSelector:
    matchLabels:
      role: edge-node
  interfaces:
    - ^eth0
  loadBalancerIPs: true
```

This restricts L2 announcements to nodes labeled role: edge-node, useful if only certain nodes are on the external network.


Verifying it works

After deploying a Service of type: LoadBalancer:

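```bash
# Check IP was assigned
kubectl get svc my-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Verify Cilium holds an announcement lease for the Service
# (leases are named cilium-l2announce-<namespace>-<service>)
kubectl -n kube-system get lease | grep cilium-l2announce

# Check which node is the current leader (Service in the default namespace)
kubectl -n kube-system get lease cilium-l2announce-default-my-service \
  -o jsonpath='{.spec.holderIdentity}'

# Confirm the IP responds from another machine on the LAN.
# curl is the real test; ICMP to a LoadBalancer IP may go unanswered.
ping -c 3 192.168.1.100
curl http://192.168.1.100/healthz
```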
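
The address is assigned asynchronously, so a script that checks immediately after creating the Service can see an empty field. A minimal polling sketch follows; `wait_for_lb_ip` and the `POLL_INTERVAL` variable are names invented for this example, and the jsonpath is the same one used above.

```bash
#!/usr/bin/env bash
# Sketch: poll until a LoadBalancer Service has an IP, or give up.

wait_for_lb_ip() {
  local svc=$1 ns=${2:-default} tries=${3:-30} ip
  while [ "$tries" -gt 0 ]; do
    ip=$(kubectl -n "$ns" get svc "$svc" \
      -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null) || ip=""
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    tries=$((tries - 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "no LoadBalancer IP assigned to $ns/$svc" >&2
  return 1
}
```

Usage: `ip=$(wait_for_lb_ip my-service default 30) && curl "http://$ip/healthz"`. If the function times out, the pool is usually exhausted or no pool matches the Service.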

If the IP is assigned but not reachable:

```bash
# Check Cilium agent logs for L2 announcement errors
# (--tail=-1 because kubectl shows only 10 lines per pod when a selector is used)
kubectl -n kube-system logs -l k8s-app=cilium --tail=-1 | grep -iE "l2-?announce|arp"

# Check the policy is applied
kubectl get ciliuml2announcementpolicies

# Send ARP requests from the gateway or another host on the same L2 segment
arping -I eth0 192.168.1.100
```

Common pitfall: rp_filter

If external traffic hits the load balancer IP but response packets are dropped, check reverse path filtering:

```bash
sysctl net.ipv4.conf.eth0.rp_filter
```

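The effective setting must be 0 (off) or 2 (loose). In strict mode (1) the kernel, not Cilium, drops packets whose source address would not be routed back out of the interface they arrived on, which is what happens when load-balanced traffic takes an asymmetric path. The kernel applies the higher of net.ipv4.conf.all.rp_filter and the per-interface value, so check both. Set it via sysctl in your Ansible playbook so it persists across reboots: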

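```yaml
- { key: net.ipv4.conf.all.rp_filter,     value: "0" }
- { key: net.ipv4.conf.default.rp_filter, value: "0" }
- { key: net.ipv4.conf.eth0.rp_filter,    value: "0" }   # "default" only applies to interfaces created later
```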
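
Because the kernel uses the maximum of the `all` value and the per-interface value, reading a single sysctl can be misleading. The following sketch computes the effective mode for one interface; the function names are hypothetical and the interface name is whatever your node uses.

```bash
#!/usr/bin/env bash
# Sketch: report the effective rp_filter mode for an interface.

# The kernel applies max(conf.all.rp_filter, conf.<iface>.rp_filter).
effective_rp_filter() {
  local all=$1 iface=$2
  echo $(( all > iface ? all : iface ))
}

check_rp_filter() {
  local dev=$1 all iface eff
  all=$(cat /proc/sys/net/ipv4/conf/all/rp_filter) || return 2
  iface=$(cat "/proc/sys/net/ipv4/conf/$dev/rp_filter") || return 2
  eff=$(effective_rp_filter "$all" "$iface")
  if [ "$eff" -eq 1 ]; then
    echo "$dev: strict (1), replies for LoadBalancer traffic may be dropped"
    return 1
  fi
  echo "$dev: ok ($eff)"
}
```

Running `check_rp_filter eth0` on each node makes a reasonable preflight check: it exits non-zero when the interface is effectively in strict mode, even if one of the two sysctls reads 0.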

Debugging with the Cilium CLI

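Two different CLIs are involved. The agent's debug CLI runs inside the Cilium pods via kubectl exec (cilium-dbg in current releases, plain cilium in older ones) and is the primary tool for inspecting a node's datapath. The cilium CLI installed on your workstation manages the installation and runs the connectivity test.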

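```bash
# Run the agent CLI inside one Cilium pod (kubectl picks an arbitrary node)
cdbg() { kubectl -n kube-system exec ds/cilium -- cilium-dbg "$@"; }

# Check Cilium status and health
cdbg status --verbose

# Monitor live traffic flows (like tcpdump for Cilium)
cdbg monitor --type drop   # only show dropped packets
cdbg monitor --type l7     # L7 HTTP flows

# Check eBPF map for a Service
cdbg service list
cdbg service get <id>

# End-to-end connectivity test (cilium CLI on your workstation)
cilium connectivity test

# Check BPF programs loaded (run on the node; Cilium's programs are named cil_*)
bpftool prog list | grep cil_
```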

Hubble is the higher-level observability tool:

```bash
# The hubble CLI needs to reach Hubble Relay first, for example:
cilium hubble port-forward &

# Real-time flow view
hubble observe --follow

# Filter by namespace
hubble observe --namespace default --follow

# Show dropped flows only
hubble observe --verdict DROPPED

# HTTP flows for a specific pod
hubble observe --pod frontend --protocol HTTP
```

kube-proxy vs Cilium eBPF

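|                     | kube-proxy                              | Cilium eBPF                          |
|---------------------|-----------------------------------------|--------------------------------------|
| Service routing     | iptables DNAT chains                    | eBPF socket LB / hash maps           |
| Scalability         | O(n) rules                              | O(1) lookup                          |
| LoadBalancer IPs    | needs MetalLB/cloud                     | Cilium L2 announcements              |
| Observability       | none                                    | Hubble flows                         |
| Node port           | iptables                                | eBPF NodePort                        |
| Connection tracking | kernel conntrack                        | eBPF CT (bypasses conntrack)         |
| Network policies    | not handled (left to the CNI, often iptables) | eBPF (more expressive)         |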

The operational tradeoff: when something goes wrong, you debug with cilium CLI instead of iptables -L, which is more pleasant. The eBPF stack has fewer moving parts and the tooling is better.

The main reason not to switch: if your team is deeply familiar with iptables debugging and has existing tooling around it, the learning curve for eBPF-based debugging is real. Budget time for it.