Introduction
Kubernetes has become the de‑facto platform for running containerized workloads at scale. While most developers interact with the API server, pods, and services daily, the underlying networking layer remains a black box for many. Yet, a solid grasp of how Kubernetes networking works internally is essential for backend engineers who need to:
- Diagnose connectivity issues quickly.
- Design resilient multi‑tier applications.
- Implement secure network policies.
- Choose the right CNI plugin for their workload characteristics.
This guide dives deep into the internals of Kubernetes networking, covering everything from the Linux network namespace that isolates each pod to the sophisticated routing performed by kube-proxy. Along the way, you’ll find practical code snippets, YAML examples, and real‑world context that you can apply to production clusters today.
Core Concepts
Pods and IP Addresses
At the heart of Kubernetes networking is the pod—the smallest deployable unit. Each pod runs one or more containers sharing:
- A network namespace (isolated network stack).
- An IP address that is routable throughout the cluster.
- A port space (all containers in a pod share the same ports).
Note: Pods are flat in the networking sense—there is no concept of “sub‑net per namespace” unless the chosen CNI implements it.
When a pod is created, the kubelet asks the configured Container Network Interface (CNI) plugin to allocate an IP address from the cluster’s pod CIDR (e.g., 10.244.0.0/16). The plugin also sets up a virtual Ethernet pair (veth) that bridges the pod’s network namespace to the node’s root namespace, enabling traffic to leave the host.
Services and Virtual IPs
A Service abstracts a set of pods behind a stable ClusterIP (virtual IP). Clients inside the cluster can reach the service via my-service.my-namespace.svc.cluster.local. The service object does not own a physical NIC; instead, it relies on kube-proxy to program rules that translate the virtual IP to the actual pod IPs.
Network Namespaces
Linux network namespaces provide the isolation that makes pod‑level networking possible. Each pod gets its own namespace, with its own:
- IP address.
- Routing table.
- Firewall rules (
iptables/nftables).
The node’s root namespace contains the bridge (or overlay device) that connects all pod veth pairs. This bridge is where inter‑pod traffic is switched or routed, depending on the CNI implementation.
The CNI (Container Network Interface) Model
CNI is a specification and runtime for configuring network interfaces in Linux containers. Kubernetes delegates the low‑level plumbing to CNI plugins, which can be broadly classified into two categories:
| Category | Typical Use‑Case | Example Plugins |
|---|---|---|
| Overlay | Clusters spanning multiple subnets or cloud regions; abstracts the underlying network. | Flannel (VXLAN), Weave Net |
| Underlay / Native | High‑performance, low‑latency workloads; leverages existing L2/L3 fabric. | Calico (BGP), Cilium (eBPF) |
CNI Plugin Lifecycle
When a pod is scheduled, the kubelet invokes the CNI binary with a ADD command:
/opt/cni/bin/calico \
ADD \
/var/run/cni/netns/<pod-netns-id> \
/etc/cni/net.d/10-calico.conflist
The plugin:
- Allocates an IP from its pool (or via IPAM).
- Creates a veth pair (
veth0in the pod,veth1on the host). - Attaches
veth1to a bridge or routes it via BGP. - Optionally applies network policies (e.g., Calico’s policy engine).
On pod termination, the kubelet calls the DEL command, and the plugin tears down the interface and releases the IP.
Pod‑to‑Pod Communication
Overlay vs. Underlay
Overlay Networks (e.g., Flannel VXLAN) encapsulate pod traffic in a second tunnel header (VXLAN, Geneve). This allows pods on different nodes to communicate even if the underlying network does not provide direct routing for pod CIDRs.
Underlay Networks (e.g., Calico with BGP) advertise each node’s pod CIDR directly into the data‑center routing fabric. Traffic flows natively, avoiding encapsulation overhead.
IP Masquerading and NAT
Kubernetes performs masquerade (SNAT) for traffic that exits the cluster (e.g., to the internet). By default, the iptables chain POSTROUTING rewrites the source IP to the node’s primary IP for pods whose destination is not within the cluster CIDR.
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
Masquerading can be disabled per‑namespace or per‑pod using the IPMasqAgent or by annotating the pod with k8s.v1.cni.cncf.io/networks.
Service Types and Their Networking
| Service Type | Cluster‑Internal Behavior | External Exposure |
|---|---|---|
| ClusterIP | Virtual IP reachable only inside the cluster. | None |
| NodePort | Exposes the same ClusterIP on a static port of each node (30000‑32767). | Direct node‑level access |
| LoadBalancer | Provisions an external load balancer (cloud provider) that forwards to NodePorts. | Cloud‑LB IP |
| ExternalName | Maps a service name to an external DNS name; no IP allocation. | DNS CNAME |
ClusterIP Implementation
When a client sends traffic to a ClusterIP, kube-proxy intercepts the packet (via iptables or IPVS) and performs DNAT to one of the backend pod IPs:
# iptables example (simplified)
-A KUBE-SVC-XXXXX -j KUBE-SEP-YYYYY
-A KUBE-SEP-YYYYY -s 10.244.1.5 -j DNAT --to-destination 10.244.2.7:8080
kube-proxy also maintains session affinity (if enabled) by persisting the chosen backend for a given client IP.
NodePort Flow
- External client →
NodeIP:NodePort. kube-proxyDNATs to the ClusterIP.- Step 2 repeats the standard ClusterIP flow.
LoadBalancer Flow
The cloud provider’s LB forwards traffic to the node’s NodePort, after which the same path as above is followed.
kube-proxy: How It Implements Service Routing
kube-proxy runs as a daemonset on every node. It watches the API server for Service and Endpoint objects and programs the node’s kernel accordingly. There are two primary modes:
iptables Mode
- Stateless: each packet is processed by a chain of
iptablesrules. - Pros: No extra userspace process; low overhead.
- Cons: Large rule sets can degrade performance at scale.
Example iptables Chains
# KUBE-SERVICES chain contains a rule for each Service
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m comment --comment "my-svc" -m recent --name KUBE-SEP-abc123 --rsource -j KUBE-SEP-abc123
# KUBE-SEP-abc123 performs DNAT to a random pod endpoint
-A KUBE-SEP-abc123 -m statistic --mode random --probability 0.3333333333 -j KUBE-SEP-1
-A KUBE-SEP-abc123 -m statistic --mode random --probability 0.5000000000 -j KUBE-SEP-2
-A KUBE-SEP-abc123 -j KUBE-SEP-3
IPVS Mode
- Stateful: uses Linux IP Virtual Server (IPVS) to create virtual services and real servers.
- Pros: Handles tens of thousands of services with minimal rule bloat; supports more load‑balancing algorithms.
- Cons: Requires the
ipvsadmpackage and kernel modules.
IPVS Configuration Example
# Create a virtual service for the ClusterIP
ipvsadm -A -t 10.96.0.1:80 -s rr
# Add real servers (pod IPs)
ipvsadm -a -t 10.96.0.1:80 -r 10.244.2.7:8080 -g
ipvsadm -a -t 10.96.0.1:80 -r 10.244.2.8:8080 -g
Both modes ultimately achieve the same result—mapping a stable virtual IP to a set of pod IPs—but IPVS is preferred for high‑throughput, large‑scale clusters.
DNS in Kubernetes
Kubernetes provides service discovery via an internal DNS server, usually CoreDNS. Each Service gets a DNS entry of the form:
<service-name>.<namespace>.svc.cluster.local
CoreDNS runs as a Deployment with a ConfigMap that determines the upstream resolvers and stub zones. When a pod queries my-service, the DNS request is resolved to the Service’s ClusterIP, after which the kube-proxy routing steps apply.
CoreDNS Configuration Snippet
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
data:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
Network Policies
Network Policies let you enforce zero‑trust security at the pod level. They are implemented by the underlying CNI (e.g., Calico, Cilium) and translated into iptables or eBPF rules.
Example NetworkPolicy YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-allow-frontend
namespace: prod
spec:
podSelector:
matchLabels:
app: backend
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
This policy allows only pods labeled app=frontend to reach app=backend on port 8080. All other traffic is dropped.
Tip: When using Calico, you can inspect the generated iptables rules with
calicoctl get policy -o yaml.
Ingress Controllers and Service Meshes
Ingress Basics
An Ingress resource defines HTTP(S) routing rules for external traffic, but it requires an Ingress Controller (e.g., NGINX, HAProxy, Traefik) to implement those rules. The controller watches Ingress objects and configures a load balancer (often a NodePort + external LB) accordingly.
Minimal Ingress Example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-ingress
namespace: prod
spec:
ingressClassName: nginx
rules:
- host: api.example.com
http:
paths:
- path: /v1
pathType: Prefix
backend:
service:
name: api-service
port:
number: 80
Service Mesh (Istio) Networking
A service mesh injects a sidecar proxy (Envoy) into each pod, turning the mesh into a layer‑7 networking fabric. The mesh handles:
- Fine‑grained traffic routing (canary, A/B testing).
- Mutual TLS (mTLS) for pod‑to‑pod encryption.
- Observability (metrics, tracing).
Istio’s control plane (istiod) distributes EnvoyFilter configurations, while the data plane proxies intercept all inbound/outbound traffic, effectively bypassing kube-proxy for intra‑mesh communication.
Observability and Troubleshooting
Essential Tools
| Tool | Use‑Case |
|---|---|
kubectl exec -it <pod> -- sh | Inspect container network namespace |
nsenter (via netshoot image) | Jump into host or pod netns |
tcpdump -i any | Capture live traffic |
ip netns list | List namespaces (useful on node) |
iptables -L -t nat -n -v | Verify DNAT/SNAT rules |
ipvsadm -Ln | Inspect IPVS virtual services |
calicoctl get policy | View Calico network policies |
cilium monitor | Real‑time eBPF event stream |
Debugging Workflow
- Validate DNS –
nslookup my-service.my-namespace.svc.cluster.local. - Check Service Endpoints –
kubectl get endpoints my-service -o yaml. - Inspect iptables/IPVS – ensure DNAT rules exist for the Service IP.
- Trace the packet – use
tcpdumpon the node’s bridge (cbr0orflannel.1). - Confirm CNI state –
calicoctl node statusorcilium status. - Review NetworkPolicy logs – some CNIs emit logs when a policy blocks traffic.
Real‑World Example: Deploying a Multi‑Tier Application
Consider a classic three‑tier web app:
- Frontend (React) → Service
frontend. - Backend API (Go) → Service
api. - Database (PostgreSQL) → Service
postgres.
Deployment Manifests
# frontend-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend
labels:
app: frontend
spec:
replicas: 3
selector:
matchLabels:
app: frontend
template:
metadata:
labels:
app: frontend
spec:
containers:
- name: nginx
image: nginx:stable-alpine
ports:
- containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
name: frontend
labels:
app: frontend
spec:
selector:
app: frontend
ports:
- protocol: TCP
port: 80
targetPort: 80
type: LoadBalancer
# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
labels:
app: api
spec:
replicas: 2
selector:
matchLabels:
app: api
template:
metadata:
labels:
app: api
spec:
containers:
- name: api
image: myorg/api:1.2.3
ports:
- containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
name: api
labels:
app: api
spec:
selector:
app: api
ports:
- protocol: TCP
port: 80
targetPort: 8080
type: ClusterIP
# postgres-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres
labels:
app: postgres
spec:
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:13
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: pg-secret
key: password
ports:
- containerPort: 5432
---
apiVersion: v1
kind: Service
metadata:
name: postgres
labels:
app: postgres
spec:
selector:
app: postgres
ports:
- protocol: TCP
port: 5432
targetPort: 5432
type: ClusterIP
Network Flow Description
- Client → Frontend: External traffic hits the cloud LB, which forwards to a node’s
NodePort.kube-proxyDNATs to thefrontendService’s ClusterIP, then to one of the three NGINX pods. - Frontend → API: The NGINX container resolves
apivia CoreDNS (api.default.svc.cluster.local→ ClusterIP). The request is DNAT‑routed to one of the API pods. - API → PostgreSQL: The Go process connects to
postgres:5432. CoreDNS resolves the Service,kube-proxyroutes to the single PostgreSQL pod. - Response Path: Replies follow the reverse path, with NAT ensuring the source IP appears as the originating Service IP (unless
externalTrafficPolicy: Localis set on the LoadBalancer).
Performance tip: If latency between API and DB is a concern, consider colocating them on the same node and using a
hostNetwork: truepod for the DB (only for trusted workloads) to eliminate an extra hop.
Best Practices
Plan Pod CIDRs Carefully
- Avoid overlapping CIDRs across clusters or VPNs.
- Reserve a /16 per cluster for future scaling.
Choose the Right CNI for Your Workload
- Overlay (Flannel, Weave) – Simple, works in most cloud VPCs.
- BGP (Calico) – Low overhead, good for large clusters.
- eBPF (Cilium) – High performance, advanced security (transparent encryption).
Secure Ingress/Egress
- Enforce NetworkPolicies at namespace boundaries.
- Use PodSecurityPolicies (or the newer Pod Security Standards) to restrict privileged containers.
Monitor
kube-proxyMode- For >10k Services, switch to IPVS (
--proxy-mode=ipvs). - Verify kernel modules:
lsmod | grep ip_vs.
- For >10k Services, switch to IPVS (
Leverage Service Mesh Sparingly
- Meshes add latency and operational complexity. Deploy only where you need advanced traffic management or zero‑trust mTLS.
Enable IP Masquerade Agent (if needed)
- Prevent IP spoofing and ensure correct source IPs for external services.
Regularly Audit DNS
- Run
corednshealth checks and monitor query latency; DNS failures are a common cause of “service not reachable” alerts.
- Run
Conclusion
Kubernetes networking is a layered masterpiece: from the low‑level Linux namespaces that give each pod its own IP stack, through the CNI plugins that weave those namespaces into a coherent cluster‑wide fabric, up to the higher‑level abstractions of Services, Ingress, and Service Meshes. Understanding the internals—how kube-proxy programs iptables or IPVS, how CoreDNS resolves service names, and how network policies translate into kernel rules—empowers backend engineers to design resilient architectures, troubleshoot elusive connectivity bugs, and make informed decisions about CNI and security strategies.
By mastering these concepts, you’ll be able to:
- Diagnose and fix network issues faster than ever.
- Optimize performance by selecting the right networking stack.
- Harden your workloads with fine‑grained policies.
- Seamlessly integrate advanced traffic routing via Ingress or a Service Mesh.
Kubernetes continues to evolve, but the foundational networking principles covered here will remain relevant as the ecosystem matures. Keep experimenting, stay current with CNI releases, and always pair theory with hands‑on validation in a test cluster.
Resources
Kubernetes Official Documentation – Networking
https://kubernetes.io/docs/concepts/cluster-administration/networking/CNI Specification
https://github.com/containernetworking/cniCalico Network Policies Guide
https://projectcalico.org/docs/Cilium eBPF Networking
https://cilium.io/IPVS – Linux Virtual Server
https://www.linuxvirtualserver.org/software/ipvs.html
Feel free to dive into these resources to deepen your understanding, experiment with different CNI plugins, and stay up‑to‑date with the latest networking advances in Kubernetes. Happy clustering!