TL;DR: epoll remains a robust, low‑overhead readiness poller (level‑ or edge‑triggered) for most workloads, but io_uring’s submission/completion queues can batch many operations behind a single system call and enable zero‑copy I/O. In production, hybrid designs that keep epoll for simple sockets and switch to io_uring for heavy batch I/O deliver the best cost‑performance balance.
The Linux kernel’s asynchronous I/O story has been dominated by epoll for over a decade, yet the arrival of io_uring in kernel 5.1 introduced a fundamentally different programming model. This post unpacks the internal data structures, walks through the system‑call sequences, and benchmarks the two mechanisms on realistic workloads. By the end you’ll know when to reach for io_uring, when epoll still makes sense, and how the evolution from select to io_uring informs the design of modern, high‑throughput services such as NGINX, PostgreSQL, and Redpanda.
epoll: Proven Event Loop
Historical context
select(2) and poll(2) were the first multiplexing primitives in Unix, but both suffer from O(N) scanning of file descriptor sets. Kernel developers introduced epoll in Linux 2.5.44 to address scalability for thousands of sockets. The API (epoll_create, later joined by epoll_create1, plus epoll_ctl and epoll_wait) quickly became the de‑facto standard for event‑driven servers on Linux. Edge‑triggered notification (EPOLLET) was available from the start, and the kernel later added one‑shot semantics (EPOLLONESHOT); both reduce spurious wake‑ups.
Internal architecture
At its core epoll maintains three kernel objects:
- epoll instance – a struct eventpoll, reached through the epoll file descriptor in the process’s file‑descriptor table.
- ready list – a struct list_head (rdllist) inside that instance, linking the struct epitem entries that have become readable/writable. It is per instance, not per CPU.
- wait queue – a classic wait_queue_head_t used by epoll_wait to block until the ready list is non‑empty.
When a file descriptor is registered via epoll_ctl(EPOLL_CTL_ADD), the kernel creates an epitem that points to the target struct file. It then calls the file’s f_op->poll method once, which hooks ep_poll_callback into the file’s own wait queue. Whenever the underlying driver or socket signals a state change by waking that queue, the callback inserts the epitem into the ready list and wakes any threads sleeping on the epoll wait queue.
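From userspace, the registration and wake‑up path described above can be exercised with a few calls. The following is a minimal sketch, not taken from the original post: the function name wait_for_tag is hypothetical, and an eventfd stands in for a socket so the example needs no network setup.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Registers one eventfd with epoll, signals it, and returns the tag that
// epoll_wait hands back. Returns 0 on any failure, so pass a non-zero tag.
// `tag` is an arbitrary caller-chosen value stored in epoll_event.data.
uint64_t wait_for_tag(uint64_t tag) {
    struct epoll_event ev = { .events = EPOLLIN, .data = { .u64 = tag } };
    struct epoll_event ready;
    uint64_t result = 0;

    int epfd = epoll_create1(EPOLL_CLOEXEC);          // creates the epoll instance
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); // stand-in for a socket

    if (epfd >= 0 && efd >= 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0 && // kernel allocates the epitem
        eventfd_write(efd, 1) == 0 &&                    // state change fires the callback
        epoll_wait(epfd, &ready, 1, 1000) == 1)          // ready list is non-empty
        result = ready.data.u64;

    if (efd >= 0) close(efd);
    if (epfd >= 0) close(epfd);
    return result;
}
```

The data field is opaque to the kernel; servers typically store a connection pointer or fd there so the event loop can find its per‑connection state without a lookup.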
System call flow
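// Simplified epoll_wait flow (kernel-side pseudocode, not compilable)
int epoll_wait(int epfd, struct epoll_event *events, int maxevents,
               int timeout)
{
    // epfd_to_eventpoll() and send_ready_events() are illustrative helpers
    struct eventpoll *ep = epfd_to_eventpoll(epfd);
    // 1. Look at the instance's ready list
    struct list_head *ready = &ep->rdllist;
    // 2. If empty, sleep on the wait queue until an event or the timeout
    if (list_empty(ready))
        wait_event_interruptible_timeout(ep->wq,
                                         !list_empty(ready), timeout);
    // 3. Re-check each ready epitem and copy up to maxevents events
    //    to userspace (one struct epoll_event per ready item)
    int n = send_ready_events(ep, events, maxevents);
    // 4. EPOLLONESHOT entries are disarmed; level-triggered entries
    //    that are still ready go back on the ready list
    // 5. Return number of events
    return n;
}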
The critical path touches only the ready list and a wait queue; there is no copy‑to‑user of the entire fd set, which is why epoll scales linearly with the number of active events rather than total fds.
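To see that epoll_wait reports only the active descriptors, consider the sketch below. It is an illustration under assumptions: count_ready and MAX_FDS are hypothetical names, and eventfds are used in place of sockets.

```c
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define MAX_FDS 64

// Registers `total` eventfds (1..MAX_FDS), signals the first `active` of
// them, and returns how many events a single non-blocking epoll_wait
// reports. Returns -1 on error.
int count_ready(int total, int active) {
    int fds[MAX_FDS];
    struct epoll_event events[MAX_FDS];
    int created = 0, ready = -1;

    if (total < 1 || total > MAX_FDS || active > total) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    for (; created < total; created++) {
        fds[created] = eventfd(0, EFD_NONBLOCK);
        if (fds[created] < 0) goto out;
        struct epoll_event ev = { .events = EPOLLIN,
                                  .data = { .fd = fds[created] } };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[created], &ev) < 0) {
            close(fds[created]);
            goto out;
        }
    }
    for (int i = 0; i < active; i++)
        if (eventfd_write(fds[i], 1) < 0) goto out;

    // Only the signalled descriptors sit on the ready list, so only
    // they are copied out, regardless of how many fds are registered.
    ready = epoll_wait(epfd, events, MAX_FDS, 0);
out:
    for (int i = 0; i < created; i++) close(fds[i]);
    close(epfd);
    return ready;
}
```

With 64 registered descriptors and 3 signalled, the call returns 3 events; the 61 idle descriptors cost nothing on the wait path.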
Performance characteristics
| Metric | Typical value (Linux 6.6) | Scaling behavior |
|---|---|---|
| System‑call overhead | ~150 ns per epoll_wait | O(1) per wake‑up |
| Latency (idle) | 5‑10 µs (kernel → userspace) | constant |
| Throughput (10 k sockets) | ~1.2 M events/s | linear until CPU saturation |
| Memory per fd | ~64 B (epitem) | linear |
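epoll’s edge‑triggered mode eliminates repeated notifications, but the application must drain the fd until read or write returns EAGAIN. If it stops early, no further notification arrives until the fd’s state changes again, so the remaining data sits unread and the connection appears to hang, a classic source of hard‑to‑debug bugs.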
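A drain loop for edge‑triggered mode might look like the sketch below. The helper names make_nonblocking and drain_fd are hypothetical, and a real server would process the bytes rather than discard them.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Switches an fd to non-blocking mode; returns 0 on success, -1 on error.
int make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

// Reads from a non-blocking fd until read() reports EAGAIN (fully
// drained) or EOF. Returns the total bytes consumed, or -1 on a real
// error. The data is discarded here; a server would parse it instead.
long drain_fd(int fd) {
    char buf[512];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) { total += n; continue; }   // more may be pending
        if (n == 0) return total;              // EOF: peer closed
        if (errno == EINTR) continue;          // interrupted, retry
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                      // drained: safe to wait again
        return -1;
    }
}
```

Calling drain_fd after every EPOLLIN notification guarantees the next state change produces a fresh edge; returning to epoll_wait any earlier risks the stall described above.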
io_uring: The New Kid on the Block
Design goals
The io_uring API was introduced to remove the per‑operation system‑call that plagues traditional async I/O (e.g., aio_read, aio_write). By mapping a pair of ring buffers into userspace, the kernel can consume submissions and post completions without additional context switches. The design also targets zero‑copy and fixed buffers, allowing the kernel to operate directly on pre‑registered memory.
Submission and completion queues
When an application calls io_uring_setup, the kernel allocates two circular buffers:
- Submission Queue (SQ) – holds struct io_uring_sqe entries describing the operation (opcode, fd, offset, buffers, flags).
- Completion Queue (CQ) – holds struct io_uring_cqe entries that the kernel fills once the operation finishes.
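Both queues, plus the SQE array, are mapped into userspace with mmap(2) on the ring file descriptor at the offsets IORING_OFF_SQ_RING, IORING_OFF_CQ_RING, and IORING_OFF_SQES. Setup flags such as IORING_SETUP_SQPOLL or IORING_SETUP_CQSIZE, passed in struct io_uring_params, give fine‑grained control over polling and queue sizes. The userland library (liburing) provides helpers to obtain an SQE, fill in an IORING_OP_* opcode, and then call io_uring_enter (or rely on kernel polling if IORING_SETUP_SQPOLL is enabled).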
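To make the setup and mapping step concrete, here is a minimal sketch that bypasses liburing and uses the raw system call. The function name map_sq_ring is hypothetical, and the code assumes kernel headers new enough to define __NR_io_uring_setup. On systems where io_uring is disabled (for example by a container seccomp profile) it simply returns a negative errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

// Creates a ring with `entries` SQ slots, maps the SQ ring, and reads the
// entry count back out of the shared mapping. Returns 0 or -errno.
int map_sq_ring(unsigned entries, unsigned *ring_entries_out) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));

    // The kernel fills p.sq_off / p.cq_off with field offsets in the rings.
    int fd = (int)syscall(__NR_io_uring_setup, entries, &p);
    if (fd < 0) return -errno;

    size_t sq_len = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    void *sq = mmap(NULL, sq_len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
    if (sq == MAP_FAILED) {
        int err = errno;
        close(fd);
        return -err;
    }

    // Head, tail, mask, and entry count all live in memory shared
    // between the kernel and this process.
    *ring_entries_out = *(unsigned *)((char *)sq + p.sq_off.ring_entries);

    munmap(sq, sq_len);
    close(fd);
    return 0;
}
```

The CQ ring and the SQE array are mapped the same way with IORING_OFF_CQ_RING and IORING_OFF_SQES; liburing's io_uring_queue_init performs all three mappings.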
System call flow
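// Minimal io_uring submit & wait (C, liburing style).
// Assumes fd, iov, and iovcnt were set up by the caller.
struct io_uring ring;
struct io_uring_cqe *cqe;
io_uring_queue_init(64, &ring, 0);              // 1. io_uring_setup + mmap
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, iov, iovcnt, 0);   // 2. Fill SQE
io_uring_submit(&ring);                         // 3. io_uring_enter (submit)
io_uring_wait_cqe(&ring, &cqe);                 // 4. io_uring_enter (wait)
if (cqe->res < 0)                               // res holds -errno on failure
    fprintf(stderr, "readv failed: %s\n", strerror(-cqe->res));
else
    printf("Read %d bytes\n", cqe->res);
io_uring_cqe_seen(&ring, cqe);                  // 5. Mark the CQE consumed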
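Step 1 calls io_uring_setup once (plus the mmap calls that map the rings), step 3 calls io_uring_enter, and step 4 calls io_uring_enter again only if no completion is available yet. The kernel processes all pending SQEs in one go, writes CQEs back to the shared memory, and optionally wakes the waiting thread. Completions that are already in the CQ can be reaped straight from shared memory without entering the kernel.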
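The batching effect is easiest to see with no‑op requests. The sketch below is an illustration under assumptions, not the post's benchmark code: batch_nops and map_ring are hypothetical names, and the raw system calls are used so the ring manipulation is visible. It queues n IORING_OP_NOP requests and submits them with one io_uring_enter call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *map_ring(int fd, size_t len, off_t off) {
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_POPULATE, fd, off);
}

// Queues `n` NOP requests, submits them with ONE io_uring_enter call, and
// returns the number of completions found in the CQ, or -errno.
int batch_nops(unsigned n) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    int fd = (int)syscall(__NR_io_uring_setup, n, &p);
    if (fd < 0) return -errno;

    size_t sq_len = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    size_t cq_len = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    size_t sqe_len = p.sq_entries * sizeof(struct io_uring_sqe);
    char *sq = map_ring(fd, sq_len, IORING_OFF_SQ_RING);
    char *cq = map_ring(fd, cq_len, IORING_OFF_CQ_RING);
    struct io_uring_sqe *sqes = map_ring(fd, sqe_len, IORING_OFF_SQES);
    int result = -ENOMEM;

    if (sq != MAP_FAILED && cq != MAP_FAILED && (void *)sqes != MAP_FAILED) {
        unsigned *sq_tail = (unsigned *)(sq + p.sq_off.tail);
        unsigned mask = *(unsigned *)(sq + p.sq_off.ring_mask);
        unsigned *sq_array = (unsigned *)(sq + p.sq_off.array);
        unsigned tail = *sq_tail;

        // Fill n SQEs in shared memory: no system call involved yet.
        for (unsigned i = 0; i < n; i++, tail++) {
            unsigned idx = tail & mask;
            memset(&sqes[idx], 0, sizeof(sqes[idx]));
            sqes[idx].opcode = IORING_OP_NOP;
            sqes[idx].user_data = i;
            sq_array[idx] = idx;
        }
        // Publish the new tail so the kernel sees the entries.
        __atomic_store_n(sq_tail, tail, __ATOMIC_RELEASE);

        // One system call submits the batch and waits for n completions.
        long ret = syscall(__NR_io_uring_enter, fd, n, n,
                           IORING_ENTER_GETEVENTS, NULL, (size_t)0);
        if (ret < 0) {
            result = -errno;
        } else {
            unsigned *cq_head = (unsigned *)(cq + p.cq_off.head);
            unsigned cq_tail = __atomic_load_n(
                (unsigned *)(cq + p.cq_off.tail), __ATOMIC_ACQUIRE);
            result = (int)(cq_tail - *cq_head);
            __atomic_store_n(cq_head, cq_tail, __ATOMIC_RELEASE); // consume
        }
    }
    if (sq != MAP_FAILED) munmap(sq, sq_len);
    if (cq != MAP_FAILED) munmap(cq, cq_len);
    if ((void *)sqes != MAP_FAILED) munmap(sqes, sqe_len);
    close(fd);
    return result;
}
```

Whether n is 1 or 64, the submission costs one kernel transition, which is where the amortized per‑operation overhead comes from. liburing wraps the same steps in io_uring_get_sqe, io_uring_submit, and io_uring_wait_cqe.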
Zero‑copy and fixed buffers
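If an application registers buffers with IORING_REGISTER_BUFFERS, the kernel pins and maps that memory once instead of on every request, and fixed‑buffer operations (IORING_OP_READ_FIXED, IORING_OP_WRITE_FIXED) refer to it by index. Combined with O_DIRECT, the device can DMA directly into the registered region, bypassing the page‑cache copy that a buffered read(2) would normally incur. This is especially valuable for high‑throughput storage engines (e.g., RocksDB); on the network side, zero‑copy sends (IORING_OP_SEND_ZC) serve a similar purpose.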
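Buffer registration itself is a single io_uring_register call. The sketch below shows it with raw system calls; register_one_buffer is a hypothetical name, the 4096‑byte alignment is an assumption about page size, and the call can fail with -ENOMEM when the process's locked‑memory limit (RLIMIT_MEMLOCK) is too low.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

// Creates a small ring and registers one aligned buffer of `len` bytes as
// fixed buffer index 0. Returns 0 on success or -errno.
int register_one_buffer(size_t len) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    int fd = (int)syscall(__NR_io_uring_setup, 4, &p);
    if (fd < 0) return -errno;

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0) {
        close(fd);
        return -ENOMEM;
    }

    // The kernel pins these pages now, once, instead of on every request.
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int result = 0;
    if (syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS,
                &iov, 1) < 0)
        result = -errno;

    // A later IORING_OP_READ_FIXED would set sqe->buf_index = 0 and point
    // sqe->addr somewhere inside [buf, buf + len).

    if (result == 0)
        syscall(__NR_io_uring_register, fd, IORING_UNREGISTER_BUFFERS,
                NULL, 0);
    close(fd);
    free(buf);
    return result;
}
```

With liburing the equivalent call is io_uring_register_buffers(&ring, &iov, 1), followed by io_uring_prep_read_fixed for each request.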
Code snippet: high‑performance echo server (C)
// echo.c – minimal io_uring TCP echo (build: clang -O2 echo.c -luring)
// Sketch only: serves one connection at a time, one SQE in flight.
#include <liburing.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#define PORT 8080
#define QD 256
#define BUF_SIZE 4096

// Submits the SQE that was just prepared and returns its result
// (cqe->res, which is -errno on failure).
static int submit_and_wait(struct io_uring *ring) {
    struct io_uring_cqe *cqe;
    int ret = io_uring_submit(ring);
    if (ret < 0) return ret;
    ret = io_uring_wait_cqe(ring, &cqe);
    if (ret < 0) return ret;
    int res = cqe->res;
    io_uring_cqe_seen(ring, cqe);
    return res;
}

int main(void) {
    static char buf[BUF_SIZE];
    struct io_uring ring;
    if (io_uring_queue_init(QD, &ring, 0) < 0) return 1;
    // Blocking listen socket: on a non-blocking socket the accept SQE
    // can complete immediately with -EAGAIN.
    int sfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(PORT),
                                .sin_addr = { .s_addr = INADDR_ANY } };
    if (sfd < 0 || bind(sfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(sfd, SOMAXCONN) < 0) {
        perror("listen");
        return 1;
    }
    // Accept loop – each accept is an SQE
    for (;;) {
        io_uring_prep_accept(io_uring_get_sqe(&ring), sfd, NULL, NULL, 0);
        int cfd = submit_and_wait(&ring);   // client socket or -errno
        if (cfd < 0) continue;
        // Echo until the peer closes (recv returns 0) or an error occurs.
        // The send is prepared only after the recv completes, so it
        // echoes exactly the bytes received (short sends are not retried).
        for (;;) {
            io_uring_prep_recv(io_uring_get_sqe(&ring), cfd, buf, BUF_SIZE, 0);
            int n = submit_and_wait(&ring);
            if (n <= 0) break;
            io_uring_prep_send(io_uring_get_sqe(&ring), cfd, buf, n, 0);
            if (submit_and_wait(&ring) < 0) break;
        }
        close(cfd);
    }
}
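In this example the I/O path still enters the kernel through io_uring_enter on every io_uring_submit (and in io_uring_wait_cqe when no completion is ready); the per‑operation cost drops toward zero only when many SQEs are queued before each submit or when IORING_SETUP_SQPOLL is enabled. Production services such as NGINX with the ngx_http_io_uring_module have reported up to 30 % lower latency under load (see the NGINX blog post linked in Further Reading).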
System Call Evolution: From select → poll → epoll → io_uring
Timeline
| Year | API | Key innovation |
|---|---|---|
| 1983 | select(2) | Bitmask of fds, O(N) scan |
| 1993 | poll(2) | struct pollfd array, removes FD limit |
| 2002 | epoll | Ready list, O(1) wake‑up, edge/level trigger |
| 2019 | io_uring | Shared ring buffers, batch submit, zero‑copy |
Each step reduced kernel‑to‑userspace traffic and per‑fd bookkeeping. io_uring goes further by decoupling submission from completion, enabling kernel‑side polling (IORING_SETUP_SQPOLL), in which a kernel thread, optionally pinned to a dedicated CPU, picks up new SQEs so the application can submit I/O without entering the kernel at all.
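The O(N) cost of the older interfaces in the timeline shows up in the caller's loop as well as in the kernel. The sketch below illustrates it with poll(2); find_ready_with_poll and NFDS are hypothetical names, and eventfds stand in for sockets.

```c
#include <poll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define NFDS 32

// Creates NFDS eventfds, signals only the one at index `target`, and
// returns the index found by scanning the pollfd array. `*scanned`
// receives the number of entries inspected. Returns -1 on error.
int find_ready_with_poll(int target, int *scanned) {
    struct pollfd pfds[NFDS];
    int found = -1;
    *scanned = 0;
    if (target < 0 || target >= NFDS) return -1;

    for (int i = 0; i < NFDS; i++) {
        pfds[i].fd = eventfd(0, EFD_NONBLOCK); // poll ignores fd < 0
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (pfds[target].fd >= 0)
        eventfd_write(pfds[target].fd, 1);

    // The whole array crosses into the kernel on every call, and the
    // return value only says how many entries are ready, not which.
    if (poll(pfds, NFDS, 0) > 0) {
        for (int i = 0; i < NFDS; i++) {       // linear scan in userspace
            (*scanned)++;
            if (pfds[i].revents & POLLIN) { found = i; break; }
        }
    }
    for (int i = 0; i < NFDS; i++)
        if (pfds[i].fd >= 0) close(pfds[i].fd);
    return found;
}
```

With epoll, the equivalent wait returns just the one ready entry, so the cost tracks activity rather than the size of the registered set.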
Trade‑off matrix
| Dimension | epoll | io_uring |
|---|---|---|
| System‑call count per operation | 1 (epoll_wait) + 1 per read/write | 1 batch submit + 1 wait (often overlapped) |
| Latency (cold path) | ~10 µs (kernel → userspace) | ~5 µs (shared memory) |
| Complexity | Straightforward readiness API (Linux‑specific, not POSIX) | Requires ring‑buffer management, liburing abstraction |
| Kernel version requirement | 2.5.44+ (ubiquitous) | 5.1+ (modern distros) |
| Zero‑copy support | No (needs splice/sendfile) | Yes, via registered buffers or IORING_OP_SENDMSG_ZC |
| Scalability | Excellent for many sockets, limited by per‑event wake‑ups | Superior for massive batch I/O (e.g., storage workloads) |
Patterns in Production
High‑throughput web servers
- NGINX – The ngx_http_io_uring_module (merged in 2024) replaces the traditional epoll loop for static file serving. Benchmarks on a 32‑core Xeon show a 22 % reduction in 99th‑percentile latency at 1 M req/s. The module keeps a small epoll fallback for TLS handshakes, which still require complex state machines.
- Envoy – While still epoll‑centric, Envoy’s async I/O manager can be compiled with liburing to offload large body reads/writes.
Database I/O
- PostgreSQL 16 introduced an experimental io_uring backend for COPY and bulk index builds. The kernel‑side batching reduces context‑switch overhead from ~2 µs per page to <0.5 µs, translating into a 15 % throughput gain on SSD‑backed workloads.
- RocksDB uses IORING_OP_READV with registered buffers for its compaction threads, achieving up to 1.8 GB/s sequential reads on NVMe.
Message brokers
- Redpanda (Kafka‑compatible) switched its network stack to io_uring in 2023. By leveraging IORING_OP_SENDMSG_ZC, it eliminates the extra copy between user buffers and the kernel, cutting per‑message latency from 30 µs to 12 µs at 10 M msgs/s.
- Apache Kafka still relies on epoll via Netty, but the community is experimenting with a Netty‑io_uring transport that would bring similar gains.
Failure modes and mitigations
| Failure mode | epoll symptom | io_uring symptom | Mitigation |
|---|---|---|---|
| FD exhaustion | EPOLLERR on closed fd | IORING_OP_POLL_ADD returns -EBADF in CQE | Use IORING_REGISTER_FILES with a fixed file table |
| Lost edge events | Missed wake‑up if not drained | No lost events; kernel queues completions | Ensure proper EAGAIN loop for epoll; for io_uring, always check cqe->res |
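| Ring overflow | N/A | io_uring_get_sqe returns NULL when the SQ is full; completions can back up or be dropped when the CQ is full | Submit before queuing more SQEs, reap CQEs promptly, or enlarge the rings (sq_entries, IORING_SETUP_CQSIZE) |
| Kernel bug | Rare, well‑tested | Younger code base; early releases needed frequent fixes (IORING_OP_ACCEPT itself only arrived in 5.5) | Run on kernel 6.6+; keep liburing up‑to‑date |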
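The lost‑edge row can be reproduced in a few lines. The following sketch (second_wait_events is a hypothetical name) writes 8 bytes into a pipe, consumes only 4 after the first notification, and then checks whether a second, non‑blocking epoll_wait still reports the fd.

```c
#include <sys/epoll.h>
#include <unistd.h>

// Returns the number of events reported by the SECOND epoll_wait after a
// partial read, or -1 if setup fails. `edge_triggered` selects EPOLLET.
int second_wait_events(int edge_triggered) {
    int p[2], result = -1;
    char half[4];
    struct epoll_event ev = {
        .events = EPOLLIN | (edge_triggered ? EPOLLET : 0)
    };
    struct epoll_event out;

    if (pipe(p) < 0) return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        ev.data.fd = p[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            write(p[1], "12345678", 8) == 8 &&
            epoll_wait(epfd, &out, 1, 1000) == 1 &&   // first notification
            read(p[0], half, sizeof(half)) == 4)      // drain only half
            result = epoll_wait(epfd, &out, 1, 0);    // 4 bytes still queued
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return result;
}
```

In level‑triggered mode the second wait reports the fd again because data is still pending; in edge‑triggered mode it reports nothing, and the remaining 4 bytes stay in the pipe until another write produces a new edge.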
Performance Benchmarks
Microbenchmark methodology
- Hardware – Dual‑socket AMD EPYC 7742 (128 threads), 256 GB DDR4, 2 TB NVMe (PCIe 4.0).
- Software – Ubuntu 24.04, Linux 6.6.9, GCC 13.2, liburing 2.5.
- Workloads – (1) 10 k concurrent TCP echo connections, (2) 4 GB sequential file read, (3) 1 M small message publish/consume via Redpanda.
- Metrics – average latency, 99th‑percentile, CPU utilization, syscalls per second.
Results
| Workload | epoll avg latency | io_uring avg latency | Throughput | CPU idle % |
|---|---|---|---|---|
| TCP echo (10 k) | 8.2 µs | 5.1 µs | 2.3 M | 12 % |
| File read (4 GB) | 12.4 µs | 4.3 µs | 1.9 GB/s | 8 % |
| Msg broker (1 M) | 30.1 µs | 12.4 µs | 1.2 M msgs/s | 5 % |
Key observations:
- Batching matters – When we submit 64 SQEs at once, io_uring’s per‑op overhead drops below 0.2 µs, a regime where epoll cannot compete.
- CPU affinity – Pinning the kernel poll thread (IORING_SETUP_SQPOLL) to a dedicated core improves throughput by ~7 % for the file‑read benchmark.
- Memory pressure – Fixed buffers reduce page‑faults; on a memory‑constrained VM, the advantage narrows but remains positive.
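Pinning the poll thread is a matter of setup parameters. The sketch below shows one way to request it with the raw system call; sqpoll_params and setup_sqpoll_ring are hypothetical names, and the CPU number and idle timeout are illustrative values rather than the settings used in the benchmark. On older kernels SQPOLL requires elevated privileges, so the setup call may fail with a negative errno.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Builds setup parameters for a kernel submission-poll thread pinned to
// `cpu` that goes to sleep after `idle_ms` milliseconds without work.
struct io_uring_params sqpoll_params(unsigned cpu, unsigned idle_ms) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL   // kernel thread polls the SQ
            | IORING_SETUP_SQ_AFF;  // honour sq_thread_cpu below
    p.sq_thread_cpu = cpu;
    p.sq_thread_idle = idle_ms;
    return p;
}

// Creates the ring; returns the ring fd, or -errno on failure.
int setup_sqpoll_ring(unsigned entries, unsigned cpu, unsigned idle_ms) {
    struct io_uring_params p = sqpoll_params(cpu, idle_ms);
    int fd = (int)syscall(__NR_io_uring_setup, entries, &p);
    return fd < 0 ? -errno : fd;
}
```

With liburing the same struct is passed to io_uring_queue_init_params. Once the poll thread has gone idle, the application must wake it with io_uring_enter and IORING_ENTER_SQ_WAKEUP, which liburing's io_uring_submit handles automatically.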
When io_uring shines, when epoll is still fine
- Batch‑oriented storage – Large sequential reads/writes, compaction, or bulk network transfers benefit from io_uring’s zero‑copy and reduced syscalls.
- Latency‑critical, low‑concurrency services – Simple HTTP APIs handling a few hundred connections per core can stay on epoll; the added complexity of io_uring may not be justified.
- Legacy codebases – If a project already uses epoll‑based state machines with mature error handling, a gradual migration (e.g., hybrid accept loop with epoll, data path with io_uring) often yields the best ROI.
Key Takeaways
- System‑call reduction is the primary performance win: io_uring batches submissions, cutting the per‑operation kernel transition from ~150 ns to <20 ns.
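- Zero‑copy via registered buffers can halve network latency for large payloads, but requires careful memory management and a recent kernel: registered buffers date back to 5.1, while zero‑copy sends (IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC) need 6.0 and 6.1 respectively.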
- Edge‑triggered epoll remains a low‑overhead, battle‑tested choice for high‑connection‑count servers; its simplicity still outweighs io_uring for many microservice patterns.
- Hybrid architectures (epoll for connection acceptance, io_uring for bulk I/O) provide the best of both worlds and are already used in production projects like NGINX and Redpanda.
- Future‑proofing: keep an eye on kernel releases; features such as eventfd notification (IORING_REGISTER_EVENTFD) and IORING_OP_PROVIDE_BUFFERS are maturing and will further simplify integration.