Introduction
System design is the craft of turning ideas into resilient, scalable, and cost‑effective products. It spans protocols, storage engines, compute orchestration, observability, and more. This deep, zero‑to‑hero tutorial curates the top 50 technologies you should know—organized by category—with concise explanations, practical tips, code samples, and a learning path. Whether you’re preparing for interviews or architecting large‑scale systems, use this guide as your roadmap.
Note: You don’t have to master everything at once. Build a foundation, then layer on technologies as your use cases demand.
How to use this guide
- Skim categories to identify your current gaps.
- For each technology, learn the “why,” then practice the “how” with the resource links at the end.
- Combine 1–2 tools per category into small projects (e.g., a URL shortener, event pipeline, or analytics service).
- Revisit with real workloads—system design only “clicks” under load, failure, and evolving requirements.
Table of contents
- Foundations of system design
- Networking & delivery
- Compute & orchestration
- Data & storage
- Caching & in‑memory
- Messaging & streaming
- Search & indexing
- Coordination & service discovery
- Observability
- Security & identity
- Infrastructure & delivery
- Zero‑to‑hero learning path
- Code examples
- Resources (official links)
Foundations of system design
Before the tools, master these principles:
- SLAs/SLOs/SLIs, latency budgets, and tail latencies (p99+)
- Throughput, backpressure, queueing, and flow control
- CAP theorem, PACELC, consistency models, idempotency
- Partitioning (sharding), replication, quorum, and leader election (see the quorum sketch after this list)
- Caching strategies: write‑through, write‑back, TTL, cache stampede protections
- Data modeling for transactional vs analytical workloads
- Failure domains, blast radius, and graceful degradation
- Cost awareness: compute, egress, storage, and ops overhead
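To make one of these concrete, here is a minimal sketch of the quorum rule referenced above, in plain Python (the function name is mine). The point is the arithmetic: R + W > N guarantees every read quorum overlaps every write quorum, so a read always touches at least one replica holding the latest write:

# Quorum intersection for N replicas: R + W > N means read and write
# quorums always share at least one replica, so reads see the latest write.
def quorums_intersect(n: int, r: int, w: int) -> bool:
    return r + w > n

# N=3 with R=2, W=2 tolerates one slow or failed replica on each path:
assert quorums_intersect(3, 2, 2)      # consistent reads
assert not quorums_intersect(3, 1, 1)  # stale reads are possible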
Top 50 technologies every system designer should know
Networking & delivery
1) HTTP/2
Multiplexed streams over a single TCP connection; header compression; server push (now largely deprecated). Improves latency and throughput for many web APIs.
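If you want to see the negotiation from the client side, here is a quick sketch using httpx (a third-party Python client, not one of the 50 here), which speaks HTTP/2 when installed with its http2 extra:

# pip install 'httpx[http2]'
import httpx

# One TCP connection; concurrent requests are multiplexed over it.
with httpx.Client(http2=True) as client:
    resp = client.get("https://www.example.com/")
    print(resp.http_version)  # "HTTP/2" when the server negotiates it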
2) HTTP/3 (QUIC)
Runs over UDP, avoiding head‑of‑line blocking at the transport layer; faster connection setup and loss recovery. Critical for mobile and high‑latency networks.
3) TLS 1.3
Modern cryptography with faster handshakes and perfect forward secrecy. Baseline for secure production systems; reduces round trips during setup.
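Python's standard ssl module can pin the floor to TLS 1.3; a small sketch (host is illustrative):

import socket, ssl

ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # refuse anything older

with socket.create_connection(("www.example.com", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="www.example.com") as tls:
        print(tls.version())  # "TLSv1.3"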
4) gRPC
High‑performance RPC framework with code‑generated stubs and deadlines. Great for service‑to‑service calls with strong contracts.
5) Protocol Buffers
Compact, schema‑evolving data format for RPC, storage, and streaming. Reduces payload size and enforces compatibility.
6) Nginx
Battle‑tested reverse proxy and web server. Use it for TLS termination, static hosting, and simple routing.
7) Envoy Proxy
Modern L7 proxy with service discovery, retries, circuit breaking, and observability primitives; core of many meshes and gateways.
8) Kong API Gateway
Enterprise‑ready API gateway with plugins for auth, rate limiting, analytics, and transformations.
9) Cloudflare CDN
Global Anycast CDN, DDoS protection, WAF, and edge compute. Offload origin load and serve content closer to users.
Compute & orchestration
10) Docker
Standard container packaging for reproducible builds and isolated runtime environments.
11) Kubernetes
Declarative orchestration for scaling, self‑healing, service discovery, and rollouts.
12) Helm
Package manager for Kubernetes. Templates complex app deployments with versioning and config overlays.
13) Istio (Service Mesh)
Traffic management, mTLS, retries, timeouts, and observability via sidecars powered by Envoy.
14) AWS Lambda (Serverless)
Event‑driven compute with sub‑second autoscaling; ideal for bursty workloads and glue code.
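The programming model is just a handler function invoked per event; a minimal Python sketch (the event shape shown is illustrative, since it depends on the trigger: API Gateway, SQS, S3, and so on):

import json

def handler(event, context):
    # Lambda passes the triggering event plus a context with runtime metadata.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }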
15) KEDA (K8s Event‑Driven Autoscaling)
Autoscale Kubernetes workloads on external metrics (Kafka lag, queue depth, Prometheus queries).
Data & storage
16) PostgreSQL
Feature‑rich relational database with strong consistency, extensions (PostGIS, Timescale), and robust indexing options.
17) MySQL
Widely used relational database with great replication and tooling; excellent for OLTP.
18) MongoDB
Document database with flexible schema and rich query features; good for JSON‑heavy domains.
19) DynamoDB
Managed, serverless key‑value and document store with single‑digit millisecond latency at scale.
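With boto3 (the AWS SDK for Python), the key-value model looks like this; the table name and attributes are illustrative, and credentials/region are assumed to be configured:

# pip install boto3
import boto3

table = boto3.resource("dynamodb").Table("users")  # hypothetical table

# Writes and reads are addressed purely by key.
table.put_item(Item={"user_id": "u-42", "email": "ada@example.com"})
item = table.get_item(Key={"user_id": "u-42"}).get("Item")
print(item)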
20) Cassandra
Wide‑column database built for write‑heavy, multi‑region workloads with tunable consistency.
21) ClickHouse
Blazing‑fast columnar OLAP engine for analytics, logs, and real‑time dashboards.
22) Neo4j
Graph database for relationship‑intensive workloads: recommendations, fraud, knowledge graphs.
23) TimescaleDB
Time‑series extension for Postgres; hypertables, compression, and continuous aggregates.
24) InfluxDB
Purpose‑built time‑series DB with retention policies and downsampling; popular for metrics and IoT.
25) Amazon S3
Durable object storage with lifecycle management; backbone for data lakes, backups, and static hosting.
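A common pattern is handing clients a short-lived presigned URL instead of proxying bytes through your API; a boto3 sketch (bucket and key are illustrative):

# pip install boto3
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "acme-prod-logs", "Key": "2024/01/app.log"},  # illustrative
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)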
Caching & in‑memory
26) Redis
In‑memory data structure store for caching, rate limiting, leaderboards, streams, and locks.
27) Memcached
Lightweight, distributed in‑memory cache for simple key/value with ultra‑low latency.
Messaging & streaming
28) Apache Kafka
Distributed commit log for high‑throughput event streaming, durable pub/sub, and stream processing.
29) RabbitMQ
Feature‑rich AMQP broker with routing patterns (topic, fanout), dead‑lettering, and acknowledgments.
30) NATS
Simple, high‑performance messaging with request/reply, JetStream for persistence, and low operational burden.
31) Apache Pulsar
Segmented storage + compute separation; geo‑replication and multi‑tenant streaming with functions.
32) AWS SQS
Fully managed queue with at‑least‑once delivery, dead‑letter queues, and serverless‑friendly semantics.
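The consume loop is receive, process, then delete; crash before the delete and the message reappears, which is exactly the at-least-once contract. A boto3 sketch (queue URL and handler are illustrative):

# pip install boto3
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/payments"  # illustrative

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        handle(msg["Body"])  # your handler; make it idempotent
        # Delete only after successful processing -- that's at-least-once delivery.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])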
33) Google Pub/Sub
Global, managed pub/sub with push/pull delivery and simple autoscaling.
Search & indexing
34) Elasticsearch
Distributed search and analytics engine with inverted indices, aggregations, and near real‑time updates.
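Index a document, then search it; a sketch with the official Python client (index name and document are illustrative):

# pip install elasticsearch
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id="1", document={"title": "Designing data systems"})
es.indices.refresh(index="articles")  # make it searchable immediately

hits = es.search(index="articles", query={"match": {"title": "data"}})
print(hits["hits"]["total"])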
Coordination & service discovery
35) etcd
Consistent, highly available key‑value store used by Kubernetes; great for configuration and leader election.
36) Apache ZooKeeper
Classic coordination service powering distributed locks, config, and naming.
37) Consul
Service discovery, config KV, health checks, and service mesh integrations.
Observability
38) Prometheus
Time‑series metrics and alerting with pull‑based scraping and PromQL.
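Exposing metrics for Prometheus to scrape takes a few lines with the official Python client; the metric names here are illustrative:

# pip install prometheus-client
import random, time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests")
LATENCY = Histogram("http_request_duration_seconds", "Request latency")

start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics

while True:
    with LATENCY.time():  # observe how long the block takes
        time.sleep(random.random() / 10)
    REQUESTS.inc()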
39) Grafana
Visualization platform for dashboards across metrics, logs, and traces.
40) OpenTelemetry
Open standard for tracing, metrics, and logs instrumentation across languages.
41) Jaeger
Distributed tracing system to analyze latencies, dependencies, and bottlenecks.
Security & identity
42) HashiCorp Vault
Secrets management, encryption as a service, dynamic credentials, and PKI.
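Reading a secret with hvac (a community Python client) looks like this; the path and mount point are illustrative and assume the KV v2 secrets engine:

# pip install hvac
import hvac

client = hvac.Client(url="http://127.0.0.1:8200", token="dev-only-token")

secret = client.secrets.kv.v2.read_secret_version(path="myapp/db", mount_point="secret")
print(secret["data"]["data"]["password"])  # KV v2 nests the payload under data.data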
43) Keycloak
Open‑source IAM with OAuth2/OIDC/SAML, federation, and user management.
Infrastructure & delivery
44) Terraform
Declarative IaC across clouds with a large provider ecosystem and plan/apply workflows.
45) AWS CloudFormation
Native AWS IaC with deep service integration and drift detection.
46) GitHub Actions
CI/CD pipelines tightly integrated with your repo; great for building, testing, and deploying.
47) Argo CD
GitOps for Kubernetes: declarative deployments, drift detection, and rollbacks.
48) Ansible
Agentless configuration management and orchestration using YAML playbooks.
49) Docker Compose
Define multi‑container local environments; ideal for dev/test and lightweight demos.
50) Packer
Automate machine images for multiple platforms (AMI, Azure, GCE) with a single template.
Zero‑to‑hero learning path
- Foundations (weeks 1–2)
  - Read CAP/PACELC; practice SLIs/SLOs; revisit HTTP/TLS basics.
  - Build: Nginx reverse proxy in front of a simple API.
- Data and caching (weeks 3–4)
  - Model relational vs document workloads; add Redis caching with cache‑aside.
  - Build: Postgres‑backed API with indexes and basic Redis caching.
- Async and scale (weeks 5–6)
  - Add Kafka or RabbitMQ for background processing.
  - Build: Event‑driven pipeline (ingest → process → store → query).
- Operate and observe (weeks 7–8)
  - Containerize with Docker, deploy to Kubernetes (Helm charts).
  - Add Prometheus/Grafana, instrument with OpenTelemetry and Jaeger.
- Secure and ship (weeks 9–10)
  - Introduce Vault for secrets and Keycloak for auth/OIDC.
  - Use Terraform for infra, GitHub Actions + Argo CD for delivery.
- Advanced (ongoing)
  - Try ClickHouse for analytics, KEDA for autoscaling, and multi‑region patterns with S3/Cassandra.
Code examples (bite‑size)
gRPC with Protocol Buffers (hello world)
Proto definition:
syntax = "proto3";
package hello;
service Greeter {
  rpc SayHello (HelloRequest) returns (HelloReply) {}
}
message HelloRequest { string name = 1; }
message HelloReply { string message = 1; }
Minimal Python server:
# pip install grpcio grpcio-tools
import grpc
from concurrent import futures
import hello_pb2, hello_pb2_grpc
class Greeter(hello_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        return hello_pb2.HelloReply(message=f"Hello, {request.name}!")
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
hello_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
server.add_insecure_port("[::]:50051")
server.start(); server.wait_for_termination()
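A matching client sketch; the timeout= argument is how gRPC deadlines surface in Python:

import grpc
import hello_pb2, hello_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = hello_pb2_grpc.GreeterStub(channel)

# Deadlines propagate server-side; the call fails fast instead of hanging.
reply = stub.SayHello(hello_pb2.HelloRequest(name="Ada"), timeout=2.0)
print(reply.message)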
Redis sliding‑window rate limiter (Python)
# pip install redis
import time, redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def allow(user_id, rate=5, per=1):  # at most 5 requests per rolling second
    key = f"bucket:{user_id}"
    now = time.time()
    with r.pipeline() as p:
        p.zremrangebyscore(key, 0, now - per)  # evict timestamps outside the window
        p.zcard(key)                           # count requests still in the window
        p.expire(key, per)                     # let idle buckets expire
        removed, count, _ = p.execute()
    if count < rate:
        r.zadd(key, {str(now): now})
        return True
    return False
Postgres indexing for hot queries
-- Fast lookups by email (CONCURRENTLY builds without blocking writes; run it outside a transaction)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users (email);

-- Composite index for common filter/sort
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_created
  ON orders (user_id, created_at DESC);
Kubernetes Deployment + HPA
apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 2
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/api:1.0.0
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api-hpa }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
Terraform: S3 bucket (versioned, private)
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
}

provider "aws" { region = "us-east-1" }

resource "aws_s3_bucket" "logs" {
  bucket = "acme-prod-logs"
}

resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_public_access_block" "logs" {
  bucket                  = aws_s3_bucket.logs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Kafka consumer (Python)
# pip install confluent-kafka
from confluent_kafka import Consumer

c = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'payments-workers',
    'auto.offset.reset': 'earliest'
})
c.subscribe(['payments'])

try:
    while True:
        msg = c.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("Error:", msg.error())
            continue
        process_payment(msg.value())  # your handler: decode and act on the raw message bytes
finally:
    c.close()
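For the other side of the topic, a minimal producer sketch with the same library; the delivery callback reports the broker's verdict for each write:

# pip install confluent-kafka
from confluent_kafka import Producer

p = Producer({'bootstrap.servers': 'localhost:9092'})

def on_delivery(err, msg):
    # Invoked from p.poll()/p.flush() once the broker acks (or rejects) the message.
    if err is not None:
        print("Delivery failed:", err)

p.produce('payments', value=b'{"order_id": 42}', callback=on_delivery)
p.flush()  # block until all queued messages are delivered or fail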
Nginx reverse proxy with timeouts and retries
events {}
http {
  upstream api_upstream {
    server api1:8080 max_fails=3 fail_timeout=10s;
    server api2:8080 max_fails=3 fail_timeout=10s;
  }
  server {
    listen 443 ssl http2;
    server_name api.example.com;
    ssl_certificate /etc/ssl/fullchain.pem;
    ssl_certificate_key /etc/ssl/privkey.pem;
    location / {
      proxy_pass http://api_upstream;
      proxy_connect_timeout 5s;
      proxy_send_timeout 10s;
      proxy_read_timeout 10s;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
    }
  }
}
OpenTelemetry instrumentation (Node.js)
// npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'payment-service',
  // service.version and other attributes can be attached via a Resource if needed
});
sdk.start();

// Your app code here -- required after sdk.start() so auto-instrumentation can patch express
const express = require('express');
const app = express();

app.get('/payment/:id', async (req, res) => {
  // Automatically traced!
  const payment = await getPayment(req.params.id); // getPayment: your own data-access function
  res.json(payment);
});

app.listen(3000);
Terraform: EKS cluster with managed node group
module "eks" {
source = "terraform-aws-modules/eks/aws"
cluster_name = "production"
cluster_version = "1.29"
subnets = module.vpc.private_subnets
vpc_id = module.vpc.vpc_id
node_groups = {
main = {
desired_capacity = 3
max_capacity = 10
min_capacity = 3
instance_types = ["t3.large"]
k8s_labels = {
Environment = "production"
NodeGroup = "main"
}
}
}
}
Docker Compose: Multi-service development environment
version: '3.8'
services:
  api:
    build: .
    ports: ["3000:3000"]
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/app
      - REDIS_URL=redis://redis:6379
    depends_on: [db, redis]
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    volumes: ["postgres_data:/var/lib/postgresql/data"]
  redis:
    image: redis:7-alpine
  worker:
    build: .
    command: node worker.js
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/app
    depends_on: [db]
volumes:
  postgres_data:
Resources (official links)
Networking & delivery
- HTTP/2 Specification
- HTTP/3 (QUIC) Specification
- TLS 1.3 Specification
- gRPC Documentation
- Protocol Buffers
- Nginx Official Docs
- Envoy Proxy
- Kong API Gateway
- Cloudflare CDN
Compute & orchestration
- Docker Documentation
- Kubernetes Documentation
- Helm Documentation
- Istio Documentation
- AWS Lambda
- KEDA Documentation
Data & storage
- PostgreSQL Documentation
- MySQL Documentation
- MongoDB Documentation
- DynamoDB Developer Guide
- Apache Cassandra
- ClickHouse Documentation
- Neo4j Documentation
- TimescaleDB Documentation
- InfluxDB Documentation
- Amazon S3 Documentation
Caching & in-memory
- Redis Documentation
- Memcached Documentation
Messaging & streaming
- Apache Kafka Documentation
- RabbitMQ Documentation
- NATS Documentation
- Apache Pulsar Documentation
- AWS SQS Documentation
- Google Pub/Sub Documentation
Search & indexing
- Elasticsearch Documentation
Coordination & service discovery
- etcd Documentation
- Apache ZooKeeper Documentation
- Consul Documentation
Observability
- Prometheus Documentation
- Grafana Documentation
- OpenTelemetry Documentation
- Jaeger Documentation
Security & identity
- HashiCorp Vault Documentation
- Keycloak Documentation
Infrastructure & delivery
- Terraform Documentation
- AWS CloudFormation
- GitHub Actions Documentation
- Argo CD Documentation
- Ansible Documentation
- Docker Compose Documentation
- Packer Documentation
Conclusion
System design mastery is a journey, not a destination. The 50 technologies covered here represent the essential toolkit for modern distributed systems, but remember:
- Focus on patterns, not just tools: understanding when to use eventual consistency vs strong consistency matters more than knowing Cassandra’s specific configuration.
- Build incrementally: start with a simple PostgreSQL + Redis + Nginx stack, then add complexity as requirements demand.
- Measure everything: without proper observability (Prometheus + Grafana + OpenTelemetry), you’re flying blind.
- Security by design: implement TLS 1.3, proper authentication, and secrets management from day one.
- Cost awareness: every architectural decision has cost implications. Monitor and optimize continuously.
The best system designers combine deep technical knowledge with practical experience. Use this guide as your roadmap, but spend most of your time building, measuring, and iterating on real systems.
Happy building!