Introduction

System design is the craft of turning ideas into resilient, scalable, and cost‑effective products. It spans protocols, storage engines, compute orchestration, observability, and more. This deep, zero‑to‑hero tutorial curates the top 50 technologies you should know—organized by category—with concise explanations, practical tips, code samples, and a learning path. Whether you’re preparing for interviews or architecting large‑scale systems, use this guide as your roadmap.

Note: You don’t have to master everything at once. Build a foundation, then layer on technologies as your use cases demand.

How to use this guide

  • Skim categories to identify your current gaps.
  • For each technology, learn the “why,” then practice the “how” against each project’s official documentation.
  • Combine 1–2 tools per category into small projects (e.g., a URL shortener, event pipeline, or analytics service).
  • Revisit with real workloads—system design only “clicks” under load, failure, and evolving requirements.

Table of contents

  • Foundations of system design
  • Networking & delivery
  • Compute & orchestration
  • Data & storage
  • Caching & in‑memory
  • Messaging & streaming
  • Search & indexing
  • Coordination & service discovery
  • Observability
  • Security & identity
  • Infrastructure & delivery
  • Zero‑to‑hero learning path
  • Code examples

Foundations of system design

Before the tools, master these principles:

  • SLAs/SLOs/SLIs, latency budgets, and tail latencies (p99+)
  • Throughput, backpressure, queueing, and flow control
  • CAP theorem, PACELC, consistency models, idempotency
  • Partitioning (sharding), replication, quorum, and leader election
  • Caching strategies: write‑through, write‑back, TTL, cache stampede protections
  • Data modeling for transactional vs analytical workloads
  • Failure domains, blast radius, and graceful degradation
  • Cost awareness: compute, egress, storage, and ops overhead
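
These principles click faster with a concrete sketch. Below is a minimal consistent-hash ring in Python — one common way to implement the partitioning bullet above — using virtual nodes to smooth key distribution. Class and parameter names are illustrative, not from any particular library:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to nodes so adding/removing a node only remaps ~1/N of keys."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes smooth out the key distribution
        self.ring = []        # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node only reassigns the keys that mapped to it; every other key keeps its owner — the property that makes this useful for sharding caches and databases.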

Top 50 technologies every system designer should know

Networking & delivery

1) HTTP/2

Multiplexed streams over a single TCP connection; header compression; server push (now largely deprecated). Improves latency and throughput for many web APIs.

2) HTTP/3 (QUIC)

Runs over UDP, avoiding head‑of‑line blocking at the transport layer; faster connection setup and loss recovery. Critical for mobile and high‑latency networks.

3) TLS 1.3

Modern cryptography with faster handshakes and perfect forward secrecy. Baseline for secure production systems; reduces round trips during setup.

4) gRPC

High‑performance RPC framework with code‑generated stubs and deadlines. Great for service‑to‑service calls with strong contracts.

5) Protocol Buffers

Compact, schema‑evolving data format for RPC, storage, and streaming. Reduces payload size and enforces compatibility.

6) Nginx

Battle‑tested reverse proxy and web server. Use it for TLS termination, static hosting, and simple routing.

7) Envoy Proxy

Modern L7 proxy with service discovery, retries, circuit breaking, and observability primitives; core of many meshes and gateways.
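
The retry policies proxies like Envoy apply can be sketched in a few lines. Here is an illustrative exponential-backoff-with-full-jitter helper in Python — the function name and defaults are assumptions for the sketch, not Envoy's API:

```python
import random
import time

def retry(fn, attempts=4, base=0.1, cap=2.0, sleep=time.sleep):
    """Call fn(); on failure wait up to base * 2**i seconds, then retry."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the error
            backoff = min(cap, base * (2 ** i))
            # Full jitter spreads retries out and avoids thundering herds.
            sleep(random.uniform(0, backoff))
```

The `sleep` parameter is injected so the policy can be unit-tested without real waiting — the same reason proxies expose retry budgets as configuration.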

8) Kong API Gateway

Enterprise‑ready API gateway with plugins for auth, rate limiting, analytics, and transformations.

9) Cloudflare CDN

Global Anycast CDN, DDoS protection, WAF, and edge compute. Offload origin load and serve content closer to users.

Compute & orchestration

10) Docker

Standard container packaging for reproducible builds and isolated runtime environments.

11) Kubernetes

Declarative orchestration for scaling, self‑healing, service discovery, and rollouts.

12) Helm

Package manager for Kubernetes. Templates complex app deployments with versioning and config overlays.

13) Istio (Service Mesh)

Traffic management, mTLS, retries, timeouts, and observability via sidecars powered by Envoy.

14) AWS Lambda (Serverless)

Event‑driven compute with sub‑second autoscaling; ideal for bursty workloads and glue code.
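
A Lambda handler is just a function that receives an event and returns a response, which makes it easy to test locally. The sketch below assumes the API Gateway proxy event shape; other triggers (SQS, S3) deliver different event structures:

```python
import json

def handler(event, context):
    """Minimal API-Gateway-style handler: return a JSON greeting."""
    # queryStringParameters may be absent or null, so default defensively.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

Because the handler is a plain function, you can invoke it directly in tests with a dict event and `None` context before deploying.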

15) KEDA (K8s Event‑Driven Autoscaling)

Autoscale Kubernetes workloads on external metrics (Kafka lag, queue depth, Prometheus queries).

Data & storage

16) PostgreSQL

Feature‑rich relational database with strong consistency, extensions (PostGIS, Timescale), and robust indexing options.

17) MySQL

Widely used relational database with great replication and tooling; excellent for OLTP.

18) MongoDB

Document database with flexible schema and rich query features; good for JSON‑heavy domains.

19) DynamoDB

Managed, serverless key‑value and document store with single‑digit millisecond latency at scale.

20) Cassandra

Wide‑column database built for write‑heavy, multi‑region workloads with tunable consistency.
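
Tunable consistency comes down to simple arithmetic: with N replicas, acking writes at W and reading from R guarantees every read overlaps the latest write whenever W + R > N. A tiny sketch of the rule:

```python
def quorum(n):
    """Smallest majority of n replicas (e.g. Cassandra's QUORUM level)."""
    return n // 2 + 1

def overlaps(n, w, r):
    """True if any R-replica read must intersect any W-acked write."""
    return w + r > n

# With N = 3: QUORUM writes + QUORUM reads (2 + 2 > 3) give read-your-writes;
# ONE/ONE (1 + 1 <= 3) is faster but can return stale data.
```

This is the trade-off you are tuning per query: lower W/R for latency and availability, higher W/R for freshness.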

21) ClickHouse

Blazing‑fast columnar OLAP engine for analytics, logs, and real‑time dashboards.

22) Neo4j

Graph database for relationship‑intensive workloads: recommendations, fraud, knowledge graphs.

23) TimescaleDB

Time‑series extension for Postgres; hypertables, compression, and continuous aggregates.

24) InfluxDB

Purpose‑built time‑series DB with retention policies and downsampling; popular for metrics and IoT.

25) Amazon S3

Durable object storage with lifecycle management; backbone for data lakes, backups, and static hosting.

Caching & in‑memory

26) Redis

In‑memory data structure store for caching, rate limiting, leaderboards, streams, and locks.

27) Memcached

Lightweight, distributed in‑memory cache for simple key/value with ultra‑low latency.

Messaging & streaming

28) Apache Kafka

Distributed commit log for high‑throughput event streaming, durable pub/sub, and stream processing.

29) RabbitMQ

Feature‑rich AMQP broker with routing patterns (topic, fanout), dead‑lettering, and acknowledgments.

30) NATS

Simple, high‑performance messaging with request/reply, JetStream for persistence, and low operational burden.

31) Apache Pulsar

Segmented storage + compute separation; geo‑replication and multi‑tenant streaming with functions.

32) AWS SQS

Fully managed queue with at‑least‑once delivery, dead‑letter queues, and serverless‑friendly semantics.
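
At-least-once delivery means your consumer must tolerate duplicates. A minimal idempotent-consumer sketch — in production you would back the seen-set with a durable store (a database unique constraint, or Redis `SETNX` with a TTL) rather than process memory:

```python
class IdempotentConsumer:
    """Apply each message's side effects at most once despite redeliveries."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # message ids already processed (durable store in prod)

    def consume(self, message_id, body):
        if message_id in self.seen:
            return False          # duplicate delivery: skip side effects
        self.handler(body)
        self.seen.add(message_id)  # mark done only after the handler succeeds
        return True
```

Marking the id only after the handler succeeds means a crash mid-processing leads to a retry, not a lost message — the usual at-least-once contract.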

33) Google Pub/Sub

Global, managed pub/sub with push/pull delivery and simple autoscaling.

Search & indexing

34) Elasticsearch

Distributed search and analytics engine with inverted indices, aggregations, and near real‑time updates.
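
The core data structure behind engines like Elasticsearch is the inverted index: a map from token to the set of documents containing it. A toy in-memory version, simplified to whitespace tokenization and AND queries:

```python
from collections import defaultdict

class InvertedIndex:
    """Tiny inverted index: token -> set of matching document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query):
        """AND-match: documents containing every query token."""
        sets = [self.postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()
```

Real engines add analyzers, scoring (BM25), and segment merging on top, but lookups still reduce to intersecting postings lists like this.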

Coordination & service discovery

35) etcd

Consistent, highly available key‑value store used by Kubernetes; great for configuration and leader election.

36) Apache ZooKeeper

Classic coordination service powering distributed locks, config, and naming.

37) Consul

Service discovery, config KV, health checks, and service mesh integrations.

Observability

38) Prometheus

Time‑series metrics and alerting with pull‑based scraping and PromQL.

39) Grafana

Visualization platform for dashboards across metrics, logs, and traces.

40) OpenTelemetry

Open standard for tracing, metrics, and logs instrumentation across languages.

41) Jaeger

Distributed tracing system to analyze latencies, dependencies, and bottlenecks.

Security & identity

42) HashiCorp Vault

Secrets management, encryption as a service, dynamic credentials, and PKI.

43) Keycloak

Open‑source IAM with OAuth2/OIDC/SAML, federation, and user management.

Infrastructure & delivery

44) Terraform

Declarative IaC across clouds with a large provider ecosystem and plan/apply workflows.

45) AWS CloudFormation

Native AWS IaC with deep service integration and drift detection.

46) GitHub Actions

CI/CD pipelines tightly integrated with your repo; great for building, testing, and deploying.

47) Argo CD

GitOps for Kubernetes: declarative deployments, drift detection, and rollbacks.

48) Ansible

Agentless configuration management and orchestration using YAML playbooks.

49) Docker Compose

Define multi‑container local environments; ideal for dev/test and lightweight demos.

50) Packer

Automate machine images for multiple platforms (AMI, Azure, GCE) with a single template.

Zero‑to‑hero learning path

  1. Foundations (weeks 1–2)
  • Read CAP/PACELC; practice SLIs/SLOs; revisit HTTP/TLS basics.
  • Build: Nginx reverse proxy in front of a simple API.
  2. Data and caching (weeks 3–4)
  • Model relational vs document workloads; add Redis caching with cache‑aside.
  • Build: Postgres‑backed API with indexes and basic Redis caching.
  3. Async and scale (weeks 5–6)
  • Add Kafka or RabbitMQ for background processing.
  • Build: Event‑driven pipeline (ingest → process → store → query).
  4. Operate and observe (weeks 7–8)
  • Containerize with Docker, deploy to Kubernetes (Helm charts).
  • Add Prometheus/Grafana, instrument with OpenTelemetry and Jaeger.
  5. Secure and ship (weeks 9–10)
  • Introduce Vault for secrets and Keycloak for auth/OIDC.
  • Use Terraform for infra, GitHub Actions + Argo CD for delivery.
  6. Advanced (ongoing)
  • Try ClickHouse for analytics, KEDA for autoscaling, and multi‑region patterns with S3/Cassandra.

Code examples (bite‑size)

gRPC with Protocol Buffers (hello world)

Proto definition:

syntax = "proto3";

package hello;

service Greeter {
  rpc SayHello (HelloRequest) returns (HelloReply) {}
}

message HelloRequest { string name = 1; }
message HelloReply   { string message = 1; }

Minimal Python server:

# pip install grpcio grpcio-tools
import grpc
from concurrent import futures
import hello_pb2, hello_pb2_grpc

class Greeter(hello_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        return hello_pb2.HelloReply(message=f"Hello, {request.name}!")

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
hello_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
server.add_insecure_port("[::]:50051")
server.start(); server.wait_for_termination()

Redis sliding-window rate limiter (Python)

# pip install redis
import time, redis
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def allow(user_id, rate=5, per=1):  # at most 5 requests per 1-second window
    key = f"ratelimit:{user_id}"
    now = time.time()
    with r.pipeline() as p:
        p.zremrangebyscore(key, 0, now - per)  # drop entries older than the window
        p.zcard(key)                           # count requests still inside it
        p.expire(key, per)
        removed, count, _ = p.execute()
    if count < rate:
        # Note: this check-then-add is not atomic; use a Lua script for strict limits.
        r.zadd(key, {str(now): now})
        return True
    return False
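
The Redis snippet above is a sliding-window limiter (it counts timestamps in a moving window). A classic token bucket instead refills continuously and permits bursts up to its capacity. An in-process sketch, with the clock injected for testability — names are illustrative:

```python
class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `rate`/second."""

    def __init__(self, rate, capacity, clock):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full: allow an initial burst
        self.clock = clock      # injected clock (e.g. time.monotonic) for testing
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The behavioral difference: a sliding window enforces "no more than N per window," while a bucket enforces a long-run rate but tolerates short bursts — pick per endpoint.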

Postgres indexing for hot queries

-- Fast lookups by email (CONCURRENTLY avoids blocking writes,
-- but cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_email ON users (email);

-- Composite index for a common filter + sort pattern
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_created
ON orders (user_id, created_at DESC);

Kubernetes Deployment + HPA

apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 2
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
      - name: api
        image: ghcr.io/acme/api:1.0.0
        resources:
          requests: { cpu: "250m", memory: "256Mi" }
          limits:   { cpu: "500m", memory: "512Mi" }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api-hpa }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }

Terraform: S3 bucket (versioned, private)

terraform {
  required_providers { aws = { source = "hashicorp/aws", version = "~> 5.0" } }
}
provider "aws" { region = "us-east-1" }

resource "aws_s3_bucket" "logs" {
  bucket = "acme-prod-logs"
}

resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_public_access_block" "logs" {
  bucket                  = aws_s3_bucket.logs.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Kafka consumer (Python)

# pip install confluent-kafka
from confluent_kafka import Consumer

def process_payment(payload: bytes):
    print("processing", payload)  # placeholder: replace with real handling

c = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'payments-workers',
    'auto.offset.reset': 'earliest'
})
c.subscribe(['payments'])

try:
    while True:
        msg = c.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("Error:", msg.error())
            continue
        process_payment(msg.value())
finally:
    c.close()

Nginx reverse proxy with timeouts and retries

events {}
http {
  upstream api_upstream {
    server api1:8080 max_fails=3 fail_timeout=10s;
    server api2:8080 max_fails=3 fail_timeout=10s;
  }
  server {
    listen 443 ssl;
    http2 on;  # preferred over the deprecated "listen ... http2" on nginx >= 1.25
    server_name api.example.com;
    ssl_certificate     /etc/ssl/fullchain.pem;
    ssl_certificate_key /etc/ssl/privkey.pem;
    
    location / {
      proxy_pass http://api_upstream;
      proxy_connect_timeout 5s;
      proxy_send_timeout 10s;
      proxy_read_timeout 10s;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
    }
  }
}

OpenTelemetry instrumentation (Node.js)

// npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'payment-service',
  serviceVersion: '1.0.0',
});

sdk.start();

// Your app code here
const express = require('express');
const app = express();

app.get('/payment/:id', async (req, res) => {
  // Automatically traced!
  const payment = await getPayment(req.params.id);
  res.json(payment);
});

app.listen(3000);

Terraform: EKS cluster with managed node group

# Input names below follow terraform-aws-modules/eks v20; older module
# versions used subnets/node_groups instead.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "production"
  cluster_version = "1.29"
  subnet_ids      = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id

  eks_managed_node_groups = {
    main = {
      min_size     = 3
      max_size     = 10
      desired_size = 3

      instance_types = ["t3.large"]
      labels = {
        Environment = "production"
        NodeGroup   = "main"
      }
    }
  }
}

Docker Compose: Multi-service development environment

services:
  api:
    build: .
    ports: ["3000:3000"]
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/app
      - REDIS_URL=redis://redis:6379
    depends_on: [db, redis]
    
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
    volumes: ["postgres_data:/var/lib/postgresql/data"]
    
  redis:
    image: redis:7-alpine
    
  worker:
    build: .
    command: node worker.js
    environment:
      - DATABASE_URL=postgres://user:pass@db:5432/app
    depends_on: [db]

volumes:
  postgres_data:

Conclusion

System design mastery is a journey, not a destination. The 50 technologies covered here represent the essential toolkit for modern distributed systems, but remember:

Focus on patterns, not just tools - Understanding when to use eventual consistency vs strong consistency matters more than knowing Cassandra’s specific configuration.

Build incrementally - Start with a simple PostgreSQL + Redis + Nginx stack, then add complexity as requirements demand.

Measure everything - Without proper observability (Prometheus + Grafana + OpenTelemetry), you’re flying blind.

Security by design - Implement TLS 1.3, proper authentication, and secrets management from day one.

Cost awareness - Every architectural decision has cost implications. Monitor and optimize continuously.

The best system designers combine deep technical knowledge with practical experience. Use this guide as your roadmap, but spend most of your time building, measuring, and iterating on real systems.

Happy building!