Google Zanzibar: The Global Authorization System Powering Billions of Permissions
In the world of massive-scale internet services, managing who can access what is a monumental challenge. Google Zanzibar addresses this head-on as a globally distributed authorization system that handles trillions of access control lists (ACLs) and millions of queries per second while keeping 95th-percentile latency under 10 ms and availability above 99.999%.[2][3] Deployed across services like Google Drive, YouTube, Photos, Calendar, and Maps, Zanzibar ensures consistent, fine-grained permissions for billions of users without compromising speed or reliability.[2][4]
This comprehensive guide dives deep into Zanzibar’s architecture, data model, performance optimizations, real-world applications, and open-source alternatives. Whether you’re a systems engineer designing authorization for your own service or a developer curious about planetary-scale consistency, you’ll find practical insights, examples, and lessons from Google’s implementation.
The Problem Zanzibar Solves: Authorization at Planetary Scale
Modern applications like Google Workspace or YouTube deal with billions of users sharing trillions of objects—documents, videos, calendars, and more. Each object has complex permissions: owners, editors, viewers, group-based access, and nested hierarchies.[2][5]
Traditional approaches fall short:
- Per-service silos: Each app (e.g., Drive vs. Gmail) implementing its own auth leads to inconsistencies and duplication.
- Inter-service chatter: Constant permission checks between services kill performance at scale.
- Staleness risks: Distributed systems propagate changes slowly, risking users seeing data they shouldn’t (or missing data they should).[5]
Zanzibar centralizes this into a uniform data model and query language, providing external consistency—authorization decisions respect the causal order of user actions, even amid rapid ACL changes.[2] It targets <10ms p95 latency, processes 12.4 million checks/sec at peak, and achieves 3ms p50 / 20ms p99 latencies.[3]
Key Requirements: Error-free decisions, ultra-low latency, high availability, and support for filtering queries like “What documents can this user see?”[5]
Core Architecture: Nodes, Caches, and Spanner
Zanzibar is a distributed system with three pillars: nodes for request handling, edge caches for speed, and Spanner for durable storage.[1][3]
Zanzibar Nodes
Each node processes authorization requests and holds shards of data. Requests are consistent-hashed to specific nodes, boosting cache hit rates by concentrating related queries.[3] This deduplicates backend calls—multiple identical requests compute once and fan out results.
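The routing idea above can be illustrated with a small consistent-hash ring. This is a minimal sketch, not Zanzibar's implementation: the node names, the FNV hash, and the 64 virtual points per node are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring. Node names and the number of
// virtual points per node are illustrative assumptions.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> node name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places `replicas` virtual points per node on the ring.
func NewRing(nodes []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// NodeFor returns the node owning the first point at or after the key's hash.
func (r *Ring) NodeFor(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around to the start of the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"node-a", "node-b", "node-c"}, 64)
	key := "doc:123#reader@user:alice"
	// Identical requests always land on the same node, so that node's cache
	// can answer repeats and in-flight duplicates can share one computation.
	fmt.Println(ring.NodeFor(key) == ring.NodeFor(key))
}
```

Because a key's owner depends only on the nearest ring point, removing one node moves only the keys that node owned; every other key keeps its cache locality.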
Nodes use Leopard, an in-memory indexing service for nested relationships. Instead of serial Spanner queries for group hierarchies (e.g., user → team → project), Leopard precomputes subgroups, reducing resolution to one index call.[3]
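To make the precomputation concrete, here is a hedged sketch of flattening nested groups so that a membership test becomes a set intersection instead of a chain of lookups. It is loosely modeled on the two set families the Zanzibar paper describes for Leopard (group to descendant groups, and member to direct groups); the type names, the `Build` function, and the in-memory maps are assumptions for illustration only.

```go
package main

import "fmt"

// Index holds flattened membership data. All names here are illustrative.
type Index struct {
	groupToGroup  map[string]map[string]bool // group -> itself plus all descendant groups
	memberToGroup map[string]map[string]bool // user -> groups the user directly belongs to
}

// Build flattens direct edges. subgroups maps a group to its direct child
// groups; members maps a group to its direct user members.
func Build(subgroups, members map[string][]string) *Index {
	idx := &Index{
		groupToGroup:  map[string]map[string]bool{},
		memberToGroup: map[string]map[string]bool{},
	}
	var visit func(root, g string)
	visit = func(root, g string) {
		if idx.groupToGroup[root][g] {
			return // already recorded; this also guards against cycles
		}
		idx.groupToGroup[root][g] = true
		for _, child := range subgroups[g] {
			visit(root, child)
		}
	}
	addRoot := func(g string) {
		if idx.groupToGroup[g] == nil {
			idx.groupToGroup[g] = map[string]bool{}
			visit(g, g)
		}
	}
	for g, children := range subgroups {
		addRoot(g)
		for _, c := range children {
			addRoot(c)
		}
	}
	for g, users := range members {
		addRoot(g)
		for _, u := range users {
			if idx.memberToGroup[u] == nil {
				idx.memberToGroup[u] = map[string]bool{}
			}
			idx.memberToGroup[u][g] = true
		}
	}
	return idx
}

// IsMember intersects the user's direct groups with the group's flattened
// descendants; no graph walk happens at query time.
func (idx *Index) IsMember(user, group string) bool {
	for g := range idx.memberToGroup[user] {
		if idx.groupToGroup[group][g] {
			return true
		}
	}
	return false
}

func main() {
	subgroups := map[string][]string{"group:project": {"group:team"}, "group:team": {"group:squad"}}
	members := map[string][]string{"group:squad": {"user:alice"}, "group:team": {"user:bob"}}
	idx := Build(subgroups, members)
	fmt.Println(idx.IsMember("user:alice", "group:project")) // alice is in squad, nested under project
	fmt.Println(idx.IsMember("user:bob", "group:squad"))     // bob is in team only, not in its child group
}
```

The trade-off is the one the article notes later: the flattened sets cost memory and must be kept in sync with writes.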
Edge Caches and Replication
- Multi-layer caching: Leopard sits outermost, followed by server-local LRU caches and inter-service RPC caches.[5]
- Global replication: Like a CDN, Zanzibar instances worldwide keep data close to users. Writes to Spanner propagate with strong consistency.[3]
- Bounded staleness: Clients tolerate minimal staleness for cache freshness without breaking consistency.[4]
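Bounded staleness can be pictured as "serve from any snapshot, as long as it is not older than what the client has already seen." The sketch below assumes a logical clock in place of Spanner commit timestamps and an integer token in place of the opaque consistency token that the Zanzibar paper calls a zookie; the `ACL` type and its methods are hypothetical.

```go
package main

import "fmt"

type version struct {
	ts      int64 // logical commit timestamp (stand-in for a Spanner timestamp)
	allowed bool
}

// ACL keeps every version of one (user, object, relation) decision.
type ACL struct {
	clock    int64
	versions []version
}

// Write records a change and returns a token encoding its timestamp.
func (a *ACL) Write(allowed bool) int64 {
	a.clock++
	a.versions = append(a.versions, version{a.clock, allowed})
	return a.clock
}

// CheckAt evaluates at snapshot ts: the latest version with timestamp <= ts.
func (a *ACL) CheckAt(ts int64) bool {
	allowed := false
	for _, v := range a.versions {
		if v.ts <= ts {
			allowed = v.allowed
		}
	}
	return allowed
}

// Check may use a stale cached snapshot, but never one older than the token
// the client presents.
func (a *ACL) Check(cachedTS, token int64) bool {
	ts := cachedTS
	if token > ts {
		ts = token
	}
	return a.CheckAt(ts)
}

func main() {
	acl := &ACL{}
	granted := acl.Write(true)  // Bob is added as a viewer
	revoked := acl.Write(false) // Bob is removed; new content is stored with this token

	fmt.Println(acl.Check(granted, 0))       // no token: the stale snapshot is acceptable
	fmt.Println(acl.Check(granted, revoked)) // token forces a snapshot that includes the removal
}
```

The second check is the case that matters: content saved after the revocation carries a newer token, so a cached pre-revocation snapshot cannot be used to authorize access to it.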
The Datastore: Google’s Spanner
At the heart is Spanner, Google’s globally distributed database offering external consistency: transactions are ordered globally by commit timestamp, so a read at a given timestamp sees every write committed before it.[3] Zanzibar stores relation tuples (user, object, relation) here, enabling set-based policies over rigid ACLs.[1]
This setup yields 95th-percentile latency <10ms and >99.999% availability over years of production.[2]
Zanzibar’s Data Model: ReBAC and Tuple-Based Relations
Zanzibar’s central idea is its Relation-Based Access Control (ReBAC) model, which treats permissions as a directed graph of relationships between users, groups, and objects.[4]
The Triple (Tuple) Format
Permissions are stored as (user, object, relation) tuples:
- User: Individual, group, or userset (e.g., “user:alice”, “group:team-eng”).
- Object: Resource like “doc:123” or “folder:abc”.
- Relation: Permission type (e.g., “reader”, “editor”, “owner”).[1][4]
Example tuples:
user:alice@company.com -> doc:123#reader
group:team-eng -> doc:123#editor
doc:123 -> folder:projects#parent
user:bob@company.com -> group:team-eng#member
Graph Traversal for Checks
A “Can Alice read doc:123?” check traverses the graph:
- Direct: alice → doc:123#reader? Yes!
- Indirect: alice → group → doc:123#reader, or via folder hierarchies.[4]
This models complex policies like nested groups or delegated access (e.g., folder owners inherit child doc permissions).
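The traversal can be sketched in a few lines of Go using the example tuples from this section. This is an illustrative toy, not Zanzibar's algorithm: the `Store` type is hypothetical, any subject starting with `group:` is expanded through its `member` relation, and folder inheritance is left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"strings"
)

// Tuple follows the article's "subject -> object#relation" examples.
type Tuple struct{ Subject, Object, Relation string }

type Store struct{ tuples []Tuple }

// Check reports whether user holds relation on object, either directly or
// through a group (expanded via that group's "member" relation).
func (s *Store) Check(user, object, relation string) bool {
	return s.check(user, object, relation, map[string]bool{})
}

func (s *Store) check(user, object, relation string, seen map[string]bool) bool {
	key := object + "#" + relation
	if seen[key] {
		return false // already explored (or in progress): avoids loops and rework
	}
	seen[key] = true
	for _, t := range s.tuples {
		if t.Object != object || t.Relation != relation {
			continue
		}
		if t.Subject == user {
			return true // direct edge
		}
		if strings.HasPrefix(t.Subject, "group:") && s.check(user, t.Subject, "member", seen) {
			return true // indirect edge through group membership
		}
	}
	return false
}

// exampleStore loads the four example tuples shown above.
func exampleStore() *Store {
	return &Store{tuples: []Tuple{
		{"user:alice@company.com", "doc:123", "reader"},
		{"group:team-eng", "doc:123", "editor"},
		{"doc:123", "folder:projects", "parent"},
		{"user:bob@company.com", "group:team-eng", "member"},
	}}
}

func main() {
	s := exampleStore()
	fmt.Println(s.Check("user:alice@company.com", "doc:123", "reader")) // direct tuple
	fmt.Println(s.Check("user:bob@company.com", "doc:123", "editor"))   // via group:team-eng
	fmt.Println(s.Check("user:bob@company.com", "doc:123", "reader"))   // no path without a rewrite rule
}
```

The last check fails because nothing in the raw tuples says editors are also readers; that kind of rule belongs to the policy layer rather than to stored tuples.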
Namespace Configuration and Userset Rewrites
Developers define policies per namespace in a declarative configuration language (deliberately not a general-purpose, Turing-complete one) with:
- Expressiveness: Unions, intersections, exclusions (e.g., “readers OR editors EXCEPT ex-employees”).
- Userset Rewrites: Compute dynamic sets (e.g., “all viewers of parent folder”).
- Efficiency: Optimized for runtime evaluation.[1]
Practical Example:
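# Policy (illustrative pseudocode, not Zanzibar's actual configuration syntax):
# editors can read and write; readers can read; read access is inherited from the parent folder.
doc#reader = direct reader tuples
           OR doc#editor               (computed userset: every editor is also a reader)
           OR (doc#parent -> reader)   (tuple-to-userset: readers of the parent folder)
doc#editor = direct editor tuples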
A check fans out recursively but cancels early if a path grants access (eager cancellation).[7]
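The union, intersection, and exclusion operators listed above compose naturally as an expression tree. The following is a minimal sketch under assumed names (`Userset`, `Direct`, `Union`, `Intersection`, `Exclusion` are hypothetical types, and sets are plain in-memory maps) that evaluates the "readers OR editors EXCEPT ex-employees" example.

```go
package main

import "fmt"

// Userset is any expression that can answer "is this user in the set?".
type Userset interface{ Contains(user string) bool }

// Direct is a leaf: users listed explicitly.
type Direct map[string]bool

// Union, Intersection, and Exclusion combine child usersets.
type Union []Userset
type Intersection []Userset
type Exclusion struct{ Base, Subtract Userset }

func (d Direct) Contains(u string) bool { return d[u] }

func (un Union) Contains(u string) bool {
	for _, child := range un {
		if child.Contains(u) {
			return true // one granting child is enough
		}
	}
	return false
}

func (in Intersection) Contains(u string) bool {
	for _, child := range in {
		if !child.Contains(u) {
			return false // one denying child is enough
		}
	}
	return len(in) > 0
}

func (e Exclusion) Contains(u string) bool {
	return e.Base.Contains(u) && !e.Subtract.Contains(u)
}

// examplePolicy builds "readers OR editors EXCEPT ex-employees".
func examplePolicy() Userset {
	readers := Direct{"alice": true, "carol": true}
	editors := Direct{"bob": true}
	exEmployees := Direct{"carol": true}
	return Exclusion{Base: Union{readers, editors}, Subtract: exEmployees}
}

func main() {
	policy := examplePolicy()
	for _, u := range []string{"alice", "bob", "carol", "dave"} {
		fmt.Println(u, policy.Contains(u))
	}
}
```

Note that `Union` stops at the first granting child, which is the sequential analogue of the eager cancellation described above.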
Performance Optimizations: From Millions of QPS to Milliseconds
Zanzibar’s scale (trillions of ACLs, millions of QPS) relies on several optimizations.[2]
Caching Strategies
| Optimization | Description | Impact |
|---|---|---|
| Consistent Hashing | Routes similar requests to same nodes. | Higher cache hits; deduped computations.[3] |
| LRU Caches | Per-server, multi-layer (edge, RPC). | Avoids redundant Spanner reads.[4][5] |
| Leopard Indexing | In-memory nested group resolution. | Single call vs. serial queries.[3] |
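As a rough illustration of the per-server LRU layer in the table, here is a small cache of check results built on Go's `container/list`. The capacity, the key format, and the `LRU` type are assumptions; a real deployment would also need to tie cached entries to snapshot timestamps, which this sketch omits.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key     string
	allowed bool
}

// LRU caches check results; capacity and key format are illustrative.
type LRU struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> list element holding *entry
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the cached decision and marks the entry as recently used.
func (c *LRU) Get(key string) (allowed, ok bool) {
	el, found := c.items[key]
	if !found {
		return false, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).allowed, true
}

// Put stores a decision, evicting the least recently used entry when full.
func (c *LRU) Put(key string, allowed bool) {
	if el, found := c.items[key]; found {
		el.Value.(*entry).allowed = allowed
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, allowed})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	cache := NewLRU(2)
	cache.Put("doc:1#reader@alice", true)
	cache.Put("doc:2#reader@alice", false)
	cache.Get("doc:1#reader@alice")       // touch doc:1 so doc:2 becomes least recent
	cache.Put("doc:3#reader@alice", true) // over capacity: evicts doc:2
	_, ok := cache.Get("doc:2#reader@alice")
	fmt.Println(ok) // false: the entry was evicted
}
```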
Check Evaluation Engine
- Boolean Expression Tree: Check → recursive fan-out (indirect ACLs/groups).
- Concurrent Leaves: Parallel eval of base cases.
- Eager Cancellation: Short-circuit on decisive paths.
- Read Pooling: Batch identical Spanner RPCs.[7]
For “deep or wide” trees (many nested groups), Leopard pre-indexes.
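Concurrent leaf evaluation with eager cancellation maps well onto goroutines and `context` cancellation. The sketch below is an assumption-laden illustration of the pattern, not Zanzibar's engine: `Leaf`, `AnyOf`, and the simulated slow and fast lookups are hypothetical names, and only the "any child grants" (union) case is shown.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Leaf is one base-case lookup (for example, a single tuple read).
type Leaf func(ctx context.Context) bool

// AnyOf evaluates leaves concurrently and returns true as soon as one grants
// access; returning triggers cancel(), which tells the other leaves to stop.
func AnyOf(leaves ...Leaf) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	results := make(chan bool, len(leaves)) // buffered so no goroutine blocks on send
	for _, leaf := range leaves {
		go func(l Leaf) { results <- l(ctx) }(leaf)
	}
	for range leaves {
		if <-results {
			return true // decisive path found: short-circuit
		}
	}
	return false
}

// slowDeny simulates an expensive lookup that respects cancellation.
func slowDeny(ctx context.Context) bool {
	select {
	case <-time.After(2 * time.Second):
	case <-ctx.Done():
	}
	return false
}

func fastGrant(ctx context.Context) bool { return true }
func fastDeny(ctx context.Context) bool  { return false }

// demo reports the decision and whether it arrived well before the slow
// lookups would have finished.
func demo() (granted, finishedEarly bool) {
	start := time.Now()
	granted = AnyOf(slowDeny, fastGrant, slowDeny)
	return granted, time.Since(start) < time.Second
}

func main() {
	granted, early := demo()
	fmt.Println(granted, early)
}
```

An intersection node would use the mirror-image rule, returning false on the first denying child.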
Handling Filtering Queries
Zanzibar supports reverse queries (“What can user X access?”) via reverse indexes. Crucial for list endpoints (e.g., “User’s visible docs”).[5]
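The reverse-index idea can be sketched as a second map keyed by user rather than by object. This is only an illustration of the concept described above: the `ReverseIndex` type is hypothetical, it covers direct grants only, and a production system would also have to expand group and folder relationships.

```go
package main

import (
	"fmt"
	"sort"
)

// ReverseIndex answers "which objects does this user hold this relation on?".
type ReverseIndex struct {
	byUser map[string]map[string]bool // "user|relation" -> set of objects
}

func NewReverseIndex() *ReverseIndex {
	return &ReverseIndex{byUser: map[string]map[string]bool{}}
}

// Add is called alongside every tuple write so both directions stay in sync.
func (r *ReverseIndex) Add(user, object, relation string) {
	k := user + "|" + relation
	if r.byUser[k] == nil {
		r.byUser[k] = map[string]bool{}
	}
	r.byUser[k][object] = true
}

// ListObjects returns the matching objects in sorted order.
func (r *ReverseIndex) ListObjects(user, relation string) []string {
	set := r.byUser[user+"|"+relation]
	out := make([]string, 0, len(set))
	for obj := range set {
		out = append(out, obj)
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := NewReverseIndex()
	idx.Add("user:alice", "doc:123", "reader")
	idx.Add("user:alice", "doc:456", "reader")
	idx.Add("user:bob", "doc:123", "editor")
	fmt.Println(idx.ListObjects("user:alice", "reader"))
}
```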
Failure Handling and Resilience
Zanzibar is built for the real world:
- Replication: Data is replicated across multiple nodes for redundancy.[1]
- Failover: Traffic auto-reroutes on node failure.
- Watch API: Leopard syncs with Spanner changes in real-time.[3]
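- Spanner Guarantees: External consistency and snapshot reads keep checks from being evaluated against ACLs older than the content they protect.[5]
Over three years of production use, availability stayed above 99.999%, even at peak loads.[2]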
Real-World Applications at Google
Zanzibar unifies auth across:
- Drive: Folder hierarchies, sharing links.
- YouTube: Video viewers, channel subscribers.
- Photos/Calendar: Event invites, album shares.
- Cloud/Maps: API scopes, collaborative edits.[2][4]
Case Study: Shared Document
- Alice owns doc:123 and adds Bob as an editor.
- Tuple written: bob → doc:123#editor.
- Bob, who is a member of team-eng, shares the document with that group.
- Check for a team member: graph traversal grants access, possibly via multiple paths.
- Latency: <10ms globally.[4]
This consistency prevents leaks (e.g., applying old perms to new content).[5]
Open-Source Implementations and Alternatives
Google’s 2019 paper[2] inspired production systems:
- Authzed/SpiceDB: Zanzibar-inspired open-source system from Authzed; supports datastores such as PostgreSQL and CockroachDB. Reported to handle 10M+ QPS.[3]
- Oso: Policy engine with ReBAC support.[5]
- Permify/Aperm: Kubernetes-native Zanzibar clones.
Example with SpiceDB (Go client):
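// Assumes the authzed-go client: v1 is "github.com/authzed/authzed-go/proto/authzed/api/v1"
// and client was created with authzed.NewClient(...).
ctx := context.Background()
// Define schema
schema := `
definition user {}
definition group { relation member: user }
definition folder { relation reader: user | group#member }
definition document {
	relation parent: folder
	relation reader: user | group#member
	relation editor: user | group#member
	permission read = reader + editor + parent->reader
	permission write = editor
}`
if _, err := client.WriteSchema(ctx, &v1.WriteSchemaRequest{Schema: schema}); err != nil {
	log.Fatal(err)
}
doc := &v1.ObjectReference{ObjectType: "document", ObjectId: "123"}
alice := &v1.SubjectReference{Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"}}
// Write tuples: user:alice is a reader of document:123
_, err := client.WriteRelationships(ctx, &v1.WriteRelationshipsRequest{Updates: []*v1.RelationshipUpdate{{
	Operation:    v1.RelationshipUpdate_OPERATION_TOUCH,
	Relationship: &v1.Relationship{Resource: doc, Relation: "reader", Subject: alice},
}}})
if err != nil {
	log.Fatal(err)
}
// Check: can user:alice read document:123?
resp, err := client.CheckPermission(ctx, &v1.CheckPermissionRequest{Resource: doc, Permission: "read", Subject: alice})
if err != nil {
	log.Fatal(err)
}
fmt.Println(resp.Permissionship == v1.CheckPermissionResponse_PERMISSIONSHIP_HAS_PERMISSION) // true/false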
These scale to enterprise needs without Spanner.
Lessons for Building Your Own Authorization System
- Centralize with ReBAC: Graphs > lists for flexibility.
- Cache Aggressively: Multi-layer, consistent hashing.
- Strong Backend: Prioritize consistency (Spanner-like).
- Index Hierarchies: Avoid N+1 queries.
- Measure Everything: Aim for p99 <20ms.
Common pitfalls: Over-nesting (explodes traversal); ignoring reverse queries.
Challenges and Limitations
- Complexity: Graph policies are hard to audit at scale.
- Vendor Lock: Google’s implementation depends on Spanner; open-source implementations need an alternative datastore.
- Cost: In-memory indexes like Leopard are memory-hungry.[3]
- Evolution: Post-2019 updates (e.g., SpiceDB extensions) address gaps.
Despite this, Zanzibar’s design remains foundational.
Conclusion
Google Zanzibar redefined scalable authorization, proving a single system can serve billions with consistency, speed, and reliability. By modeling permissions as traversable graphs, leveraging Spanner’s guarantees, and optimizing with caches/indexes, it handles the impossible: real-time checks at planetary scale.[2][3]
For developers, the open paper and OSS ports democratize these ideas—build Zanzibar-inspired auth without reinventing the wheel. As services grow more collaborative, ReBAC systems like Zanzibar will be table stakes for secure, performant access control.