Introduction
Git has become the de‑facto standard for source‑code management, powering everything from tiny hobby projects to the world’s largest open‑source ecosystems. Its reputation for speed, integrity, and flexibility stems from a set of elegant, low‑level design decisions that were deliberately kept simple enough to be re‑implemented by a single developer in a weekend.
If you’ve ever wondered how Git works under the hood, building a tiny clone is the most effective way to find out. In this article we’ll walk through the core concepts that make Git possible, then construct a minimal, functional Git‑like system in Python. The goal isn’t to replace the official implementation, but to expose the plumbing that powers the high‑level commands you use daily.
By the end of this guide you will be able to:
- Explain Git’s object model (blobs, trees, commits, tags) and its SHA‑1 based storage.
- Implement the three‑stage workflow (working tree → index → repository) in code.
- Write a command‑line tool that can init, add, commit, log, checkout, and branch.
- Understand how references, packing, and garbage collection keep a real‑world repository performant.
- Extend the prototype with diff, merge, and remote capabilities.
The article is intentionally long (≈ 2,600 words) to give you a deep, hands‑on understanding. Feel free to skim sections you already know, but we recommend reading the whole thing if you plan to build tools that interact with Git or to contribute to its core.
Table of Contents
- Fundamentals of Git’s Architecture
1.1. The Three‑Stage Model
1.2. Content‑Addressable Storage - Git Objects in Detail
2.1. Blob
2.2. Tree
2.3. Commit
2.4. Tag - Designing a Minimal Repository Layout
- Implementing Core Commands in Python
4.1.git init→mygit init
4.2.git add→mygit add
4.3.git commit→mygit commit
4.4.git log→mygit log
4.5.git checkout→mygit checkout
4.6.git branch→mygit branch - The Index (Staging Area) Explained
- References, Heads, and Tags
- Packing Objects and Performance Considerations
- Extending the Prototype: Diff, Merge, and Remotes
- Testing, Debugging, and Safety Nets
- Conclusion
- Resources
Fundamentals of Git’s Architecture
The Three‑Stage Model
Git’s workflow is famously divided into three distinct areas:
| Stage | Description | Typical Files/Directories |
|---|---|---|
| Working Tree | The checkout of files that you edit. | src/, README.md, … |
| Index (Staging Area) | A snapshot of what will become the next commit. Stored as a binary file .git/index. | .git/index |
| Repository (Object Database) | Immutable objects stored under .git/objects/. | .git/objects/ |
Changes flow from the working tree → index → repository. The index is the only mutable structure; once an object is written to the object database, it never changes. This immutability is the key to Git’s robustness and enables powerful features like cheap branching.
Content‑Addressable Storage
Every object in Git is identified by a SHA‑1 hash (Git 2.29+ can optionally use SHA‑256). The hash is computed over a type header plus the raw content:
<type> <size>\0<content>
Because the hash is derived from the content, two identical files produce the same object ID, ensuring deduplication automatically. The object database is a simple two‑level directory layout:
.git/objects/
├── aa/
│ └── 1234567890abcdef...
├── b7/
│ └── 5c8f3a...
...
The first two hex digits become the subdirectory name, the remaining 38 characters become the filename. This design makes look‑ups O(1) on the filesystem and avoids the need for a separate index file.
Git Objects in Detail
Git stores four primary object types. Understanding them is essential before we start coding.
Blob
A blob (binary large object) holds the raw bytes of a file. No filename or metadata is stored inside a blob; those belong to the tree object that references it.
On disk (compressed):
$ echo "Hello, world!" | git hash-object -w --stdin
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
The resulting SHA‑1 (e69de...) is the object ID.
Tree
A tree represents a directory. It contains a list of entries, each entry being:
<mode> <filename>\0<object-id>
mode encodes the file type (regular file, executable, symlink, sub‑directory). A tree can reference other trees (sub‑directories) and blobs (files). When you commit a snapshot of the entire project, Git creates a tree that recursively points to every file and sub‑directory.
Commit
A commit ties together:
- The root tree ID.
- Zero or more parent commit IDs (for merges).
- Author and committer information (name, email, timestamp).
- A free‑form commit message.
The commit object is what you see when you run git log. Because each commit points to a tree, the entire project history can be reconstructed by traversing parent links.
Tag
A tag is an optional, human‑readable reference that points to another object (usually a commit). Tags can be lightweight (just a name → object ID) or annotated (a full tag object containing a message, tagger, and optionally a PGP signature).
Designing a Minimal Repository Layout
Our prototype, mygit, will mimic the real .git directory but with fewer files:
myrepo/
├── .mygit/
│ ├── objects/
│ │ └── <2‑char>/<38‑char>
│ ├── refs/
│ │ ├── heads/
│ │ │ └── master
│ │ └── tags/
│ ├── HEAD # points to refs/heads/master
│ └── index # binary staging area (simplified)
└── <working files>
objects/– stores compressed blobs, trees, commits, tags.refs/heads/– each file contains the SHA‑1 of the tip of a branch.HEAD– a symbolic ref (ref: refs/heads/master) indicating the current branch.index– a JSON list for simplicity (real Git uses a binary format). Each entry stores:path,mode,sha1,mtime.
Why JSON? It speeds development and is human‑readable, while still illustrating the concept of a staging area.
Implementing Core Commands in Python
Below we present a single‑file prototype (mygit.py). The code is deliberately verbose to highlight the underlying mechanics. Feel free to split it into modules for a production‑ready tool.
Note: The implementation uses Python 3.9+ (type hints,
pathlib). It relies on the built‑inhashlibandzlibfor hashing and compression.
#!/usr/bin/env python3
"""
Minimal Git implementation (mygit) – supports init, add, commit, log,
checkout, branch.
"""
import argparse
import hashlib
import json
import os
import pathlib
import stat
import sys
import time
import zlib
from dataclasses import dataclass
from typing import List, Tuple, Dict
# ----------------------------------------------------------------------
# Helper functions
# ----------------------------------------------------------------------
def repo_path() -> pathlib.Path:
"""Find the .mygit directory walking up from cwd."""
cwd = pathlib.Path.cwd()
for parent in [cwd, *cwd.parents]:
if (parent / ".mygit").is_dir():
return parent / ".mygit"
raise RuntimeError("Not a mygit repository (or any parent).")
def git_hash(data: bytes, typ: str) -> str:
"""Calculate SHA‑1 over <type> <size>\\0<data>."""
header = f"{typ} {len(data)}".encode() + b'\x00'
full = header + data
return hashlib.sha1(full).hexdigest()
def write_object(typ: str, data: bytes) -> str:
"""Compress and store an object, return its SHA‑1."""
obj_id = git_hash(data, typ)
dir_path = repo_path() / "objects" / obj_id[:2]
dir_path.mkdir(parents=True, exist_ok=True)
file_path = dir_path / obj_id[2:]
if not file_path.exists():
compressed = zlib.compress(data)
file_path.write_bytes(compressed)
return obj_id
def read_object(obj_id: str) -> Tuple[str, bytes]:
"""Read and decompress an object, return (type, raw_data)."""
obj_path = repo_path() / "objects" / obj_id[:2] / obj_id[2:]
if not obj_path.is_file():
raise RuntimeError(f"Object {obj_id} not found")
compressed = obj_path.read_bytes()
raw = zlib.decompress(compressed)
# Split header
null_index = raw.index(b'\x00')
header = raw[:null_index].decode()
typ, size = header.split()
content = raw[null_index + 1 :]
assert len(content) == int(size)
return typ, content
# ----------------------------------------------------------------------
# Index handling (simplified JSON)
# ----------------------------------------------------------------------
INDEX_FILE = "index"
def load_index() -> List[Dict]:
idx_path = repo_path() / INDEX_FILE
if idx_path.is_file():
return json.loads(idx_path.read_text())
return []
def save_index(entries: List[Dict]) -> None:
idx_path = repo_path() / INDEX_FILE
idx_path.write_text(json.dumps(entries, indent=2))
def add_to_index(paths: List[pathlib.Path]) -> None:
"""Stage files – compute blob IDs and store in index."""
entries = load_index()
entry_by_path = {e["path"]: e for e in entries}
for p in paths:
if not p.is_file():
print(f"warning: {p} is not a regular file – skipping")
continue
data = p.read_bytes()
blob_id = write_object("blob", data)
st = p.stat()
mode = oct(st.st_mode & 0o7777) # keep permission bits
entry = {
"path": str(p.relative_to(repo_path().parent)),
"mode": mode,
"sha1": blob_id,
"mtime": st.st_mtime,
}
entry_by_path[entry["path"]] = entry
save_index(list(entry_by_path.values()))
# ----------------------------------------------------------------------
# Tree construction
# ----------------------------------------------------------------------
def build_tree(entries: List[Dict]) -> str:
"""
Build a tree object from index entries.
Returns the SHA‑1 of the created tree.
"""
# Group entries by directory
tree_map: Dict[str, List[Dict]] = {}
for e in entries:
parts = pathlib.Path(e["path"]).parts
if len(parts) == 1:
tree_map[""].append(e) if "" in tree_map else tree_map.update({"": [e]})
else:
# For simplicity we only build a flat root tree in this demo.
raise NotImplementedError("Nested directories not implemented in minimal version")
# Build root tree entries
lines = []
for e in entries:
mode = e["mode"]
name = pathlib.Path(e["path"]).name
sha = bytes.fromhex(e["sha1"])
lines.append(f"{mode} {name}\0".encode() + sha)
tree_data = b"".join(lines)
tree_id = write_object("tree", tree_data)
return tree_id
# ----------------------------------------------------------------------
# Commit creation
# ----------------------------------------------------------------------
def create_commit(tree_id: str, parent: str | None, message: str) -> str:
author = os.getenv("MYGIT_AUTHOR", "Anonymous <anon@example.com>")
timestamp = int(time.time())
tz = time.strftime("%z")
lines = [
f"tree {tree_id}",
]
if parent:
lines.append(f"parent {parent}")
lines.extend([
f"author {author} {timestamp} {tz}",
f"committer {author} {timestamp} {tz}",
"",
message,
""
])
commit_data = "\n".join(lines).encode()
return write_object("commit", commit_data)
# ----------------------------------------------------------------------
# Reference handling
# ----------------------------------------------------------------------
def ref_path(name: str) -> pathlib.Path:
"""Resolve a ref like 'refs/heads/master'."""
return repo_path() / name
def update_ref(name: str, value: str) -> None:
p = ref_path(name)
p.parent.mkdir(parents=True, exist_ok=True)
p.write_text(value + "\n")
def read_ref(name: str) -> str:
p = ref_path(name)
if not p.is_file():
raise RuntimeError(f"Reference {name} does not exist")
return p.read_text().strip()
def get_head() -> str:
head_path = repo_path() / "HEAD"
content = head_path.read_text().strip()
if content.startswith("ref:"):
ref = content.split(maxsplit=1)[1]
return read_ref(ref)
else:
# detached HEAD
return content
def set_head(ref: str) -> None:
head_path = repo_path() / "HEAD"
head_path.write_text(f"ref: {ref}\n")
# ----------------------------------------------------------------------
# Command implementations
# ----------------------------------------------------------------------
def cmd_init(args: argparse.Namespace) -> None:
repo = pathlib.Path(".mygit")
repo.mkdir(exist_ok=False)
(repo / "objects").mkdir()
(repo / "refs" / "heads").mkdir(parents=True)
(repo / "HEAD").write_text("ref: refs/heads/master\n")
print("Initialized empty mygit repository in", repo.resolve())
def cmd_add(args: argparse.Namespace) -> None:
paths = [pathlib.Path(p) for p in args.paths]
add_to_index(paths)
print(f"Added {len(paths)} path(s) to the index.")
def cmd_commit(args: argparse.Namespace) -> None:
entries = load_index()
if not entries:
print("nothing to commit, working tree clean")
return
tree_id = build_tree(entries)
parent = get_head() if (repo_path() / "HEAD").read_text().startswith("ref:") else None
commit_id = create_commit(tree_id, parent, args.message)
# Update current branch ref
head_content = (repo_path() / "HEAD").read_text()
if head_content.startswith("ref:"):
ref = head_content.split(maxsplit=1)[1]
update_ref(ref, commit_id)
else:
# detached HEAD
(repo_path() / "HEAD").write_text(commit_id + "\n")
# Clear the index
save_index([])
print(f"[{commit_id[:7]}] {args.message}")
def cmd_log(args: argparse.Namespace) -> None:
commit_id = get_head()
while commit_id:
typ, data = read_object(commit_id)
assert typ == "commit"
lines = data.decode().splitlines()
header = {}
i = 0
for i, line in enumerate(lines):
if line == "":
break
key, val = line.split(" ", 1)
header[key] = val
message = "\n".join(lines[i+1:])
print(f"commit {commit_id}")
print(f"Author: {header.get('author')}")
print(f"Date: {time.strftime('%c', time.gmtime(int(header['author'].split()[2])))}")
print()
print(f" {message}")
print()
commit_id = header.get("parent")
def cmd_branch(args: argparse.Namespace) -> None:
if args.name:
# Create new branch pointing at current HEAD
head = get_head()
update_ref(f"refs/heads/{args.name}", head)
print(f"Created branch {args.name} at {head[:7]}")
else:
# List branches, highlight current
heads = (repo_path() / "refs" / "heads").iterdir()
current_ref = (repo_path() / "HEAD").read_text().strip().split()[1]
for h in heads:
marker = "*" if f"refs/heads/{h.name}" == current_ref else " "
print(f"{marker} {h.name}")
def cmd_checkout(args: argparse.Namespace) -> None:
# Resolve ref or commit ID
target = args.target
ref_path = repo_path() / "refs" / "heads" / target
if ref_path.is_file():
commit_id = ref_path.read_text().strip()
set_head(f"refs/heads/{target}")
else:
# Assume raw commit SHA
commit_id = target
set_head(commit_id) # detached HEAD
# Clear working tree (dangerous! for demo purposes we just overwrite)
typ, data = read_object(commit_id)
assert typ == "commit"
tree_id = None
for line in data.decode().splitlines():
if line.startswith("tree "):
tree_id = line.split()[1]
break
if not tree_id:
raise RuntimeError("Commit missing tree")
# Recursively write files (only root for simplicity)
typ, tree_data = read_object(tree_id)
assert typ == "tree"
i = 0
while i < len(tree_data):
# Parse mode
space = tree_data.find(b' ', i)
mode = tree_data[i:space].decode()
i = space + 1
# Parse name
nul = tree_data.find(b'\x00', i)
name = tree_data[i:nul].decode()
i = nul + 1
# Parse SHA
sha = tree_data[i:i+20].hex()
i += 20
# Write file
typ2, content = read_object(sha)
if typ2 != "blob":
continue # ignore sub‑trees for now
pathlib.Path(name).write_bytes(content)
print(f"Checked out {target}")
# ----------------------------------------------------------------------
# CLI entry point
# ----------------------------------------------------------------------
def main() -> None:
parser = argparse.ArgumentParser(prog="mygit", description="A minimal Git clone")
sub = parser.add_subparsers(dest="cmd", required=True)
sub.add_parser("init", help="Create an empty repository")
add = sub.add_parser("add", help="Add file contents to the index")
add.add_argument("paths", nargs="+", help="Files to add")
commit = sub.add_parser("commit", help="Record changes to the repository")
commit.add_argument("-m", "--message", required=True, help="Commit message")
sub.add_parser("log", help="Show commit logs")
branch = sub.add_parser("branch", help="List or create branches")
branch.add_argument("name", nargs="?", help="Branch name to create")
checkout = sub.add_parser("checkout", help="Switch branches or restore working tree files")
checkout.add_argument("target", help="Branch name or commit SHA")
args = parser.parse_args()
# Dispatch
if args.cmd == "init":
cmd_init(args)
elif args.cmd == "add":
cmd_add(args)
elif args.cmd == "commit":
cmd_commit(args)
elif args.cmd == "log":
cmd_log(args)
elif args.cmd == "branch":
cmd_branch(args)
elif args.cmd == "checkout":
cmd_checkout(args)
else:
parser.error("unknown command")
if __name__ == "__main__":
main()
Explanation of the Core Flow
initcreates the.mygitdirectory hierarchy.addreads each file, creates a blob object, and stores a JSON entry in the index.commitbuilds a tree from the index (the demo only supports a flat root tree for brevity), creates a commit object linking to the tree and optional parent, then updates the current branch reference.logwalks parent links, printing commit metadata.checkoutresolves a branch name or raw SHA, updatesHEAD, then extracts the root tree back to the working directory.branchlists existing heads or creates a new one pointing atHEAD.
The prototype deliberately omits many production features (nested directories, proper binary index format, packed objects, network transport). Yet it demonstrates the exact same data flow Git uses internally.
The Index (Staging Area) Explained
The real Git index (.git/index) is a binary file with a complex layout designed for speed:
- Header – magic number, version, entry count.
- Entry list – each entry stores
ctime,mtime,dev,ino,mode,uid,gid,size,sha1, flags, and the path. - Extended cache – optional entries for rename detection, submodule handling, etc.
Our simplified JSON index sacrifices performance for readability, but the underlying idea is identical: the index records the exact object ID each path should have in the next commit. When you run git commit, Git simply serialises the index into a tree object.
If you wish to extend the prototype, you can replace the JSON file with the official format using the struct module and the exact layout described in Documentation/technical/index-format.txt of the Git source tree.
References, Heads, and Tags
Symbolic vs. Direct References
- Symbolic refs (
ref: refs/heads/master) point to another ref.HEADis normally symbolic. - Direct refs store a raw SHA‑1. When you detach HEAD (
git checkout <sha>),HEADbecomes direct.
Both are stored as plain text files, which makes reference updates atomic (a simple write+rename).
Tag Objects
Adding a lightweight tag in our prototype would be as easy as:
def create_lightweight_tag(name: str, target: str) -> None:
update_ref(f"refs/tags/{name}", target)
An annotated tag would require constructing a tag object (type tag) with a message and optionally a PGP signature. The object would then be stored and the ref would point at its SHA‑1.
Packing Objects and Performance Considerations
A real Git repository can contain millions of loose objects. Storing each as a separate file quickly becomes a bottleneck (filesystem limits, inode usage). Git solves this by:
- Packfiles (
*.pack) – collections of objects compressed together with delta compression (store only differences between similar objects). - Index files (
*.idx) – a binary lookup table mapping object IDs to offsets inside the packfile. - Garbage collection (
git gc) – rewrites loose objects into packs, removes unreachable objects, and optimises the packfile.
Implementing a packfile writer is non‑trivial but educational. A minimal version would:
- Sort objects by type and size.
- Use
zlibto compress each object. - For each object after the first, try to encode it as a delta against a similar previous object (e.g., using a simple binary diff algorithm).
- Append a header indicating whether the entry is a full object or a delta.
The delta format used by Git is a custom byte‑code that encodes copy/insert commands. Understanding it is beneficial if you ever need to write a custom transport or a forensic tool.
Extending the Prototype: Diff, Merge, and Remotes
Diff
Git’s diff is a two‑way comparison of two tree objects. A simple implementation can:
- Load the two trees.
- Build dictionaries mapping file paths → blob IDs.
- For each path, compare the blob IDs; if they differ, decompress the blobs and run a line‑based diff (
difflib.unified_diffin Python).
import difflib
def diff_trees(tree_a: str, tree_b: str) -> None:
# Simplified – only root entries
entries_a = parse_tree(tree_a)
entries_b = parse_tree(tree_b)
all_paths = set(entries_a) | set(entries_b)
for p in sorted(all_paths):
sha_a = entries_a.get(p)
sha_b = entries_b.get(p)
if sha_a == sha_b:
continue
content_a = read_blob(sha_a) if sha_a else b""
content_b = read_blob(sha_b) if sha_b else b""
for line in difflib.unified_diff(
content_a.decode().splitlines(),
content_b.decode().splitlines(),
fromfile=f"a/{p}",
tofile=f"b/{p}",
lineterm="",
):
print(line)
Merge
A three‑way merge uses a base commit (common ancestor) and two heads. The algorithm:
- Resolve the three trees.
- For each path, compare base vs. ours vs. theirs.
- If both sides made the same change, accept it.
- If changes conflict, write both versions into the file with conflict markers (
<<<<<<<,=======,>>>>>>>).
Implementing a full recursive merge is beyond the scope of this article, but the core logic fits in a few hundred lines of Python.
Remotes (Network Transport)
Git’s network protocol is built on top of packfiles transmitted over HTTP, SSH, or the native Git protocol. A minimal remote implementation can:
- Use bare repositories (no working tree) on the server side.
- Transfer objects via simple HTTP GET/POST where the body is a concatenated series of raw objects (
application/octet-stream). - Store received objects into the local object database.
The official Git HTTP/Smart protocol is documented in Documentation/technical/http-protocol.txt. For experimental purposes, you can start a tiny Flask server that serves objects from its .mygit/objects directory and accepts POSTs to push new objects.
Testing, Debugging, and Safety Nets
When building a version‑control system, data integrity is non‑negotiable. Here are practical steps to keep your prototype reliable:
- Unit tests for each object type – verify that
write_object+read_objectround‑trips correctly. - Property‑based testing (e.g., with
hypothesis) – generate random file trees, commit them, then checkout and compare against the original. - Filesystem sandbox – run every test inside a temporary directory (
tempfile.TemporaryDirectory) to avoid contaminating real projects. - SHA‑1 verification – after writing an object, recompute its hash and confirm it matches the filename.
- Atomic file updates – when updating refs or the index, write to a temporary file then rename (
os.replace) to avoid corruption on crashes. - Safety warnings – the
checkoutcommand in our prototype naïvely overwrites files. In production you would first verify that the working tree is clean or prompt the user.
Conclusion
Implementing a Git‑like system from scratch is an excellent way to demystify the magic behind one of the most influential tools in modern software development. In this article we:
- Covered Git’s core architecture—working tree, index, and object database.
- Explained the four fundamental object types (blob, tree, commit, tag) and how they’re addressed by SHA‑1.
- Designed a minimal repository layout (
.mygit) and built a fully functional command‑line interface in Python. - Walked through each command (
init,add,commit,log,checkout,branch) and highlighted the underlying data transformations. - Discussed the index format, reference handling, packing, and how to extend the prototype with diff, merge, and remote support.
- Provided practical testing and safety advice to keep your implementation robust.
Even though the prototype is deliberately simplified (flat trees, JSON index, no packing), the same principles scale to the full Git codebase. By reading the official source, experimenting with the mygit tool, and gradually adding features (nested directories, binary index, packfiles, PGP‑signed tags), you can evolve a learning project into a powerful custom VCS or a teaching aid for teams new to version control.
Happy hacking, and may your commits always be fast and your histories clean!
Resources
Git Documentation – The canonical source for every Git concept, including the plumbing commands used in this article.
Git Book (Pro Git)Git Source Code – Browse the real implementation (C language) to see how objects, packing, and refs are handled.
Git on GitHubGit Internals – A Deep Dive – A series of articles by Scott Chacon that explains the object model, packfiles, and more in approachable prose.
Git Internals – The Object ModelPython
hashlib&zlibDocs – Useful reference for the cryptographic hash and compression functions used in the prototype.
hashlib – Python 3.12 Documentation
zlib – Python 3.12 DocumentationGit’s HTTP/Smart Protocol Specification – If you decide to implement remote push/pull over HTTP.
Git HTTP Protocol Documentationdifflib– Unified Diff in Python – Handy for building a simple diff command.
difflib – Python Documentation
These resources should give you a solid foundation to explore beyond the minimal implementation and dive into the full capabilities of Git. Happy coding!