Introduction

Git has become the de‑facto standard for source‑code management, powering everything from tiny hobby projects to the world’s largest open‑source ecosystems. Its reputation for speed, integrity, and flexibility stems from a set of elegant, low‑level design decisions that were deliberately kept simple enough to be re‑implemented by a single developer in a weekend.

If you’ve ever wondered how Git works under the hood, building a tiny clone is the most effective way to find out. In this article we’ll walk through the core concepts that make Git possible, then construct a minimal, functional Git‑like system in Python. The goal isn’t to replace the official implementation, but to expose the plumbing that powers the high‑level commands you use daily.

By the end of this guide you will be able to:

  • Explain Git’s object model (blobs, trees, commits, tags) and its SHA‑1 based storage.
  • Implement the three‑stage workflow (working tree → index → repository) in code.
  • Write a command‑line tool that can init, add, commit, log, checkout, and branch.
  • Understand how references, packing, and garbage collection keep a real‑world repository performant.
  • Extend the prototype with diff, merge, and remote capabilities.

The article is intentionally long (≈ 2,600 words) to give you a deep, hands‑on understanding. Feel free to skim sections you already know, but we recommend reading the whole thing if you plan to build tools that interact with Git or to contribute to its core.


Table of Contents

  1. Fundamentals of Git’s Architecture
    1.1. The Three‑Stage Model
    1.2. Content‑Addressable Storage
  2. Git Objects in Detail
    2.1. Blob
    2.2. Tree
    2.3. Commit
    2.4. Tag
  3. Designing a Minimal Repository Layout
  4. Implementing Core Commands in Python
    4.1. git initmygit init
    4.2. git addmygit add
    4.3. git commitmygit commit
    4.4. git logmygit log
    4.5. git checkoutmygit checkout
    4.6. git branchmygit branch
  5. The Index (Staging Area) Explained
  6. References, Heads, and Tags
  7. Packing Objects and Performance Considerations
  8. Extending the Prototype: Diff, Merge, and Remotes
  9. Testing, Debugging, and Safety Nets
  10. Conclusion
  11. Resources

Fundamentals of Git’s Architecture

The Three‑Stage Model

Git’s workflow is famously divided into three distinct areas:

StageDescriptionTypical Files/Directories
Working TreeThe checkout of files that you edit.src/, README.md, …
Index (Staging Area)A snapshot of what will become the next commit. Stored as a binary file .git/index..git/index
Repository (Object Database)Immutable objects stored under .git/objects/..git/objects/

Changes flow from the working tree → index → repository. The index is the only mutable structure; once an object is written to the object database, it never changes. This immutability is the key to Git’s robustness and enables powerful features like cheap branching.

Content‑Addressable Storage

Every object in Git is identified by a SHA‑1 hash (Git 2.29+ can optionally use SHA‑256). The hash is computed over a type header plus the raw content:

<type> <size>\0<content>

Because the hash is derived from the content, two identical files produce the same object ID, ensuring deduplication automatically. The object database is a simple two‑level directory layout:

.git/objects/
  ├── aa/
  │   └── 1234567890abcdef...
  ├── b7/
  │   └── 5c8f3a...
  ...

The first two hex digits become the subdirectory name, the remaining 38 characters become the filename. This design makes look‑ups O(1) on the filesystem and avoids the need for a separate index file.


Git Objects in Detail

Git stores four primary object types. Understanding them is essential before we start coding.

Blob

A blob (binary large object) holds the raw bytes of a file. No filename or metadata is stored inside a blob; those belong to the tree object that references it.

On disk (compressed):

$ echo "Hello, world!" | git hash-object -w --stdin
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

The resulting SHA‑1 (e69de...) is the object ID.

Tree

A tree represents a directory. It contains a list of entries, each entry being:

<mode> <filename>\0<object-id>

mode encodes the file type (regular file, executable, symlink, sub‑directory). A tree can reference other trees (sub‑directories) and blobs (files). When you commit a snapshot of the entire project, Git creates a tree that recursively points to every file and sub‑directory.

Commit

A commit ties together:

  • The root tree ID.
  • Zero or more parent commit IDs (for merges).
  • Author and committer information (name, email, timestamp).
  • A free‑form commit message.

The commit object is what you see when you run git log. Because each commit points to a tree, the entire project history can be reconstructed by traversing parent links.

Tag

A tag is an optional, human‑readable reference that points to another object (usually a commit). Tags can be lightweight (just a name → object ID) or annotated (a full tag object containing a message, tagger, and optionally a PGP signature).


Designing a Minimal Repository Layout

Our prototype, mygit, will mimic the real .git directory but with fewer files:

myrepo/
  ├── .mygit/
  │   ├── objects/
  │   │   └── <2‑char>/<38‑char>
  │   ├── refs/
  │   │   ├── heads/
  │   │   │   └── master
  │   │   └── tags/
  │   ├── HEAD               # points to refs/heads/master
  │   └── index              # binary staging area (simplified)
  └── <working files>
  • objects/ – stores compressed blobs, trees, commits, tags.
  • refs/heads/ – each file contains the SHA‑1 of the tip of a branch.
  • HEAD – a symbolic ref (ref: refs/heads/master) indicating the current branch.
  • index – a JSON list for simplicity (real Git uses a binary format). Each entry stores: path, mode, sha1, mtime.

Why JSON? It speeds development and is human‑readable, while still illustrating the concept of a staging area.


Implementing Core Commands in Python

Below we present a single‑file prototype (mygit.py). The code is deliberately verbose to highlight the underlying mechanics. Feel free to split it into modules for a production‑ready tool.

Note: The implementation uses Python 3.9+ (type hints, pathlib). It relies on the built‑in hashlib and zlib for hashing and compression.

#!/usr/bin/env python3
"""
Minimal Git implementation (mygit) – supports init, add, commit, log,
checkout, branch.
"""

import argparse
import hashlib
import json
import os
import pathlib
import stat
import sys
import time
import zlib
from dataclasses import dataclass
from typing import List, Tuple, Dict

# ----------------------------------------------------------------------
# Helper functions
# ----------------------------------------------------------------------
def repo_path() -> pathlib.Path:
    """Find the .mygit directory walking up from cwd."""
    cwd = pathlib.Path.cwd()
    for parent in [cwd, *cwd.parents]:
        if (parent / ".mygit").is_dir():
            return parent / ".mygit"
    raise RuntimeError("Not a mygit repository (or any parent).")


def git_hash(data: bytes, typ: str) -> str:
    """Calculate SHA‑1 over <type> <size>\\0<data>."""
    header = f"{typ} {len(data)}".encode() + b'\x00'
    full = header + data
    return hashlib.sha1(full).hexdigest()


def write_object(typ: str, data: bytes) -> str:
    """Compress and store an object, return its SHA‑1."""
    obj_id = git_hash(data, typ)
    dir_path = repo_path() / "objects" / obj_id[:2]
    dir_path.mkdir(parents=True, exist_ok=True)
    file_path = dir_path / obj_id[2:]

    if not file_path.exists():
        compressed = zlib.compress(data)
        file_path.write_bytes(compressed)
    return obj_id


def read_object(obj_id: str) -> Tuple[str, bytes]:
    """Read and decompress an object, return (type, raw_data)."""
    obj_path = repo_path() / "objects" / obj_id[:2] / obj_id[2:]
    if not obj_path.is_file():
        raise RuntimeError(f"Object {obj_id} not found")
    compressed = obj_path.read_bytes()
    raw = zlib.decompress(compressed)

    # Split header
    null_index = raw.index(b'\x00')
    header = raw[:null_index].decode()
    typ, size = header.split()
    content = raw[null_index + 1 :]
    assert len(content) == int(size)
    return typ, content


# ----------------------------------------------------------------------
# Index handling (simplified JSON)
# ----------------------------------------------------------------------
INDEX_FILE = "index"


def load_index() -> List[Dict]:
    idx_path = repo_path() / INDEX_FILE
    if idx_path.is_file():
        return json.loads(idx_path.read_text())
    return []


def save_index(entries: List[Dict]) -> None:
    idx_path = repo_path() / INDEX_FILE
    idx_path.write_text(json.dumps(entries, indent=2))


def add_to_index(paths: List[pathlib.Path]) -> None:
    """Stage files – compute blob IDs and store in index."""
    entries = load_index()
    entry_by_path = {e["path"]: e for e in entries}

    for p in paths:
        if not p.is_file():
            print(f"warning: {p} is not a regular file – skipping")
            continue

        data = p.read_bytes()
        blob_id = write_object("blob", data)

        st = p.stat()
        mode = oct(st.st_mode & 0o7777)  # keep permission bits

        entry = {
            "path": str(p.relative_to(repo_path().parent)),
            "mode": mode,
            "sha1": blob_id,
            "mtime": st.st_mtime,
        }
        entry_by_path[entry["path"]] = entry

    save_index(list(entry_by_path.values()))


# ----------------------------------------------------------------------
# Tree construction
# ----------------------------------------------------------------------
def build_tree(entries: List[Dict]) -> str:
    """
    Build a tree object from index entries.
    Returns the SHA‑1 of the created tree.
    """
    # Group entries by directory
    tree_map: Dict[str, List[Dict]] = {}
    for e in entries:
        parts = pathlib.Path(e["path"]).parts
        if len(parts) == 1:
            tree_map[""].append(e) if "" in tree_map else tree_map.update({"": [e]})
        else:
            # For simplicity we only build a flat root tree in this demo.
            raise NotImplementedError("Nested directories not implemented in minimal version")

    # Build root tree entries
    lines = []
    for e in entries:
        mode = e["mode"]
        name = pathlib.Path(e["path"]).name
        sha = bytes.fromhex(e["sha1"])
        lines.append(f"{mode} {name}\0".encode() + sha)

    tree_data = b"".join(lines)
    tree_id = write_object("tree", tree_data)
    return tree_id


# ----------------------------------------------------------------------
# Commit creation
# ----------------------------------------------------------------------
def create_commit(tree_id: str, parent: str | None, message: str) -> str:
    author = os.getenv("MYGIT_AUTHOR", "Anonymous <anon@example.com>")
    timestamp = int(time.time())
    tz = time.strftime("%z")
    lines = [
        f"tree {tree_id}",
    ]
    if parent:
        lines.append(f"parent {parent}")
    lines.extend([
        f"author {author} {timestamp} {tz}",
        f"committer {author} {timestamp} {tz}",
        "",
        message,
        ""
    ])
    commit_data = "\n".join(lines).encode()
    return write_object("commit", commit_data)


# ----------------------------------------------------------------------
# Reference handling
# ----------------------------------------------------------------------
def ref_path(name: str) -> pathlib.Path:
    """Resolve a ref like 'refs/heads/master'."""
    return repo_path() / name


def update_ref(name: str, value: str) -> None:
    p = ref_path(name)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(value + "\n")


def read_ref(name: str) -> str:
    p = ref_path(name)
    if not p.is_file():
        raise RuntimeError(f"Reference {name} does not exist")
    return p.read_text().strip()


def get_head() -> str:
    head_path = repo_path() / "HEAD"
    content = head_path.read_text().strip()
    if content.startswith("ref:"):
        ref = content.split(maxsplit=1)[1]
        return read_ref(ref)
    else:
        # detached HEAD
        return content


def set_head(ref: str) -> None:
    head_path = repo_path() / "HEAD"
    head_path.write_text(f"ref: {ref}\n")


# ----------------------------------------------------------------------
# Command implementations
# ----------------------------------------------------------------------
def cmd_init(args: argparse.Namespace) -> None:
    repo = pathlib.Path(".mygit")
    repo.mkdir(exist_ok=False)
    (repo / "objects").mkdir()
    (repo / "refs" / "heads").mkdir(parents=True)
    (repo / "HEAD").write_text("ref: refs/heads/master\n")
    print("Initialized empty mygit repository in", repo.resolve())


def cmd_add(args: argparse.Namespace) -> None:
    paths = [pathlib.Path(p) for p in args.paths]
    add_to_index(paths)
    print(f"Added {len(paths)} path(s) to the index.")


def cmd_commit(args: argparse.Namespace) -> None:
    entries = load_index()
    if not entries:
        print("nothing to commit, working tree clean")
        return

    tree_id = build_tree(entries)
    parent = get_head() if (repo_path() / "HEAD").read_text().startswith("ref:") else None
    commit_id = create_commit(tree_id, parent, args.message)
    # Update current branch ref
    head_content = (repo_path() / "HEAD").read_text()
    if head_content.startswith("ref:"):
        ref = head_content.split(maxsplit=1)[1]
        update_ref(ref, commit_id)
    else:
        # detached HEAD
        (repo_path() / "HEAD").write_text(commit_id + "\n")
    # Clear the index
    save_index([])
    print(f"[{commit_id[:7]}] {args.message}")


def cmd_log(args: argparse.Namespace) -> None:
    commit_id = get_head()
    while commit_id:
        typ, data = read_object(commit_id)
        assert typ == "commit"
        lines = data.decode().splitlines()
        header = {}
        i = 0
        for i, line in enumerate(lines):
            if line == "":
                break
            key, val = line.split(" ", 1)
            header[key] = val
        message = "\n".join(lines[i+1:])
        print(f"commit {commit_id}")
        print(f"Author: {header.get('author')}")
        print(f"Date:   {time.strftime('%c', time.gmtime(int(header['author'].split()[2])))}")
        print()
        print(f"    {message}")
        print()
        commit_id = header.get("parent")


def cmd_branch(args: argparse.Namespace) -> None:
    if args.name:
        # Create new branch pointing at current HEAD
        head = get_head()
        update_ref(f"refs/heads/{args.name}", head)
        print(f"Created branch {args.name} at {head[:7]}")
    else:
        # List branches, highlight current
        heads = (repo_path() / "refs" / "heads").iterdir()
        current_ref = (repo_path() / "HEAD").read_text().strip().split()[1]
        for h in heads:
            marker = "*" if f"refs/heads/{h.name}" == current_ref else " "
            print(f"{marker} {h.name}")


def cmd_checkout(args: argparse.Namespace) -> None:
    # Resolve ref or commit ID
    target = args.target
    ref_path = repo_path() / "refs" / "heads" / target
    if ref_path.is_file():
        commit_id = ref_path.read_text().strip()
        set_head(f"refs/heads/{target}")
    else:
        # Assume raw commit SHA
        commit_id = target
        set_head(commit_id)  # detached HEAD

    # Clear working tree (dangerous! for demo purposes we just overwrite)
    typ, data = read_object(commit_id)
    assert typ == "commit"
    tree_id = None
    for line in data.decode().splitlines():
        if line.startswith("tree "):
            tree_id = line.split()[1]
            break
    if not tree_id:
        raise RuntimeError("Commit missing tree")
    # Recursively write files (only root for simplicity)
    typ, tree_data = read_object(tree_id)
    assert typ == "tree"
    i = 0
    while i < len(tree_data):
        # Parse mode
        space = tree_data.find(b' ', i)
        mode = tree_data[i:space].decode()
        i = space + 1
        # Parse name
        nul = tree_data.find(b'\x00', i)
        name = tree_data[i:nul].decode()
        i = nul + 1
        # Parse SHA
        sha = tree_data[i:i+20].hex()
        i += 20
        # Write file
        typ2, content = read_object(sha)
        if typ2 != "blob":
            continue  # ignore sub‑trees for now
        pathlib.Path(name).write_bytes(content)
    print(f"Checked out {target}")


# ----------------------------------------------------------------------
# CLI entry point
# ----------------------------------------------------------------------
def main() -> None:
    parser = argparse.ArgumentParser(prog="mygit", description="A minimal Git clone")
    sub = parser.add_subparsers(dest="cmd", required=True)

    sub.add_parser("init", help="Create an empty repository")
    add = sub.add_parser("add", help="Add file contents to the index")
    add.add_argument("paths", nargs="+", help="Files to add")

    commit = sub.add_parser("commit", help="Record changes to the repository")
    commit.add_argument("-m", "--message", required=True, help="Commit message")

    sub.add_parser("log", help="Show commit logs")
    branch = sub.add_parser("branch", help="List or create branches")
    branch.add_argument("name", nargs="?", help="Branch name to create")

    checkout = sub.add_parser("checkout", help="Switch branches or restore working tree files")
    checkout.add_argument("target", help="Branch name or commit SHA")

    args = parser.parse_args()
    # Dispatch
    if args.cmd == "init":
        cmd_init(args)
    elif args.cmd == "add":
        cmd_add(args)
    elif args.cmd == "commit":
        cmd_commit(args)
    elif args.cmd == "log":
        cmd_log(args)
    elif args.cmd == "branch":
        cmd_branch(args)
    elif args.cmd == "checkout":
        cmd_checkout(args)
    else:
        parser.error("unknown command")


if __name__ == "__main__":
    main()

Explanation of the Core Flow

  1. init creates the .mygit directory hierarchy.
  2. add reads each file, creates a blob object, and stores a JSON entry in the index.
  3. commit builds a tree from the index (the demo only supports a flat root tree for brevity), creates a commit object linking to the tree and optional parent, then updates the current branch reference.
  4. log walks parent links, printing commit metadata.
  5. checkout resolves a branch name or raw SHA, updates HEAD, then extracts the root tree back to the working directory.
  6. branch lists existing heads or creates a new one pointing at HEAD.

The prototype deliberately omits many production features (nested directories, proper binary index format, packed objects, network transport). Yet it demonstrates the exact same data flow Git uses internally.


The Index (Staging Area) Explained

The real Git index (.git/index) is a binary file with a complex layout designed for speed:

  • Header – magic number, version, entry count.
  • Entry list – each entry stores ctime, mtime, dev, ino, mode, uid, gid, size, sha1, flags, and the path.
  • Extended cache – optional entries for rename detection, submodule handling, etc.

Our simplified JSON index sacrifices performance for readability, but the underlying idea is identical: the index records the exact object ID each path should have in the next commit. When you run git commit, Git simply serialises the index into a tree object.

If you wish to extend the prototype, you can replace the JSON file with the official format using the struct module and the exact layout described in Documentation/technical/index-format.txt of the Git source tree.


References, Heads, and Tags

Symbolic vs. Direct References

  • Symbolic refs (ref: refs/heads/master) point to another ref. HEAD is normally symbolic.
  • Direct refs store a raw SHA‑1. When you detach HEAD (git checkout <sha>), HEAD becomes direct.

Both are stored as plain text files, which makes reference updates atomic (a simple write+rename).

Tag Objects

Adding a lightweight tag in our prototype would be as easy as:

def create_lightweight_tag(name: str, target: str) -> None:
    update_ref(f"refs/tags/{name}", target)

An annotated tag would require constructing a tag object (type tag) with a message and optionally a PGP signature. The object would then be stored and the ref would point at its SHA‑1.


Packing Objects and Performance Considerations

A real Git repository can contain millions of loose objects. Storing each as a separate file quickly becomes a bottleneck (filesystem limits, inode usage). Git solves this by:

  1. Packfiles (*.pack) – collections of objects compressed together with delta compression (store only differences between similar objects).
  2. Index files (*.idx) – a binary lookup table mapping object IDs to offsets inside the packfile.
  3. Garbage collection (git gc) – rewrites loose objects into packs, removes unreachable objects, and optimises the packfile.

Implementing a packfile writer is non‑trivial but educational. A minimal version would:

  • Sort objects by type and size.
  • Use zlib to compress each object.
  • For each object after the first, try to encode it as a delta against a similar previous object (e.g., using a simple binary diff algorithm).
  • Append a header indicating whether the entry is a full object or a delta.

The delta format used by Git is a custom byte‑code that encodes copy/insert commands. Understanding it is beneficial if you ever need to write a custom transport or a forensic tool.


Extending the Prototype: Diff, Merge, and Remotes

Diff

Git’s diff is a two‑way comparison of two tree objects. A simple implementation can:

  1. Load the two trees.
  2. Build dictionaries mapping file paths → blob IDs.
  3. For each path, compare the blob IDs; if they differ, decompress the blobs and run a line‑based diff (difflib.unified_diff in Python).
import difflib

def diff_trees(tree_a: str, tree_b: str) -> None:
    # Simplified – only root entries
    entries_a = parse_tree(tree_a)
    entries_b = parse_tree(tree_b)
    all_paths = set(entries_a) | set(entries_b)
    for p in sorted(all_paths):
        sha_a = entries_a.get(p)
        sha_b = entries_b.get(p)
        if sha_a == sha_b:
            continue
        content_a = read_blob(sha_a) if sha_a else b""
        content_b = read_blob(sha_b) if sha_b else b""
        for line in difflib.unified_diff(
            content_a.decode().splitlines(),
            content_b.decode().splitlines(),
            fromfile=f"a/{p}",
            tofile=f"b/{p}",
            lineterm="",
        ):
            print(line)

Merge

A three‑way merge uses a base commit (common ancestor) and two heads. The algorithm:

  1. Resolve the three trees.
  2. For each path, compare base vs. ours vs. theirs.
  3. If both sides made the same change, accept it.
  4. If changes conflict, write both versions into the file with conflict markers (<<<<<<<, =======, >>>>>>>).

Implementing a full recursive merge is beyond the scope of this article, but the core logic fits in a few hundred lines of Python.

Remotes (Network Transport)

Git’s network protocol is built on top of packfiles transmitted over HTTP, SSH, or the native Git protocol. A minimal remote implementation can:

  • Use bare repositories (no working tree) on the server side.
  • Transfer objects via simple HTTP GET/POST where the body is a concatenated series of raw objects (application/octet-stream).
  • Store received objects into the local object database.

The official Git HTTP/Smart protocol is documented in Documentation/technical/http-protocol.txt. For experimental purposes, you can start a tiny Flask server that serves objects from its .mygit/objects directory and accepts POSTs to push new objects.


Testing, Debugging, and Safety Nets

When building a version‑control system, data integrity is non‑negotiable. Here are practical steps to keep your prototype reliable:

  1. Unit tests for each object type – verify that write_object + read_object round‑trips correctly.
  2. Property‑based testing (e.g., with hypothesis) – generate random file trees, commit them, then checkout and compare against the original.
  3. Filesystem sandbox – run every test inside a temporary directory (tempfile.TemporaryDirectory) to avoid contaminating real projects.
  4. SHA‑1 verification – after writing an object, recompute its hash and confirm it matches the filename.
  5. Atomic file updates – when updating refs or the index, write to a temporary file then rename (os.replace) to avoid corruption on crashes.
  6. Safety warnings – the checkout command in our prototype naïvely overwrites files. In production you would first verify that the working tree is clean or prompt the user.

Conclusion

Implementing a Git‑like system from scratch is an excellent way to demystify the magic behind one of the most influential tools in modern software development. In this article we:

  • Covered Git’s core architecture—working tree, index, and object database.
  • Explained the four fundamental object types (blob, tree, commit, tag) and how they’re addressed by SHA‑1.
  • Designed a minimal repository layout (.mygit) and built a fully functional command‑line interface in Python.
  • Walked through each command (init, add, commit, log, checkout, branch) and highlighted the underlying data transformations.
  • Discussed the index format, reference handling, packing, and how to extend the prototype with diff, merge, and remote support.
  • Provided practical testing and safety advice to keep your implementation robust.

Even though the prototype is deliberately simplified (flat trees, JSON index, no packing), the same principles scale to the full Git codebase. By reading the official source, experimenting with the mygit tool, and gradually adding features (nested directories, binary index, packfiles, PGP‑signed tags), you can evolve a learning project into a powerful custom VCS or a teaching aid for teams new to version control.

Happy hacking, and may your commits always be fast and your histories clean!


Resources

These resources should give you a solid foundation to explore beyond the minimal implementation and dive into the full capabilities of Git. Happy coding!