Skip to main content

Overview

The commit-graph file is a supplemental data structure that dramatically accelerates commit graph operations. It stores commit relationships and metadata in an optimized format that avoids the need to decompress and parse individual commit objects.
The commit-graph is stored at $GIT_DIR/objects/info/commit-graph or in the commit-graphs/ directory for chains.

Why Commit Graphs?

Git operations that walk the commit graph face two primary performance bottlenecks:
  1. Decompressing and parsing commits - Each commit object must be inflated from zlib format
  2. Walking the entire graph - Topological constraints require extensive traversal

Use Cases

Merge Base

Computing merge bases for status checks and merges can take minutes in large repositories.

History Listing

Filtering and traversing commit history benefits from fast parent lookups.

Graph Visualization

Tools like git log --graph need efficient topological ordering.

Reachability

Determining if one commit can reach another is accelerated by generation numbers.

File Format

Header (8 bytes)

4 bytes - Signature: 'C', 'G', 'P', 'H'
1 byte  - Version number (currently 1)
1 byte  - Hash version:
          1 = SHA-1 (20-byte hashes)
          2 = SHA-256 (32-byte hashes)
1 byte  - Number of chunks (C)
1 byte  - Number of base graphs (B)
import struct

def read_commit_graph_header(filepath):
    with open(filepath, 'rb') as f:
        signature = f.read(4)
        if signature != b'CGPH':
            raise ValueError(f"Invalid signature: {signature}")
        
        version = struct.unpack('B', f.read(1))[0]
        hash_version = struct.unpack('B', f.read(1))[0]
        num_chunks = struct.unpack('B', f.read(1))[0]
        num_base_graphs = struct.unpack('B', f.read(1))[0]
        
        hash_len = 20 if hash_version == 1 else 32
        
        return {
            'version': version,
            'hash_version': hash_version,
            'hash_len': hash_len,
            'num_chunks': num_chunks,
            'num_base_graphs': num_base_graphs
        }

Chunk Lookup Table

Following the header, a table of contents lists all chunks:
For each of (C + 1) entries:
  4 bytes - Chunk ID (0x00000000 terminates)
  8 bytes - Offset from file start
Chunks are stored contiguously, so chunk length can be inferred from the next chunk’s offset.

Chunk Types

OID Fanout (OIDF) - Required

Size: 256 × 4 bytes = 1024 bytes The fanout table enables binary search by first byte:
F[i] = count of commits with first byte ≤ i
F[255] = total number of commits (N)
def find_commit_position(fanout, oid_first_byte):
    if oid_first_byte == 0:
        start = 0
    else:
        start = fanout[oid_first_byte - 1]
    end = fanout[oid_first_byte]
    
    return (start, end)  # Search range for binary search

OID Lookup (OIDL) - Required

Size: N × H bytes (H = 20 for SHA-1, 32 for SHA-256) Commit OIDs stored in lexicographic order:
Commit 0: <20 or 32 byte OID>
Commit 1: <20 or 32 byte OID>
...
Commit N-1: <20 or 32 byte OID>

Commit Data (CDAT) - Required

Size: N × (H + 16) bytes For each commit:
H bytes  - Root tree OID
8 bytes  - Parent data:
           [0-3]: First parent position (0x70000000 if none)
           [4-7]: Second parent position OR
                  MSB=1 + offset into Extra Edge List
8 bytes  - Generation and timestamp:
           [0-3]: Bits 31-2 = topological level (gen v1)
                  Bits 1-0 = commit time bits 33-34
           [4-7]: Commit time bits 31-0 (seconds since epoch)
  • 0x70000000 = No parent (root commit)
  • 0x00000000 - 0x6FFFFFFF = Direct position reference
  • 0x80000000+ = MSB set, remaining bits point to Extra Edge List
This encoding supports up to ~1.8 billion commits per graph.

Generation Data (GDA2) - Optional

Size: N × 4 bytes Stores corrected commit date offsets (generation number v2):
For each commit:
  If offset fits in 31 bits:
    4 bytes - Offset value (MSB = 0)
  Otherwise:
    4 bytes - MSB=1, remaining bits = index into GDO2 chunk
The chunk ID includes “2” because previous versions (“GDAT”) had erroneous data. Modern Git ignores old chunks.

Generation Data Overflow (GDO2) - Optional

Size: Variable (M × 8 bytes, where M = number of overflows) Stores 64-bit corrected commit dates that don’t fit in 31 bits:
8 bytes - Full corrected commit date offset
8 bytes - Next overflow value
...

Extra Edge List (EDGE) - Optional

Stores additional parents for octopus merges (>2 parents):
4 bytes - Third parent position
4 bytes - Fourth parent position (MSB set if last)
4 bytes - Fifth parent position (MSB set if last)
...
The last parent entry has MSB=1 to signal termination.

Bloom Filter Index (BIDX) - Optional

Size: N × 4 bytes
BIDX[i] = cumulative byte count of Bloom filters 0 through i
Bloom filter for commit i spans bytes BIDX[i-1] to BIDX[i].

Bloom Filter Data (BDAT) - Optional

Header:
4 bytes - Hash algorithm version (2 = murmur3)
4 bytes - Number of hash functions  
4 bytes - Bits per entry
Data: Concatenated Bloom filters for each commit, storing changed paths.
Commits with no changes or >512 changes have length-1 filters (all zeros or all ones).

Base Graphs List (BASE) - Optional

Size: B × H bytes Hashes of commit-graph files in the chain (for split graphs):
Hash of base layer 0
Hash of base layer 1
...
Hash of base layer B-1

Trailer

H bytes - Hash checksum of all preceding data

Generation Numbers

Git supports two generation number schemes:

Generation Number v1 (Topological Level)

Defined recursively:
  • Root commits: level = 1
  • Other commits: level = 1 + max(parent levels)
def compute_topological_level(commit):
    if not commit.parents:
        return 1
    return 1 + max(parent.level for parent in commit.parents)

Generation Number v2 (Corrected Commit Date)

More accurate than topological level:
  • Root commits: corrected date = commit date (or 1 if date is 0)
  • Other commits: corrected date = max(commit date, 1 + max(parent corrected dates))
def compute_corrected_commit_date(commit):
    if not commit.parents:
        return max(commit.date, 1)
    
    max_parent_date = max(parent.corrected_date for parent in commit.parents)
    return max(commit.date, max_parent_date + 1)
Key Invariant: If commits A and B have generation numbers N and M where N ≤ M, then A cannot reach B.This enables early termination in graph walks without exploring unreachable commits.

Commit Graph Chains

For incrementally growing repositories, Git uses chained commit-graph files:

File Layout

.git/objects/info/commit-graphs/
├── commit-graph-chain          # Plain text list of hashes
├── graph-{hash0}.graph         # Base layer (oldest)
├── graph-{hash1}.graph         # Layer 1  
└── graph-{hash2}.graph         # Top layer (newest)
commit-graph-chain contents:
{hash0}
{hash1}
{hash2}

Position Calculation

Given:
  • X0 commits in base layer
  • X1 commits in layer 1
  • X2 commits in layer 2
Commit at position i in layer 2 has global position: X0 + X1 + i
class CommitGraphChain:
    def __init__(self, layers):
        self.layers = layers
        self.offsets = [0]
        for layer in layers:
            self.offsets.append(self.offsets[-1] + layer.num_commits)
    
    def find_commit(self, position):
        for i, offset in enumerate(self.offsets[1:]):
            if position < offset:
                local_pos = position - self.offsets[i]
                return self.layers[i].get_commit(local_pos)

Merge Strategy

New layers are merged to maintain logarithmic chain length: Condition 1: If commits in layer N < X × commits in layer N+1 (default X=2) Condition 2: If commits in layer N+1 > C (default C=64,000) When triggered, layers are consolidated:
Before:                      After:
┌──────────────┐            
│ New commits  │            ┌─────────────────┐
├──────────────┤            │  Merged layer   │
│ graph-hash2  │     →      │  graph-hash3    │
├──────────────┤            └─────────────────┘
│ graph-hash1  │                    │
├──────────────┤            ┌──────────────────┐
│ graph-hash0  │            │  graph-hash0     │
└──────────────┘            └──────────────────┘
This strategy bounds the chain length to O(log N) where N is total commits.

Mixed Generation Chains

If an old Git writes a new layer without generation v2:
Layer 2 (new, topological only)  ← Git falls back to v1 for all layers
Layer 1 (old, has corrected dates)
Layer 0 (old, has corrected dates)
When the topmost layer lacks generation v2 data, Git uses only topological levels for the entire chain to maintain consistency.

Working with Commit Graphs

Verifying a Commit Graph

import hashlib
import struct

def verify_commit_graph(filepath):
    with open(filepath, 'rb') as f:
        data = f.read()
        
    # Extract hash length from header
    hash_version = data[5]
    hash_len = 20 if hash_version == 1 else 32
    
    # Split content and checksum
    content = data[:-hash_len]
    stored_hash = data[-hash_len:]
    
    # Compute checksum
    if hash_version == 1:
        computed = hashlib.sha1(content).digest()
    else:
        computed = hashlib.sha256(content).digest()
    
    if computed == stored_hash:
        print("✓ Commit graph checksum valid")
        return True
    else:
        print("✗ Checksum mismatch")
        return False

Performance Impact

git log --graph

50-100x faster in repositories with millions of commits

git merge-base

100-200x faster by using generation numbers to prune search

git status

10-20x faster when checking ahead/behind relationship

Configuration

# Enable commit-graph globally
git config --global core.commitGraph true

# Write commit-graph after fetch
git config --global fetch.writeCommitGraph true

# Generate commit-graph with Bloom filters
git commit-graph write --reachable --changed-paths

# Write split commit-graph
git commit-graph write --split --reachable

Limitations

  • Grafts and replace objects: Commit-graph is disabled when these are present
  • Shallow clones: Generation numbers become invalid when commits are unshallowed
  • Maximum commits: Limited to ~1.8 billion commits per graph