
Overview

The Git index (also called the staging area or cache) is stored as a binary file at $GIT_DIR/index. It serves as a critical data structure that tracks the state of files in the working directory and acts as a bridge between the working tree and the repository’s object database.
All binary numbers in the index are stored in network byte order (big-endian), and the trailing checksum is computed with the repository’s configured hash algorithm (SHA-1 or SHA-256).

File Structure

The index file consists of three main sections:
  1. Header - Contains signature and metadata
  2. Index Entries - Sorted list of tracked files
  3. Extensions - Optional data structures for performance optimization

Header Format

The index begins with a 12-byte header:
4 bytes - Signature: 'D', 'I', 'R', 'C' ("dircache")
4 bytes - Version number (2, 3, or 4)
4 bytes - Number of index entries
# Reading index header
import struct

with open('.git/index', 'rb') as f:
    signature = f.read(4)
    if signature != b'DIRC':
        raise ValueError(f"not a Git index file: {signature!r}")
    # Version and entry count are 32-bit big-endian integers
    version, num_entries = struct.unpack('>II', f.read(8))

    print(f"Signature: {signature.decode()}")
    print(f"Version: {version}")
    print(f"Entries: {num_entries}")

Index Entry Format

Each index entry represents a file and contains metadata from stat(2) plus Git-specific information:

Entry Structure

32-bit ctime seconds
32-bit ctime nanosecond fractions
32-bit mtime seconds  
32-bit mtime nanosecond fractions
32-bit dev
32-bit ino
32-bit mode
32-bit uid
32-bit gid
32-bit file size (truncated)
Object ID (20 or 32 bytes depending on hash)
16-bit flags
Variable-length path name
1-8 NUL bytes for padding (versions 2-3)

Mode Field Breakdown

The 32-bit mode field is structured as:
16 bits - Unused (must be zero)
4 bits  - Object type:
          1000 (0x8) = Regular file
          1010 (0xA) = Symbolic link  
          1110 (0xE) = Gitlink (submodule)
3 bits  - Unused (must be zero)
9 bits  - Unix permissions (only 0755 and 0644 valid for regular files)
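
The bit layout above can be unpacked with a few shifts and masks. The following is a minimal sketch; the function name decode_mode and the OBJECT_TYPES table are illustrative names, not part of Git.

```python
# Object-type values taken from the layout above (illustrative table name)
OBJECT_TYPES = {0b1000: 'regular file', 0b1010: 'symlink', 0b1110: 'gitlink'}

def decode_mode(mode):
    """Split a 32-bit index mode into (object type name, permission bits)."""
    if mode >> 16:
        raise ValueError('upper 16 bits must be zero')
    if (mode >> 9) & 0x7:
        raise ValueError('the 3 unused bits must be zero')
    obj_type = (mode >> 12) & 0xF   # 4-bit object type
    perms = mode & 0o777            # 9-bit Unix permissions
    return OBJECT_TYPES.get(obj_type, 'unknown'), perms

kind, perms = decode_mode(0o100644)
print(kind, oct(perms))  # regular file 0o644
```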

Flags Field

The 16-bit flags field contains:
1 bit  - assume-valid flag
1 bit  - extended flag (must be 0 in version 2)
2 bits - stage (0 = normal, 1-3 = merge conflicts)
12 bits - name length (0xFFF if length >= 0xFFF)
Version 3+: When the extended flag is set, an additional 16-bit field follows with skip-worktree and intent-to-add flags.
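
As a rough illustration of the layout above, the two flag fields can be decoded as follows. The function names are hypothetical, and the bit positions of the extended field (one reserved bit, then skip-worktree, then intent-to-add) are an assumption based on the upstream index format description rather than on this page.

```python
def decode_flags(flags):
    """Decode the mandatory 16-bit flags field of an index entry."""
    return {
        'assume_valid': bool(flags & 0x8000),
        'extended': bool(flags & 0x4000),
        'stage': (flags >> 12) & 0x3,
        'name_length': flags & 0x0FFF,  # 0xFFF means "0xFFF or longer"
    }

def decode_extended_flags(ext_flags):
    """Decode the extra 16-bit field present when the extended flag is set."""
    return {
        'skip_worktree': bool(ext_flags & 0x4000),
        'intent_to_add': bool(ext_flags & 0x2000),
    }

print(decode_flags(0x200A))
```

Because the name length saturates at 0xFFF, a reader cannot rely on it alone for long paths and has to scan for the terminating NUL.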

Version 4 Optimizations

Version 4 introduces two key optimizations:

Path Compression

Path names are prefix-compressed relative to the previous entry:
Entry 1: "src/main/app.js"
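Entry 2: [N=6] "utils/helper.js"  → "src/main/utils/helper.js"
The integer N indicates how many bytes to remove from the end of the previous path before appending the new suffix. Here N=6 removes "app.js", leaving "src/main/". N is stored as a variable-width integer and is followed by the NUL-terminated suffix.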
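
The prefix compression described above can be sketched in both directions. The helper names compress_path and expand_path are hypothetical; the sketch works on byte strings and leaves out the variable-width encoding of N.

```python
def compress_path(prev_path, path):
    """Return (strip_count, suffix) for path relative to prev_path."""
    common = 0
    limit = min(len(prev_path), len(path))
    while common < limit and prev_path[common] == path[common]:
        common += 1
    return len(prev_path) - common, path[common:]

def expand_path(prev_path, strip_count, suffix):
    """Rebuild a full path from the previous path, N, and the suffix."""
    if strip_count > len(prev_path):
        raise ValueError('strip count exceeds previous path length')
    return prev_path[:len(prev_path) - strip_count] + suffix

prev = b'src/main/app.js'
n, suffix = compress_path(prev, b'src/main/utils/helper.js')
print(n, suffix)                     # 6 b'utils/helper.js'
print(expand_path(prev, n, suffix))  # b'src/main/utils/helper.js'
```

Removing the six bytes of "app.js" leaves the shared prefix "src/main/", to which the suffix is appended.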

No Padding

Unlike versions 2-3, version 4 does not pad entries to 8-byte boundaries, resulting in smaller index files.

Index Extensions

Extensions enable additional functionality without breaking compatibility:

Extension Format

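4 bytes - Extension signature (if the first byte is 'A'-'Z', the extension is optional and may be ignored by readers that do not understand it)
4 bytes - Extension size (32-bit length of the extension data)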
N bytes - Extension data
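
A reader can walk the extension area using only the signature and size fields. The sketch below assumes the whole index file is already in memory as data and that the caller knows the byte offset where the entries end. The name iter_extensions and the hash_len parameter are illustrative.

```python
import struct

def iter_extensions(data, offset, hash_len=20):
    """Yield (signature, optional, payload) for each extension from offset."""
    end = len(data) - hash_len  # the trailing checksum is not an extension
    while offset + 8 <= end:
        signature = data[offset:offset + 4]
        (size,) = struct.unpack_from('>I', data, offset + 4)
        payload = data[offset + 8:offset + 8 + size]
        # An uppercase first byte marks an extension that may be skipped
        optional = ord('A') <= signature[0] <= ord('Z')
        yield signature, optional, payload
        offset += 8 + size

# Synthetic example: two extensions followed by a 20-byte checksum placeholder
blob = b'TREE' + struct.pack('>I', 3) + b'abc' + b'link' + struct.pack('>I', 0) + bytes(20)
for sig, optional, payload in iter_extensions(blob, 0):
    print(sig, optional, len(payload))
```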

Cache Tree (TREE)

Stores pre-computed tree objects for unchanged directories:
NUL-terminated path component
ASCII decimal entry count
Space (0x20)
ASCII decimal subtree count  
Newline (0x0A)
Object ID (present only when the entry is valid; an entry count of -1 marks the entry as invalid, and no object ID follows it)
  • Speeds up git commit by reusing existing tree objects
  • Improves git status performance when comparing against HEAD
  • Reduces object database writes for incremental commits
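
The record layout above can be parsed sequentially. This is a minimal sketch that assumes payload holds the raw TREE extension data and returns a flat list; it does not rebuild the directory hierarchy from the subtree counts. The name parse_cache_tree is hypothetical.

```python
def parse_cache_tree(payload, hash_len=20):
    """Parse TREE extension data into (path, entry_count, subtree_count, oid) tuples."""
    entries = []
    pos = 0
    while pos < len(payload):
        nul = payload.index(b'\x00', pos)
        path = payload[pos:nul]
        newline = payload.index(b'\n', nul)
        entry_count, subtree_count = (int(n) for n in payload[nul + 1:newline].split(b' '))
        pos = newline + 1
        oid = None
        if entry_count >= 0:
            # Only valid entries carry an object ID
            oid = payload[pos:pos + hash_len].hex()
            pos += hash_len
        entries.append((path, entry_count, subtree_count, oid))
    return entries

# Synthetic example: a valid root entry followed by an invalidated "src" entry
sample = b'\x00' + b'2 1\n' + bytes(range(20)) + b'src\x00' + b'-1 0\n'
for entry in parse_cache_tree(sample):
    print(entry)
```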

Resolve Undo (REUC)

Preserves pre-resolution merge conflict state:
For each conflict:
  - NUL-terminated pathname
  - Three NUL-terminated octal mode strings (stages 1-3)
  - Up to three object IDs (missing stages omitted)
Enables git checkout -m to recreate conflicts.

Split Index (link)

Shares most index data via a base index file:
Hash of shared index (stored at .git/sharedindex.<hash>)
EWAH-encoded delete bitmap
EWAH-encoded replace bitmap
Replacement entries
Added entries

Untracked Cache (UNTR)

Caches untracked file information:
Environment validation strings
Stat data for $GIT_DIR/info/exclude
Stat data for core.excludesFile
32-bit dir_flags
Hash of exclude files
Directory tree structure with untracked entries
The untracked cache is invalidated if exclude files or environment variables change.

File System Monitor (FSMN)

Integrates with filesystem watching tools:
Version 1:
32-bit version (1)
64-bit nanoseconds since epoch
32-bit bitmap size
EWAH bitmap of non-valid entries
Version 2:
32-bit version (2)
NUL-terminated opaque token
32-bit bitmap size  
EWAH bitmap of non-valid entries
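
The two layouts differ only in how the token is stored, so one parser can handle both. In this sketch the function name parse_fsmonitor is hypothetical, and the EWAH bitmap is returned as raw bytes rather than decoded.

```python
import struct

def parse_fsmonitor(payload):
    """Return (version, token, raw_bitmap) from FSMN extension data."""
    (version,) = struct.unpack_from('>I', payload, 0)
    pos = 4
    if version == 1:
        # 64-bit nanoseconds since the epoch
        (token,) = struct.unpack_from('>Q', payload, pos)
        pos += 8
    elif version == 2:
        # NUL-terminated opaque token
        nul = payload.index(b'\x00', pos)
        token = payload[pos:nul]
        pos = nul + 1
    else:
        raise ValueError(f'unsupported FSMN version {version}')
    (bitmap_size,) = struct.unpack_from('>I', payload, pos)
    pos += 4
    return version, token, payload[pos:pos + bitmap_size]

v1 = struct.pack('>IQI', 1, 123, 2) + b'\xaa\xbb'
v2 = struct.pack('>I', 2) + b'tok\x00' + struct.pack('>I', 1) + b'\x01'
print(parse_fsmonitor(v1))
print(parse_fsmonitor(v2))
```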

End of Index Entry (EOIE)

Enables fast extension location:
32-bit offset to end of index entries
Hash over extension types and sizes
EOIE must be written last, immediately before the trailing checksum, so that a reader can find it at a fixed distance from the end of the file before parsing any entries.
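
Because the extension has a fixed size, it can be looked up by counting back from the end of the file. The sketch below assumes the whole file is in memory and does not verify the hash stored inside the extension. The name find_end_of_entries is illustrative.

```python
import struct

def find_end_of_entries(data, hash_len=20):
    """Return the offset where index entries end, or None if there is no EOIE."""
    eoie_len = 4 + 4 + 4 + hash_len  # signature, size, offset, hash
    start = len(data) - hash_len - eoie_len
    if start < 12 or data[start:start + 4] != b'EOIE':
        return None
    (size,) = struct.unpack_from('>I', data, start + 4)
    if size != 4 + hash_len:
        return None
    (offset,) = struct.unpack_from('>I', data, start + 8)
    return offset

# Synthetic file: 12-byte header, no entries, EOIE, then a checksum placeholder
fake = bytes(12) + b'EOIE' + struct.pack('>II', 24, 12) + bytes(20) + bytes(20)
print(find_end_of_entries(fake))  # 12
```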

Index Entry Offset Table (IEOT)

Enables multi-threaded index loading:
32-bit version (1)
For each block:
  - 32-bit offset from file start
  - 32-bit count of entries in block
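
A loader could turn the table into a list of independent work items, one per block. This is a sketch with a hypothetical function name, assuming payload is the raw IEOT extension data.

```python
import struct

def parse_ieot(payload):
    """Return a list of (offset, entry_count) blocks from IEOT extension data."""
    (version,) = struct.unpack_from('>I', payload, 0)
    if version != 1:
        raise ValueError(f'unsupported IEOT version {version}')
    body = payload[4:]
    if len(body) % 8:
        raise ValueError('truncated IEOT block table')
    # Each block is a pair of 32-bit big-endian integers
    return list(struct.iter_unpack('>II', body))

table = struct.pack('>IIIII', 1, 12, 100, 7212, 50)
print(parse_ieot(table))  # [(12, 100), (7212, 50)]
```

Each (offset, count) pair could then be handed to a separate worker that seeks to the offset and parses that many entries.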

Sparse Directory Entries

When using sparse-checkout in cone mode with the sparse index enabled (index.sparse), an entire directory can be represented by a single entry:
Mode: 040000 (directory)
Flags: SKIP_WORKTREE bit set
Path: Ends with directory separator '/'
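In index format versions 4 and earlier, an index that contains sparse directory entries includes the sdir extension. Its lowercase signature marks it as required, so readers that do not understand sparse directory entries refuse to load the index.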

Checksum

The index file ends with a hash checksum:
20 bytes (SHA-1) or 32 bytes (SHA-256)
The checksum covers all content before it, ensuring index integrity.
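
Verifying the trailer amounts to hashing everything before it and comparing. The following sketch uses hashlib and an illustrative function name. It treats any mismatch as a failure, so an index written with checksum computation disabled (which, as far as I know, leaves an all-zero trailer) would be reported as invalid here.

```python
import hashlib

def verify_index_checksum(data, algorithm='sha1'):
    """Return True if the trailing hash matches the content before it."""
    digest = hashlib.new(algorithm)
    hash_len = digest.digest_size  # 20 for sha1, 32 for sha256
    if len(data) <= hash_len:
        return False
    body, stored = data[:-hash_len], data[-hash_len:]
    digest.update(body)
    return digest.digest() == stored

# Synthetic example: a 12-byte header with no entries plus its SHA-1
body = b'DIRC' + (2).to_bytes(4, 'big') + (0).to_bytes(4, 'big')
good = body + hashlib.sha1(body).digest()
print(verify_index_checksum(good))             # True
print(verify_index_checksum(body + bytes(20))) # False
```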

Sorting and Ordering

Index entries are sorted by:
  1. Primary: Path name as unsigned bytes (memcmp order)
  2. Secondary: Stage field (for merge conflicts)
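#include <string.h>

/* Compare two index entries: path bytes first, then path length, then stage. */
int index_name_cmp(const char *name1, int len1,
                   const char *name2, int len2,
                   int stage1, int stage2)
{
    /* memcmp compares as unsigned bytes over the common prefix */
    int cmp = memcmp(name1, name2, len1 < len2 ? len1 : len2);
    if (cmp)
        return cmp;
    /* A path that is a prefix of another sorts first */
    if (len1 < len2)
        return -1;
    if (len1 > len2)
        return 1;
    return stage1 - stage2;
}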
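
The same ordering can be reproduced in Python, where byte strings already compare as unsigned bytes with a shorter prefix sorting first. This sketch assumes entries are held as (path, stage) tuples, which is an illustrative representation.

```python
def index_sort_key(entry):
    """Sort key matching the index order: path bytes, then stage."""
    path, stage = entry
    return (path, stage)

entries = [
    (b'b.txt', 0),
    (b'a/b', 0),
    (b'a.txt', 2),
    (b'a.txt', 1),
    (b'A.txt', 0),
]
ordered = sorted(entries, key=index_sort_key)
for path, stage in ordered:
    print(path, stage)
```

Note that "a.txt" sorts before "a/b" because '.' (0x2E) is smaller than '/' (0x2F); the directory separator gets no special treatment.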

Working with the Index

Reading Index Entries

import struct

# Ten 32-bit stat fields, a 20-byte SHA-1 object ID, and the 16-bit flags
ENTRY_FMT = '>10I20sH'
ENTRY_LEN = struct.calcsize(ENTRY_FMT)  # 62 bytes

def read_cstring(f):
    # Read bytes up to (and consuming) the terminating NUL
    out = bytearray()
    while True:
        byte = f.read(1)
        if not byte:
            raise EOFError('truncated index entry')
        if byte == b'\x00':
            return bytes(out)
        out += byte

def read_varint(f):
    # Variable-width integer used for the version 4 prefix-strip count
    byte = f.read(1)[0]
    value = byte & 0x7F
    while byte & 0x80:
        byte = f.read(1)[0]
        value = ((value + 1) << 7) | (byte & 0x7F)
    return value

def read_index_entry(f, version, prev_path=''):
    # prev_path is the previous entry's path; only version 4 needs it
    (ctime_s, ctime_n, mtime_s, mtime_n, dev, ino,
     mode, uid, gid, size, oid, flags) = struct.unpack(ENTRY_FMT, f.read(ENTRY_LEN))
    fixed_len = ENTRY_LEN

    # Version 3+: the extended flag (0x4000) adds a second 16-bit flags field
    if flags & 0x4000:
        f.read(2)
        fixed_len += 2

    if version == 4:
        # Strip N bytes from the previous path, then append the NUL-terminated suffix
        strip = read_varint(f)
        prev = prev_path.encode('utf-8', 'surrogateescape')
        path_bytes = prev[:len(prev) - strip] + read_cstring(f)
    else:
        # Version 2/3: NUL-terminated path, then padding to an 8-byte boundary
        path_bytes = read_cstring(f)
        # 1-8 NULs follow the path; one was already consumed as the terminator
        entry_len = fixed_len + len(path_bytes)
        f.read(8 - entry_len % 8 - 1)

    return {
        'path': path_bytes.decode('utf-8', 'surrogateescape'),
        'oid': oid.hex(),
        'mode': mode,
        'size': size,
        'mtime': (mtime_s, mtime_n),
    }

Performance Considerations

Version 4

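Use version 4 for large repositories to reduce index size through path compression and the absence of padding. The version can be selected with the index.version configuration variable or with git update-index --index-version 4.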

Split Index

Enable split index mode (core.splitIndex) for large indexes that change frequently, so that small updates do not rewrite the entire index file.

Untracked Cache

Enable the untracked cache (core.untrackedCache) to speed up git status by skipping directories whose stat data has not changed since the last scan.

FS Monitor

Integrate with a filesystem monitor (core.fsmonitor) so that Git is told which paths changed instead of checking every file in a very large working tree.