
Common Non-Hermetic Patterns

Identify the hidden dependencies and anti-patterns that break hermeticity in typical build systems.

Understand the pitfalls that break build reproducibility with free flashcards and interactive examples. This lesson covers hidden dependencies, non-deterministic behavior, timestamp coupling, and network access: common patterns that compromise hermetic builds, and how to identify them in your build systems.


Welcome 🎯

Welcome to one of the most practical lessons in hermetic builds! While understanding the theory is important, recognizing anti-patterns in the wild is what separates build engineers who struggle with "works on my machine" problems from those who create truly reproducible systems.

Think of non-hermetic patterns as build system landmines 💣. They might not explode immediately (your build could work perfectly for weeks or months), but eventually something changes (a new developer joins, the build server gets upgraded, or you try to reproduce an old release), and suddenly everything breaks.

In this lesson, you'll learn to spot these landmines before they detonate. We'll examine real-world examples, understand why each pattern is problematic, and explore how to refactor them into hermetic alternatives.


Core Concepts: The Four Horsemen of Non-Hermetic Builds 🐴

Non-hermetic patterns fall into four main categories, each representing a different way builds become environment-dependent instead of input-dependent:

1. Hidden Dependencies 📦🔍

Hidden dependencies are inputs your build relies on that aren't explicitly declared. This is perhaps the most common anti-pattern.

Classic examples:

Pattern | What It Does | Why It Breaks
npm install without lockfile | Fetches latest compatible versions | Dependency versions change over time
Reading ~/.config files | Uses user-specific configuration | Different users get different results
Importing system Python packages | Uses globally installed libraries | System packages vary across machines
Relying on $PATH tools | Uses whatever version is found first | Tool versions differ by environment

💡 Tip: If you can't reproduce a build by only providing the source code and declared dependencies, you have hidden dependencies.

Deep dive example:

## ❌ NON-HERMETIC: Hidden dependency on system tools
import subprocess

def compress_assets():
    # Relies on whatever 'convert' is in PATH
    subprocess.run(['convert', 'logo.png', '-resize', '50%', 'logo-small.png'])
    # ImageMagick version affects output!

This code assumes convert (ImageMagick) is installed, but:

  • Different versions produce different output files
  • Tool might not exist on some systems
  • No way to know which version was used for a given build

Hermetic alternative:

## ✅ HERMETIC: Explicit dependency with version pinning
## BUILD file declares: py_binary(
##   deps = ["@imagemagick_7.1.0//..."]
## )

import subprocess
import os

def compress_assets(imagemagick_path):
    # Use explicitly versioned tool from declared dependency
    convert_bin = os.path.join(imagemagick_path, 'convert')
    subprocess.run([convert_bin, 'logo.png', '-resize', '50%', 'logo-small.png'])

2. Non-Deterministic Behavior 🎲

Non-deterministic builds produce different outputs even with identical inputs. This violates the fundamental principle of hermeticity.

Common sources of non-determinism:

┌─────────────────────────────────────────────────┐
│        SOURCES OF NON-DETERMINISM               │
├─────────────────────────────────────────────────┤
│                                                 │
│  🕐 Timestamps/Dates  →  Time-based values      │
│                          Current date/time      │
│                                                 │
│  🔀 Unordered Maps    →  Iteration order        │
│                          Hash table traversal   │
│                                                 │
│  🎰 Random Numbers    →  UUID generation        │
│                          Crypto without seed    │
│                                                 │
│  🧵 Race Conditions   →  Parallel execution     │
│                          Thread scheduling      │
│                                                 │
│  🌐 Network Responses →  API calls              │
│                          Download timing        │
│                                                 │
└─────────────────────────────────────────────────┘

Example: Hash map iteration

// ❌ NON-HERMETIC: Output order follows however the object was built
const fs = require('fs');

const dependencies = {
  'react': '18.2.0',
  'lodash': '4.17.21',
  'axios': '1.3.4'
};

let output = '';
for (const [pkg, version] of Object.entries(dependencies)) {
  output += `${pkg}@${version}\n`;
}
fs.writeFileSync('deps.txt', output);
// String keys iterate in insertion order, so the file changes whenever
// the keys are added in a different order (e.g., merged from several sources)

Hermetic fix:

// ✅ HERMETIC: Deterministic output with sorted keys
const fs = require('fs');

const dependencies = {
  'react': '18.2.0',
  'lodash': '4.17.21',
  'axios': '1.3.4'
};

const sortedKeys = Object.keys(dependencies).sort();
let output = '';
for (const pkg of sortedKeys) {
  output += `${pkg}@${dependencies[pkg]}\n`;
}
fs.writeFileSync('deps.txt', output);
// Always: axios, lodash, react
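
The same ordering concern applies to the race-condition row in the list of non-determinism sources. Below is a minimal Python sketch, assuming a hypothetical `compile_unit` step whose completion order varies: collecting results with `executor.map` over sorted inputs keeps the output order independent of thread scheduling. The function names are illustrative, not part of any real build tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compile_unit(name: str) -> str:
    """Hypothetical build step; the sleep makes completion order vary."""
    time.sleep(0.01 * (len(name) % 3))
    return f"{name}.o"


def build_all(units: list[str]) -> list[str]:
    # executor.map yields results in input order, regardless of which
    # worker finishes first, so the resulting manifest order is stable.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(compile_unit, sorted(units)))


print(build_all(["net", "core", "ui"]))  # ['core.o', 'net.o', 'ui.o']
```

Collecting results with `as_completed` instead would return them in finish order, which is exactly the scheduling dependence to avoid when the order ends up in an artifact.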

3. Timestamp Coupling ⏰

Timestamp coupling occurs when builds embed or depend on current time values. This seems harmless but creates subtle reproducibility issues.

Why timestamps break hermeticity:

  1. Rebuild detection misfires: Outputs that embed the current time always look changed, so everything downstream rebuilds
  2. Binary comparison impossible: Two identical builds differ only by timestamp
  3. Debugging nightmares: Can't reproduce exact output from historical commit

Common timestamp anti-patterns:

import os
import time
from datetime import datetime

## ❌ Embedding build date in binary
BUILD_DATE = datetime.now().strftime('%Y-%m-%d')
version_string = f"v1.2.3 (built {BUILD_DATE})"

## ❌ Using timestamps in filenames
output_file = f"report_{int(time.time())}.pdf"

## ❌ Timestamp-based cache keys
## (source_hash is assumed to be computed elsewhere)
cache_key = f"{source_hash}_{os.path.getmtime('config.yml')}"

Hermetic alternatives:

## ✅ Use VCS commit information instead
import subprocess

def get_build_info():
    commit = subprocess.check_output(
        ['git', 'rev-parse', '--short', 'HEAD']
    ).decode().strip()
    
    # If you need a date, use commit date (deterministic)
    commit_date = subprocess.check_output(
        ['git', 'show', '-s', '--format=%ci', 'HEAD']
    ).decode().strip()
    
    return f"v1.2.3 (commit {commit}, {commit_date})"

## ✅ Use content hash for filenames
import hashlib

with open('report_data.json', 'rb') as f:
    content_hash = hashlib.sha256(f.read()).hexdigest()[:8]
output_file = f"report_{content_hash}.pdf"

## ✅ Use content hash for cache keys
with open('config.yml', 'rb') as f:
    config_hash = hashlib.sha256(f.read()).hexdigest()
cache_key = f"{source_hash}_{config_hash}"

💡 Pro tip: Set the SOURCE_DATE_EPOCH environment variable to make timestamp-using tools deterministic. It is a convention specified by the Reproducible Builds project, and many modern tools respect it.
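
For your own build scripts, honoring SOURCE_DATE_EPOCH can look roughly like the sketch below. The helper name `build_timestamp` is an assumption made for illustration; the idea is simply to prefer the variable when it is set and fall back to the wall clock only when it is not.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ) -> datetime:
    """Return the build time, preferring SOURCE_DATE_EPOCH when it is set."""
    epoch = environ.get("SOURCE_DATE_EPOCH")
    # Fall back to the current time only when no fixed epoch was provided.
    seconds = int(epoch) if epoch is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


stamp = build_timestamp({"SOURCE_DATE_EPOCH": "1700000000"})
print(stamp.strftime("%Y-%m-%d"))  # 2023-11-14
```

Using UTC explicitly keeps the formatted date from depending on the build machine's time zone.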

4. Network Access During Build 🌐

Network access during builds introduces external dependencies that can change without warning.

Why network access is problematic:

Issue | Impact | Real-World Scenario
🔄 Content changes | Different results over time | API returns updated data
⚠️ Service outages | Build failures | Package registry is down
🐌 Network latency | Slow, unreliable builds | Downloading GBs repeatedly
🔒 Offline impossible | Can't build without internet | Airplane coding fails
🎯 Version mismatch | "Latest" changes | Pulling docker:latest tag

Network anti-patterns:

## ❌ Dockerfile with network dependency
FROM ubuntu:latest
## "latest" changes over time!

RUN apt-get update && apt-get install -y python3
## Installs whatever version is current

RUN curl -o data.json https://api.example.com/data
## API response changes

## ❌ Makefile fetching during build
build:
	wget https://cdn.example.com/library.js
	cat library.js main.js > bundle.js
	# Content of library.js can change

Hermetic approach: Content-Addressed Storage

┌────────────────────────────────────────────────┐
│     HERMETIC DEPENDENCY MANAGEMENT             │
└────────────────────────────────────────────────┘

   📦 Package Registry (npmjs.com)
          │
          ↓ (one-time fetch)
   📥 Local Cache/Mirror
          │ (content hash verified)
          ↓
   🏗️  Build Process
          │ (reads from cache only)
          ↓
   ✅ Output Artifact

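Key principle: Fetch BEFORE build phase

## ✅ Hermetic Dockerfile
FROM ubuntu@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3
## Exact image by content hash
## (assumes the pinned image already provides python3 and pip)

COPY requirements.txt .
COPY wheels/ /var/cache/pip/
## Pre-fetched packages vendored next to the source (hypothetical wheels/ directory)
RUN pip install --no-index --find-links=/var/cache/pip -r requirements.txt
## Installs only from the local cache, never from the network

COPY data.json .
## Data checked into source control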

🧠 Memory device: Think "Network = No reproducibility". If your build makes network calls, it's not hermetic.
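
One way to find out whether a Python build step reaches for the network is to make socket creation fail while it runs. The sketch below is only a detection aid under stated assumptions: `no_network` and `NetworkAccessError` are names invented here, and the guard only catches sockets created by Python code in the same process. It does not cover subprocesses, so real sandboxes rely on OS-level isolation instead.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code tries to open a socket inside no_network()."""


@contextmanager
def no_network():
    original = socket.socket

    def guarded(*args, **kwargs):
        raise NetworkAccessError("network access attempted during build step")

    # Swap the socket constructor for the duration of the block.
    socket.socket = guarded
    try:
        yield
    finally:
        # Always restore the real constructor, even if the step failed.
        socket.socket = original


with no_network():
    try:
        socket.socket()
    except NetworkAccessError as exc:
        print(f"blocked: {exc}")
```

Wrapping a build step in `no_network()` during CI turns a silent download into an immediate, attributable failure.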


Example 1: The Package Manager Trap 📦

Scenario: You're building a Node.js application. Your package.json specifies:

{
  "dependencies": {
    "express": "^4.17.0",
    "lodash": "~4.17.20"
  }
}

What's wrong?

The ^ and ~ symbols allow range matching:

  • ^4.17.0 matches any version from 4.17.0 up to, but not including, 5.0.0
  • ~4.17.20 matches any 4.17.x where x >= 20
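
The two range rules can be written out as a small Python sketch. This is a simplified model, not npm's actual resolver: it assumes plain `major.minor.patch` versions with a major version of 1 or higher, and ignores pre-release tags and the special caret rules for 0.x versions. The function names are illustrative.

```python
Version = tuple[int, int, int]


def parse(version: str) -> Version:
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def satisfies_caret(candidate: str, base: str) -> bool:
    """^base (major >= 1): same major version, and at least base."""
    c, b = parse(candidate), parse(base)
    return c[0] == b[0] and c >= b


def satisfies_tilde(candidate: str, base: str) -> bool:
    """~base: same major and minor version, and at least base."""
    c, b = parse(candidate), parse(base)
    return c[:2] == b[:2] and c >= b


print(satisfies_caret("4.19.0", "4.17.0"))   # True
print(satisfies_caret("5.0.0", "4.17.0"))    # False
print(satisfies_tilde("4.17.21", "4.17.20")) # True
print(satisfies_tilde("4.18.0", "4.17.20"))  # False
```

Every version for which these functions return True is a version that an unlocked install is allowed to pick.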

Running npm install today might fetch:

  • express@4.18.2
  • lodash@4.17.21

Running it next month might fetch:

  • express@4.19.0 (new release!)
  • lodash@4.17.21

Why it breaks:

┌──────────────────────────────────────────────┐
│  TIMELINE: Package Version Drift             │
├──────────────────────────────────────────────┤
│                                              │
│  Day 1: Developer A                          │
│  npm install → express@4.18.2 ✅             │
│  Build works!                                │
│            │                                 │
│            ↓                                 │
│  Day 15: Express releases 4.19.0             │
│          (with breaking change)              │
│            │                                 │
│            ↓                                 │
│  Day 16: Developer B                         │
│  npm install → express@4.19.0 ❌             │
│  Build fails! "Cannot find module..."        │
│                                              │
└──────────────────────────────────────────────┘

Hermetic solution: Lockfiles

## Generate lockfile (one time)
npm install
## Creates package-lock.json with EXACT versions

## Commit lockfile
git add package-lock.json
git commit -m "Add lockfile for hermetic builds"

## Future builds use exact versions
npm ci  # NOT 'npm install'!

The package-lock.json contains entries like this (a simplified excerpt; the exact layout depends on the lockfile version):

{
  "dependencies": {
    "express": {
      "version": "4.18.2",
      "resolved": "https://registry.npmjs.org/express/-/express-4.18.2.tgz",
      "integrity": "sha512-5/PsL6iGPdfQ/lKM1UuielYgv3BUoJfz1aUwU9vHZ+J7gyvwdQXFEBIEIaxeGf0GIcreATNyBExtalisDbuMqQ=="
    }
  }
}

Now every build uses 4.18.2 with verified content hash! 🎯
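
The integrity field is a Subresource Integrity string: a hash algorithm name, a dash, and the base64-encoded digest of the package tarball. A minimal Python sketch of checking downloaded bytes against such a string follows. The helper names are assumptions made for illustration, and the package manager performs this check itself, so the sketch only shows what "verified content hash" means.

```python
import base64
import hashlib
import hmac


def make_integrity(data: bytes, algorithm: str = "sha512") -> str:
    """Build an SRI-style string such as 'sha512-<base64 digest>'."""
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def matches_integrity(data: bytes, integrity: str) -> bool:
    """Return True when data hashes to the digest named in integrity."""
    algorithm, _, expected = integrity.partition("-")
    if algorithm not in {"sha256", "sha384", "sha512"}:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    actual = make_integrity(data, algorithm).partition("-")[2]
    # Constant-time comparison, as is customary for digest checks.
    return hmac.compare_digest(actual, expected)


tarball = b"pretend these are the tarball bytes"
pinned = make_integrity(tarball)
print(matches_integrity(tarball, pinned))           # True
print(matches_integrity(tarball + b"!", pinned))    # False
```

A single changed byte in the download produces a different digest, so a republished or tampered package cannot silently replace the locked one.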


Example 2: The Environment Variable Backdoor 🚪

Scenario: Your Python build script adapts based on environment:

## build.py
import os
import sys

def build():
    # ❌ NON-HERMETIC: Behavior changes based on environment
    debug_mode = os.getenv('DEBUG', 'false').lower() == 'true'
    
    if debug_mode:
        optimization_level = '-O0'
        include_symbols = True
    else:
        optimization_level = '-O3'
        include_symbols = False
    
    # Build with different settings!
    compile_code(optimization_level, include_symbols)
    
    # ❌ Even worse: Optional features
    if os.getenv('ENABLE_EXPERIMENTAL'):
        enable_experimental_features()
    
    # ❌ Catastrophic: Different output paths
    output_dir = os.getenv('OUTPUT_DIR', './dist')
    copy_artifacts(output_dir)

What's wrong?

  1. Two developers with different $DEBUG settings get different binaries
  2. CI server might have different environment than local machines
  3. No way to know what environment produced a given artifact

Real-world impact:

Developer | Environment | Binary Produced | Result
Alice | DEBUG=true | Unoptimized, 50MB | Works, but slow
Bob | (no DEBUG set) | Optimized, 5MB | Works, fast
CI Server | DEBUG=false explicitly | Optimized, 5MB | Works, fast
Carol | DEBUG=1 | Optimized (1 ≠ 'true'!) | Unexpected behavior!

Hermetic solution: Explicit build modes

## build.py
import argparse

def build(mode: str, output_dir: str):
    """Build with explicit, declared parameters.
    
    Args:
        mode: Must be 'debug' or 'release'
        output_dir: Explicit output location
    """
    # ✅ HERMETIC: All inputs are explicit arguments
    
    if mode == 'debug':
        optimization_level = '-O0'
        include_symbols = True
    elif mode == 'release':
        optimization_level = '-O3'
        include_symbols = False
    else:
        raise ValueError(f"Invalid mode: {mode}")
    
    compile_code(optimization_level, include_symbols)
    copy_artifacts(output_dir)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', required=True, choices=['debug', 'release'])
    parser.add_argument('--output-dir', required=True)
    args = parser.parse_args()
    
    build(args.mode, args.output_dir)

Now build behavior is determined by explicit arguments, not ambient environment:

## Everyone runs the same command for release builds
python build.py --mode=release --output-dir=./dist

## Debug builds are equally explicit
python build.py --mode=debug --output-dir=./dist-debug

💡 Best practice: Environment variables are acceptable for settings that do not affect build outputs (like TMPDIR), but never for build logic!
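
Explicit arguments stop your own script from reading ambient settings, but tools it launches still inherit the caller's environment. A minimal sketch of closing that gap, assuming a hypothetical `run_isolated` helper: pass `env=` to `subprocess.run` so the child process sees only the variables you list. (On Windows a few system variables may also be required; that is outside this sketch.)

```python
import os
import subprocess
import sys


def run_isolated(cmd: list[str], allowed_env: dict[str, str]) -> str:
    """Run cmd with only the explicitly allowed environment variables."""
    result = subprocess.run(
        cmd,
        env=dict(allowed_env),  # replaces the inherited environment entirely
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


os.environ["DEBUG"] = "true"  # ambient setting that should not leak
child = [sys.executable, "-c", "import os; print(os.environ.get('DEBUG'))"]
print(run_isolated(child, {"PATH": os.environ.get("PATH", "")}).strip())  # None
```

The allow-list doubles as documentation: anything the build genuinely needs from the environment has to be named in one place.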


Example 3: The Timestamp in Archive Problem 📁⏰

Scenario: You're creating a release tarball:

#!/bin/bash
## release.sh

## ❌ NON-HERMETIC: Embedded timestamps
tar -czf release.tar.gz src/
## Tar includes modification times of all files!

## Even worse:
tar -czf "release-$(date +%Y%m%d).tar.gz" src/
## Filename changes every day

What's wrong?

Even if source code is identical, the tarball will be different because:

  1. File modification times differ
  2. The gzip header can record a compression timestamp
  3. Filename includes current date

Demonstration:

┌────────────────────────────────────────────────┐
│  IDENTICAL SOURCE, DIFFERENT TARBALLS          │
├────────────────────────────────────────────────┤
│                                                │
│  Build 1 (Jan 15):                             │
│    $ tar -czf release.tar.gz src/              │
│    $ sha256sum release.tar.gz                  │
│    a3f5c2... release.tar.gz                    │
│                                                │
│  Build 2 (Jan 16, same code):                  │
│    $ tar -czf release.tar.gz src/              │
│    $ sha256sum release.tar.gz                  │
│    d8e1f9... release.tar.gz  ← DIFFERENT!      │
│                                                │
│  Problem: Can't verify bit-for-bit             │
│  reproducibility                               │
│                                                │
└────────────────────────────────────────────────┘

Hermetic solution: Reproducible archives

#!/bin/bash
## release.sh - Hermetic version
## (uses GNU tar options; --sort requires GNU tar 1.28 or newer)

## ✅ Set deterministic timestamps
## Use SOURCE_DATE_EPOCH (Unix timestamp)
## Typically set to last commit time
export SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)

## ✅ Sort files deterministically
## (LC_ALL=C makes the sort order independent of the user's locale)
find src/ -type f | LC_ALL=C sort > files.list

## ✅ Create archive with fixed timestamp and ownership
## gzip -n keeps the file name and timestamp out of the gzip header
tar \
  --format=posix \
  --sort=name \
  --mtime="@${SOURCE_DATE_EPOCH}" \
  --owner=0 --group=0 --numeric-owner \
  --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
  -cf - \
  -T files.list \
  | gzip -n > release.tar.gz

## ✅ Use content-based naming
HASH=$(sha256sum release.tar.gz | cut -d' ' -f1 | head -c 8)
mv release.tar.gz "release-${HASH}.tar.gz"

echo "Created reproducible archive: release-${HASH}.tar.gz"

Now the archive is deterministic:

  • Same source code → same tarball
  • Same checksum every time
  • Can verify releases cryptographically

🧠 Memory device: "Timestamps Are Random" - tar without flags is non-hermetic!
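
If the release step is written in Python rather than shell, the same normalization can be sketched with the standard `tarfile` and `gzip` modules. This is an illustrative sketch, not a drop-in replacement for the script above: `build_archive` is a hypothetical helper, and it normalizes permissions to 0644/0755, which is an extra assumption beyond what the shell version does.

```python
import gzip
import os
import tarfile


def build_archive(src_dir: str, out_path: str, mtime: int) -> None:
    """Write a .tar.gz that depends only on names, contents, and exec bit."""
    # Sort paths so the member order does not depend on directory order.
    paths = sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(src_dir)
        for name in files
    )

    def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
        info.mtime = mtime                 # fixed timestamp for every member
        info.uid = info.gid = 0            # no builder-specific ownership
        info.uname = info.gname = ""
        info.mode = 0o755 if info.mode & 0o100 else 0o644
        return info

    with open(out_path, "wb") as raw:
        # filename="" and a fixed mtime keep the gzip header constant.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=mtime) as gz:
            with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
                for path in paths:
                    arcname = os.path.relpath(path, src_dir)
                    tar.add(path, arcname=arcname, filter=normalize)
```

A call such as `build_archive("src", "release.tar.gz", int(os.environ["SOURCE_DATE_EPOCH"]))` would then produce the same bytes on every run for the same source tree.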


Example 4: The Code Generation Randomness 🎲

Scenario: Your build includes code generation from a schema:

## generate_api.py
import json
import uuid

def generate_client(schema_file):
    with open(schema_file) as f:
        schema = json.load(f)
    
    code = ['// Auto-generated API client\n']
    
    # ❌ NON-HERMETIC: Random IDs
    for endpoint in schema['endpoints']:
        endpoint_id = str(uuid.uuid4())
        code.append(f'const {endpoint["name"]}_ID = "{endpoint_id}";\n')
    
    # ❌ NON-HERMETIC: Output order follows key order in the schema file
    for name, params in schema['types'].items():
        code.append(f'interface {name} {{\n')
        for param in params:
            code.append(f'  {param["name"]}: {param["type"]};\n')
        code.append('}\n')
    
    return ''.join(code)

What's wrong?

  1. uuid.uuid4() generates random UUIDs each time
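  2. Dictionary iteration follows insertion order (guaranteed since Python 3.7; unspecified before that), so the output order tracks the key order in the schema file, and reordering keys changes the generated code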
  3. Generated code differs on every run, even with identical input

Impact on build system:

Run 1:
  generate_api.py → client.ts (hash: abc123)
  compile client.ts → client.js
  
Run 2 (no changes!):
  generate_api.py → client.ts (hash: def456)  ← Changed!
  compile client.ts → client.js  ← Unnecessary rebuild!
  
Cascading effect:
  - Everything depending on client.js rebuilds
  - Incremental builds don't work
  - Cache is useless

Hermetic solution: Deterministic generation

## generate_api.py - Hermetic version
import json
import hashlib

def generate_stable_id(name: str, schema_hash: str) -> str:
    """Generate deterministic ID from inputs."""
    # ✅ Same inputs always produce same ID
    content = f"{name}:{schema_hash}".encode('utf-8')
    return hashlib.sha256(content).hexdigest()[:16]

def generate_client(schema_file):
    with open(schema_file) as f:
        schema = json.load(f)
    
    # ✅ Compute schema hash for deterministic IDs
    schema_json = json.dumps(schema, sort_keys=True)
    schema_hash = hashlib.sha256(schema_json.encode()).hexdigest()
    
    code = ['// Auto-generated API client\n']
    code.append(f'// Schema hash: {schema_hash}\n\n')
    
    # ✅ Sort endpoints for deterministic order
    for endpoint in sorted(schema['endpoints'], key=lambda e: e['name']):
        endpoint_id = generate_stable_id(endpoint['name'], schema_hash)
        code.append(f'const {endpoint["name"]}_ID = "{endpoint_id}";\n')
    
    # ✅ Sort types and parameters
    for name in sorted(schema['types'].keys()):
        params = schema['types'][name]
        code.append(f'interface {name} {{\n')
        for param in sorted(params, key=lambda p: p['name']):
            code.append(f'  {param["name"]}: {param["type"]};\n')
        code.append('}\n\n')
    
    return ''.join(code)

Now generated code is perfectly reproducible:

  • Same schema → same code, every time
  • IDs are stable but unique per endpoint
  • Build cache works correctly
  • Can verify generated code in code review

💡 Pro tip: For any code generation, include a hash of the input schema in the output as a comment. This makes it easy to verify determinism!
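
If the generated IDs must keep the UUID format, name-based UUIDs are a deterministic alternative to `uuid.uuid4()`. The sketch below uses the standard library's `uuid.uuid5`; the namespace derived from `api.example.com` and the helper name `endpoint_uuid` are assumptions chosen for illustration.

```python
import uuid

# Hypothetical project namespace; any fixed UUID would serve the same role.
API_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "api.example.com")


def endpoint_uuid(name: str) -> str:
    """Derive a stable UUID from the endpoint name (no randomness)."""
    return str(uuid.uuid5(API_NAMESPACE, name))


# The same name yields the same UUID on every run and every machine.
print(endpoint_uuid("getUser") == endpoint_uuid("getUser"))     # True
print(endpoint_uuid("getUser") == endpoint_uuid("deleteUser"))  # False
```

Unlike the schema-hash approach, these IDs stay the same when unrelated parts of the schema change, which may or may not be what a given project wants.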


Common Mistakes ⚠️

Mistake 1: "It Works on My Machine" Syndrome 💻

The problem: Developers test builds only on their own machines, missing environment dependencies.

## Seems fine locally...
import cv2  # OpenCV installed via brew

def process_image(path):
    img = cv2.imread(path)
    return cv2.resize(img, (640, 480))

## Works! Ship it! 🚀

What happens: CI fails because OpenCV isn't installed, or a different version exists.

Fix: Test in clean environments regularly:

## Use Docker to simulate clean environment
docker run --rm -v "$(pwd)":/src python:3.11 /src/build.sh

## Or use virtual environments
python -m venv clean_env
source clean_env/bin/activate
pip install -r requirements.txt  # Only declared deps!
./build.sh

Mistake 2: "Timestamp Doesn't Matter" ❌

The problem: Assuming timestamps only affect metadata, not functionality.

## "Just metadata, right?"
from datetime import datetime

BUILD_INFO = f"Built on {datetime.now().isoformat()}"
print(BUILD_INFO)  # Goes to stdout
## Stdout becomes part of build log
## Build log checksum changes
## Cache invalidated!

What actually happens: Timestamps cascade through the build system, invalidating caches and breaking reproducibility verification.

Fix: Never embed timestamps unless explicitly required (and then use SOURCE_DATE_EPOCH).

Mistake 3: "Optional Dependencies Are Fine" 🎭

The problem: Making dependencies optional leads to inconsistent builds.

try:
    import numpy as np
    HAS_NUMPY = True
except ImportError:
    HAS_NUMPY = False

def process_data(data):
    if HAS_NUMPY:
        return np.array(data).mean()  # Fast path
    else:
        return sum(data) / len(data)  # Slow path

What's wrong: Two builds with identical source produce different behavior (and performance!) based on whether numpy is installed.

Fix: Make all dependencies explicit and required:

import numpy as np  # Hard requirement in requirements.txt

def process_data(data):
    return np.array(data).mean()

Mistake 4: "Build Scripts Don't Need Versioning" 📜

The problem: Updating build scripts without versioning them.

#!/bin/bash
## build.sh (updated on June 1st)
gcc -O2 main.c -o app  # Changed from -O3!

What happens: Old commits can't be rebuilt with original flags, making historical releases unreproducible.

Fix: Version build scripts with source code:

## Commit build script changes
git add build.sh
git commit -m "build: reduce optimization to -O2 for faster builds"

## Now old commits checkout old build.sh automatically!

Mistake 5: "Caching Speeds Up Builds" (Incorrectly) 🏎️

The problem: Aggressive caching that ignores hidden dependencies.

## cache.py
import os
import pickle

def get_cached_or_build(source_file, build_func):
    cache_file = f"{source_file}.cache"
    
    # ❌ Only checks source file timestamp!
    if os.path.exists(cache_file):
        if os.path.getmtime(cache_file) > os.path.getmtime(source_file):
            with open(cache_file, 'rb') as f:
                return pickle.load(f)
    
    result = build_func(source_file)
    with open(cache_file, 'wb') as f:
        pickle.dump(result, f)
    return result

What's wrong: Misses changes to:

  • Tools used by build_func
  • Configuration files
  • Environment variables
  • Dependencies of source_file

Fix: Use content-addressed caching:

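import hashlib
import os
import pickle
import subprocess

def get_tool_version(tool):
    # Illustrative helper (an assumption, not a library function):
    # treats the tool's --version output as its identity
    return subprocess.check_output([tool, '--version'], text=True)

def get_cached_or_build(source_file, build_func, dependencies):
    # ✅ Hash ALL inputs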
    hasher = hashlib.sha256()
    
    # Source content
    with open(source_file, 'rb') as f:
        hasher.update(f.read())
    
    # All dependencies
    for dep in dependencies:
        with open(dep, 'rb') as f:
            hasher.update(f.read())
    
    # Tool versions
    hasher.update(get_tool_version('gcc').encode())
    
    cache_key = hasher.hexdigest()
    cache_file = f".cache/{cache_key}"
    
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    
    result = build_func(source_file)
    os.makedirs('.cache', exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(result, f)
    return result
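
One detail worth tightening when hashing several inputs into one key: feeding raw bytes back to back makes the pairs ("ab", "c") and ("a", "bc") indistinguishable. A possible refinement, sketched here with a hypothetical `cache_key` helper, is to label each input and prefix it with its length so every distinct set of inputs maps to a distinct byte stream.

```python
import hashlib


def cache_key(inputs: dict[str, bytes]) -> str:
    """Hash named inputs so that boundaries and names are unambiguous."""
    hasher = hashlib.sha256()
    for name in sorted(inputs):  # sorted: key does not depend on dict order
        label = name.encode("utf-8")
        data = inputs[name]
        # Length prefixes mark where each field starts and ends.
        hasher.update(len(label).to_bytes(8, "big"))
        hasher.update(label)
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()


key_one = cache_key({"main.c": b"ab", "config.yml": b"c"})
key_two = cache_key({"main.c": b"a", "config.yml": b"bc"})
print(key_one != key_two)  # True
```

Tool versions and build flags can be added as further named entries, so that changing any of them also changes the key.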

Key Takeaways 🎓

📋 Non-Hermetic Pattern Checklist

Pattern | Detection | Fix
Hidden Dependencies | Build fails on clean system | Declare all dependencies explicitly
Non-Determinism | Different outputs from same inputs | Sort collections, fix seeds, avoid timestamps
Timestamp Coupling | Artifacts differ only by timestamp | Use SOURCE_DATE_EPOCH, content hashing
Network Access | Build fails offline | Pre-fetch to cache, use lockfiles
Environment Variables | Different behavior per developer | Use explicit arguments
Optional Dependencies | Inconsistent feature availability | Make all dependencies required

The Hermetic Build Test 🧪

Your build is hermetic if:

  1. ✅ Clean environment test: Runs in fresh Docker container with only declared dependencies
  2. ✅ Bit-for-bit reproducibility: Two builds produce identical output (same checksum)
  3. ✅ Offline test: Works without network access (after initial dependency fetch)
  4. ✅ Time travel test: Can rebuild old commits with original tools (versioned toolchain)
  5. ✅ Platform independence: Produces same output on Linux, macOS, Windows (or explicitly platform-specific)
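
The bit-for-bit check in this list can be automated by building twice into separate directories and comparing file hashes. A minimal sketch, assuming the two output directories already exist; `tree_digests` and `differing_files` are names made up for this example.

```python
import hashlib
import os


def tree_digests(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                digests[os.path.relpath(full, root)] = hashlib.sha256(f.read()).hexdigest()
    return digests


def differing_files(build_a: str, build_b: str) -> list[str]:
    """List files that are missing from one build or differ in content."""
    a, b = tree_digests(build_a), tree_digests(build_b)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty list from `differing_files("out-first", "out-second")` means the two builds matched bit for bit; any entries point directly at the artifacts to investigate.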

Quick Reference: Hermetic Alternatives

Anti-Pattern | Hermetic Alternative
npm install | npm ci (with lockfile)
datetime.now() | os.environ['SOURCE_DATE_EPOCH']
uuid.uuid4() | hashlib.sha256(content).hexdigest()
for key in dict: | for key in sorted(dict):
wget url | Pre-fetch to cache, verify checksum
os.getenv('FLAG') | argparse with required arguments
System Python packages | Virtual environment + requirements.txt
Tools from $PATH | Versioned tools in project

The Golden Rule πŸ†

If it's not in the source repository and not explicitly declared as a dependency with a version/hash, it doesn't exist.

Treat any other input as a bug waiting to happen.


📚 Further Study

  1. Reproducible Builds Project - https://reproducible-builds.org/ - Comprehensive guide to achieving bit-for-bit reproducibility across different systems
  2. Bazel Build Encyclopedia - https://bazel.build/concepts/build-ref - Learn how Google's build system enforces hermeticity at scale
  3. Docker Multi-Stage Builds - https://docs.docker.com/build/building/multi-stage/ - Practical techniques for hermetic containerized builds with dependency isolation

Congratulations! 🎉 You can now identify and fix non-hermetic patterns in build systems. Remember: hermetic builds aren't just theoretical purity; they save real debugging time and make teams more productive. The upfront investment in fixing these patterns pays dividends every single day.

Next up: We'll explore tools and frameworks that enforce hermeticity automatically, so you don't have to remember all these rules manually!