Common Non-Hermetic Patterns
Identify the hidden dependencies and anti-patterns that break hermeticity in typical build systems.
Understand the pitfalls that break build reproducibility. This lesson covers hidden dependencies, non-deterministic behavior, timestamp coupling, and network access: common patterns that compromise hermetic builds, and how to identify them in your build systems.
Welcome
Welcome to one of the most practical lessons in hermetic builds! While understanding the theory is important, recognizing anti-patterns in the wild is what separates build engineers who struggle with "works on my machine" problems from those who create truly reproducible systems.
Think of non-hermetic patterns as build system landmines. They might not explode immediately (your build could work perfectly for weeks or months), but eventually something changes (a new developer joins, the build server gets upgraded, or you try to reproduce an old release), and suddenly everything breaks.
In this lesson, you'll learn to spot these landmines before they detonate. We'll examine real-world examples, understand why each pattern is problematic, and explore how to refactor them into hermetic alternatives.
Core Concepts: The Four Horsemen of Non-Hermetic Builds
Non-hermetic patterns fall into four main categories, each representing a different way builds become environment-dependent instead of input-dependent:
1. Hidden Dependencies
Hidden dependencies are inputs your build relies on that aren't explicitly declared. This is perhaps the most common anti-pattern.
Classic examples:
| Pattern | What It Does | Why It Breaks |
|---|---|---|
| npm install without lockfile | Fetches latest compatible versions | Dependency versions change over time |
| Reading ~/.config files | Uses user-specific configuration | Different users get different results |
| Importing system Python packages | Uses globally installed libraries | System packages vary across machines |
| Relying on $PATH tools | Uses whatever version is found first | Tool versions differ by environment |
Tip: If you can't reproduce a build by only providing the source code and declared dependencies, you have hidden dependencies.
Deep dive example:
# ❌ NON-HERMETIC: Hidden dependency on system tools
import subprocess

def compress_assets():
    # Relies on whatever 'convert' is in PATH
    subprocess.run(['convert', 'logo.png', '-resize', '50%', 'logo-small.png'])
    # ImageMagick version affects output!
This code assumes convert (ImageMagick) is installed, but:
- Different versions produce different output files
- Tool might not exist on some systems
- No way to know which version was used for a given build
Hermetic alternative:
# ✅ HERMETIC: Explicit dependency with version pinning
# BUILD file declares: py_binary(
#     deps = ["@imagemagick_7.1.0//..."]
# )
import os
import subprocess

def compress_assets(imagemagick_path):
    # Use explicitly versioned tool from declared dependency
    convert_bin = os.path.join(imagemagick_path, 'convert')
    # check=True makes the build fail if the tool fails
    subprocess.run([convert_bin, 'logo.png', '-resize', '50%', 'logo-small.png'], check=True)
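One way to surface hidden dependencies on $PATH and other ambient settings early is to run tools with a scrubbed environment, so anything undeclared fails immediately instead of silently resolving to a system copy. The following is a minimal sketch under stated assumptions: `hermetic_env` and `run_tool` are hypothetical helper names, and the fixed `LC_ALL` and `TZ` values are illustrative choices rather than part of the lesson. It assumes a POSIX-like system.

```python
import os
import subprocess
import sys


def hermetic_env(tool_dirs):
    """Return a minimal environment exposing only the declared tool directories."""
    return {
        "PATH": os.pathsep.join(tool_dirs),
        "LC_ALL": "C",  # fixed locale so sort order and messages do not vary
        "TZ": "UTC",    # fixed time zone
    }


def run_tool(argv, tool_dirs):
    """Run a tool without inheriting the caller's environment variables."""
    return subprocess.run(
        argv,
        env=hermetic_env(tool_dirs),  # replaces, rather than extends, os.environ
        check=True,                   # a failing tool fails the build
        capture_output=True,
        text=True,
    )
```

For example, `run_tool([sys.executable, "-c", "import os; print(sorted(os.environ))"], ["/opt/toolchain/bin"])` shows that the child process sees only the declared variables; the directory name here is a placeholder.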
2. Non-Deterministic Behavior
Non-deterministic builds produce different outputs even with identical inputs. This violates the fundamental principle of hermeticity.
Common sources of non-determinism:
- Timestamps/dates: time-based values, current date/time
- Unordered maps: iteration order, hash table traversal
- Random numbers: UUID generation, crypto without seed
- Race conditions: parallel execution, thread scheduling
- Network responses: API calls, download timing
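Race conditions are the least obvious item in this list, so here is a small sketch of how parallel execution can leak scheduling order into an output, and one way to avoid it. `compile_unit` is a hypothetical stand-in for a real build step; the point is only the difference between collecting results in completion order and in input order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compile_unit(name):
    """Hypothetical stand-in for a compile step; returns the object file name."""
    return f"{name}.o"


def build_in_completion_order(units):
    """Collects results as they finish, so the order depends on thread scheduling."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(compile_unit, unit) for unit in units]
        return [future.result() for future in as_completed(futures)]


def build_in_input_order(units):
    """Still parallel, but Executor.map yields results in the order of its inputs."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(compile_unit, sorted(units)))
```

Both functions do the same work in parallel; only the second one guarantees that a link line or manifest built from the result is identical from run to run.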
Example: Hash map iteration
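// ❌ NON-HERMETIC: Output order follows the order in which keys were inserted
const fs = require('fs');

const dependencies = {
  'react': '18.2.0',
  'lodash': '4.17.21',
  'axios': '1.3.4'
};
let output = '';
for (const [pkg, version] of Object.entries(dependencies)) {
  output += `${pkg}@${version}\n`;
}
fs.writeFileSync('deps.txt', output);
// Modern JavaScript iterates string keys in insertion order, so the file changes
// whenever the object is populated in a different order (for example from a
// directory listing or from parallel results).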
Hermetic fix:
// ✅ HERMETIC: Deterministic output with sorted keys
const fs = require('fs');

const dependencies = {
  'react': '18.2.0',
  'lodash': '4.17.21',
  'axios': '1.3.4'
};
const sortedKeys = Object.keys(dependencies).sort();
let output = '';
for (const pkg of sortedKeys) {
  output += `${pkg}@${dependencies[pkg]}\n`;
}
fs.writeFileSync('deps.txt', output);
// Always: axios, lodash, react
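The same idea carries over to Python build scripts. As a rough sketch (the function name `render_manifest` is made up for illustration), serializing with sorted keys and fixed separators makes the output independent of the order in which the mapping was populated:

```python
import json


def render_manifest(dependencies):
    """Serialize a dependency mapping with sorted keys and fixed formatting."""
    return json.dumps(
        dependencies,
        sort_keys=True,          # key order no longer depends on insertion order
        indent=2,
        separators=(",", ": "),  # pin separators explicitly
    ) + "\n"


# Two mappings with the same content but different insertion order
first = {"react": "18.2.0", "lodash": "4.17.21", "axios": "1.3.4"}
second = {"axios": "1.3.4", "react": "18.2.0", "lodash": "4.17.21"}
same_output = render_manifest(first) == render_manifest(second)
```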
3. Timestamp Coupling
Timestamp coupling occurs when builds embed or depend on current time values. This seems harmless but creates subtle reproducibility issues.
Why timestamps break hermeticity:
- Rebuild detection fails: an embedded timestamp makes every output look changed, so build systems that compare outputs rebuild everything downstream
- Binary comparison impossible: Two identical builds differ only by timestamp
- Debugging nightmares: Can't reproduce exact output from historical commit
Common timestamp anti-patterns:
import os
import time
from datetime import datetime

# ❌ Embedding build date in binary
BUILD_DATE = datetime.now().strftime('%Y-%m-%d')
version_string = f"v1.2.3 (built {BUILD_DATE})"

# ❌ Using timestamps in filenames
output_file = f"report_{int(time.time())}.pdf"

# ❌ Timestamp-based cache keys (source_hash is computed elsewhere)
cache_key = f"{source_hash}_{os.path.getmtime('config.yml')}"
Hermetic alternatives:
# ✅ Use VCS commit information instead
import subprocess

def get_build_info():
    commit = subprocess.check_output(
        ['git', 'rev-parse', '--short', 'HEAD']
    ).decode().strip()
    # If you need a date, use commit date (deterministic)
    commit_date = subprocess.check_output(
        ['git', 'show', '-s', '--format=%ci', 'HEAD']
    ).decode().strip()
    return f"v1.2.3 (commit {commit}, {commit_date})"

# ✅ Use content hash for filenames
import hashlib

with open('report_data.json', 'rb') as f:
    content_hash = hashlib.sha256(f.read()).hexdigest()[:8]
output_file = f"report_{content_hash}.pdf"

# ✅ Use content hash for cache keys (source_hash is computed elsewhere)
with open('config.yml', 'rb') as f:
    config_hash = hashlib.sha256(f.read()).hexdigest()
cache_key = f"{source_hash}_{config_hash}"
Pro tip: Set the SOURCE_DATE_EPOCH environment variable to make timestamp-using tools deterministic. It is a convention specified by the Reproducible Builds project, and many modern tools respect it.
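For build scripts of your own, honoring SOURCE_DATE_EPOCH can look like the sketch below. The function name `build_date` is illustrative, and raising an error when the variable is unset is a design choice made here to keep the wall clock out of the build; other tools fall back to the current time instead.

```python
import os
from datetime import datetime, timezone


def build_date(environ=os.environ):
    """Return the build date derived from SOURCE_DATE_EPOCH (seconds since the Unix epoch)."""
    raw = environ.get("SOURCE_DATE_EPOCH")
    if raw is None:
        # Refuse to fall back to the wall clock, which would vary between builds
        raise RuntimeError("SOURCE_DATE_EPOCH is not set")
    # Always interpret the value in UTC so the local time zone cannot leak in
    return datetime.fromtimestamp(int(raw), tz=timezone.utc).strftime("%Y-%m-%d")
```

Calling `build_date({"SOURCE_DATE_EPOCH": "0"})` returns `"1970-01-01"` regardless of when or where the build runs.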
4. Network Access During Build
Network access during builds introduces external dependencies that can change without warning.
Why network access is problematic:
| Issue | Impact | Real-World Scenario |
|---|---|---|
| Content changes | Different results over time | API returns updated data |
| Service outages | Build failures | Package registry is down |
| Network latency | Slow, unreliable builds | Downloading GBs repeatedly |
| Offline impossible | Can't build without internet | Airplane coding fails |
| Version mismatch | "Latest" changes | Pulling docker:latest tag |
Network anti-patterns:
# ❌ Dockerfile with network dependency
FROM ubuntu:latest
# "latest" changes over time!
RUN apt-get update && apt-get install -y python3
# Installs whatever version is current
RUN curl -o data.json https://api.example.com/data
# API response changes

# ❌ Makefile fetching during build
build:
	wget https://cdn.example.com/library.js
	cat library.js main.js > bundle.js
	# Content of library.js can change
Hermetic approach: Content-Addressed Storage
HERMETIC DEPENDENCY MANAGEMENT
Package Registry (npmjs.com)
    ↓ (one-time fetch)
Local Cache/Mirror
    ↓ (content hash verified)
Build Process
    ↓ (reads from cache only)
Output Artifact
Key principle: Fetch BEFORE build phase
# ✅ Hermetic Dockerfile
FROM ubuntu@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3
# Exact image by content hash
COPY requirements.txt .
RUN pip install --no-index --find-links=/var/cache/pip -r requirements.txt
# Uses pre-fetched packages from cache
COPY data.json .
# Data checked into source control
Memory device: Think "Network = No reproducibility". If your build makes network calls, it's not hermetic.
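The "content hash verified" step can be sketched in a few lines of Python. This is an illustration rather than a prescribed implementation: `sha256_of` and `require_pinned` are hypothetical names, and the pinned digest is assumed to live in source control next to the build definition.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def require_pinned(path, expected_sha256):
    """Return path only if its content matches the pinned digest; otherwise fail the build."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"{path}: expected sha256 {expected_sha256}, got {actual}")
    return path
```

Running this check on every pre-fetched file before the build phase starts means a changed upstream artifact produces a clear error instead of a silently different output.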
Example 1: The Package Manager Trap
Scenario: You're building a Node.js application. Your package.json specifies:
{
  "dependencies": {
    "express": "^4.17.0",
    "lodash": "~4.17.20"
  }
}
What's wrong?
The ^ and ~ symbols allow range matching:
- ^4.17.0 matches any version >= 4.17.0 and < 5.0.0
- ~4.17.20 matches any version >= 4.17.20 and < 4.18.0
Running npm install today might fetch:
- express@4.18.2
- lodash@4.17.21
Running it next month might fetch:
- express@4.19.0 (new release!)
- lodash@4.17.21
Why it breaks:
Timeline: package version drift
- Day 1: Developer A runs npm install and gets express@4.18.2. Build works!
- Day 15: Express releases 4.19.0 (with breaking change).
- Day 16: Developer B runs npm install and gets express@4.19.0. Build fails! "Cannot find module..."
Hermetic solution: Lockfiles
# Generate lockfile (one time)
npm install
# Creates package-lock.json with EXACT versions

# Commit lockfile
git add package-lock.json
git commit -m "Add lockfile for hermetic builds"

# Future builds use exact versions
npm ci  # NOT 'npm install'!
The package-lock.json contains:
{
  "dependencies": {
    "express": {
      "version": "4.18.2",
      "resolved": "https://registry.npmjs.org/express/-/express-4.18.2.tgz",
      "integrity": "sha512-5/PsL6iGPdfQ/lKM1UuielYgv3BUoJfz1aUwU9vHZ+J7gyvwdQXFEBIEIaxeGf0GIcreATNyBExtalisDbuMqQ=="
    }
  }
}
Now every build uses 4.18.2 with verified content hash!
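The integrity value follows the Subresource Integrity format: an algorithm name, a hyphen, and the base64-encoded digest of the tarball. To make the "verified content hash" concrete, here is a minimal sketch of that check in Python. The function names are invented for illustration, and the sketch handles only a single sha512 entry, which is an assumption about the input.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Format a SHA-512 digest as an SRI-style string: 'sha512-' + base64(digest)."""
    return "sha512-" + base64.b64encode(hashlib.sha512(data).digest()).decode("ascii")


def matches_integrity(data: bytes, integrity: str) -> bool:
    """Check downloaded bytes against a pinned integrity string (sha512 only in this sketch)."""
    if not integrity.startswith("sha512-"):
        raise ValueError("this sketch only handles sha512 integrity strings")
    return sri_sha512(data) == integrity
```

Any change to the tarball bytes, however small, produces a different string, which is how a lockfile can detect a tampered or republished package.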
Example 2: The Environment Variable Backdoor
Scenario: Your Python build script adapts based on environment:
# build.py
import os
import sys

def build():
    # ❌ NON-HERMETIC: Behavior changes based on environment
    debug_mode = os.getenv('DEBUG', 'false').lower() == 'true'
    if debug_mode:
        optimization_level = '-O0'
        include_symbols = True
    else:
        optimization_level = '-O3'
        include_symbols = False
    # Build with different settings!
    compile_code(optimization_level, include_symbols)

    # ❌ Even worse: Optional features
    if os.getenv('ENABLE_EXPERIMENTAL'):
        enable_experimental_features()

    # ❌ Catastrophic: Different output paths
    output_dir = os.getenv('OUTPUT_DIR', './dist')
    copy_artifacts(output_dir)
What's wrong?
- Two developers with different $DEBUG settings get different binaries
- CI server might have different environment than local machines
- No way to know what environment produced a given artifact
Real-world impact:
| Developer | Environment | Binary Produced | Result |
|---|---|---|---|
| Alice | DEBUG=true | Unoptimized, 50MB | Works, but slow |
| Bob | (no DEBUG set) | Optimized, 5MB | Works, fast |
| CI Server | DEBUG=false explicitly | Optimized, 5MB | Works, fast |
| Carol | DEBUG=1 | Optimized ('1' ≠ 'true'!) | Unexpected behavior! |
Hermetic solution: Explicit build modes
# build.py
import argparse

def build(mode: str, output_dir: str):
    """Build with explicit, declared parameters.

    Args:
        mode: Must be 'debug' or 'release'
        output_dir: Explicit output location
    """
    # ✅ HERMETIC: All inputs are explicit arguments
    if mode == 'debug':
        optimization_level = '-O0'
        include_symbols = True
    elif mode == 'release':
        optimization_level = '-O3'
        include_symbols = False
    else:
        raise ValueError(f"Invalid mode: {mode}")
    compile_code(optimization_level, include_symbols)
    copy_artifacts(output_dir)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', required=True, choices=['debug', 'release'])
    parser.add_argument('--output-dir', required=True)
    args = parser.parse_args()
    build(args.mode, args.output_dir)
Now build behavior is determined by explicit arguments, not ambient environment:
# Everyone runs the same command for release builds
python build.py --mode=release --output-dir=./dist
# Debug builds are equally explicit
python build.py --mode=debug --output-dir=./dist-debug
Best practice: Environment variables are OK for non-semantic settings (like TMPDIR or HOME), but never for build logic!
Example 3: The Timestamp in Archive Problem
Scenario: You're creating a release tarball:
#!/bin/bash
# release.sh

# ❌ NON-HERMETIC: Embedded timestamps
tar -czf release.tar.gz src/
# Tar includes modification times of all files!

# Even worse:
tar -czf "release-$(date +%Y%m%d).tar.gz" src/
# Filename changes every day
What's wrong?
Even if source code is identical, the tarball will be different because:
- File modification times differ
- gzip can record a timestamp in the compressed file's header
- Filename includes current date
Demonstration:
Identical source, different tarballs:
Build 1 (Jan 15):
$ tar -czf release.tar.gz src/
$ sha256sum release.tar.gz
a3f5c2... release.tar.gz
Build 2 (Jan 16, same code):
$ tar -czf release.tar.gz src/
$ sha256sum release.tar.gz
d8e1f9... release.tar.gz ← DIFFERENT!
Problem: Can't verify bit-for-bit reproducibility
Hermetic solution: Reproducible archives
#!/bin/bash
# release.sh - Hermetic version

# ✅ Set deterministic timestamps
# Use SOURCE_DATE_EPOCH (Unix timestamp)
# Typically set to last commit time
export SOURCE_DATE_EPOCH=$(git log -1 --format=%ct)

# ✅ Sort files deterministically (LC_ALL=C keeps the order locale-independent)
find src/ -type f | LC_ALL=C sort > files.list

# ✅ Create archive with fixed timestamp
# gzip -n keeps the file name and timestamp out of the gzip header
tar \
  --sort=name \
  --mtime="@${SOURCE_DATE_EPOCH}" \
  --owner=0 --group=0 --numeric-owner \
  --pax-option=exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime \
  -cf - \
  -T files.list | gzip -n > release.tar.gz

# ✅ Use content-based naming
HASH=$(sha256sum release.tar.gz | cut -d' ' -f1 | head -c 8)
mv release.tar.gz "release-${HASH}.tar.gz"
echo "Created reproducible archive: release-${HASH}.tar.gz"
Now the archive is deterministic:
- Same source code β same tarball
- Same checksum every time
- Can verify releases cryptographically
Memory device: "Timestamps Are Random" - tar without flags is non-hermetic!
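If the release tooling is written in Python rather than shell, the same normalization can be sketched with the standard library's tarfile and gzip modules. This is an illustrative sketch, not the lesson's method: the function name is hypothetical, file modes are flattened to 0o644 (so executable bits are dropped), and only regular files are handled.

```python
import gzip
import io
import os
import tarfile


def reproducible_tar_gz(src_dir: str, epoch: int) -> bytes:
    """Pack src_dir into .tar.gz bytes that depend only on file names and contents."""
    paths = []
    for root, dirs, files in os.walk(src_dir):
        dirs.sort()  # fixed traversal order
        paths.extend(os.path.join(root, name) for name in files)

    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path in sorted(paths):
            with open(path, "rb") as f:
                data = f.read()
            # Build each header by hand so owner, mode, and mtime never come from the filesystem
            name = os.path.relpath(path, src_dir).replace(os.sep, "/")
            info = tarfile.TarInfo(name=name)  # uid/gid default to 0, owner names to ""
            info.size = len(data)
            info.mtime = epoch
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(data))

    out = io.BytesIO()
    # The mtime argument pins the timestamp field in the gzip header
    with gzip.GzipFile(fileobj=out, mode="wb", mtime=epoch) as gz:
        gz.write(raw.getvalue())
    return out.getvalue()
```

Passing the last commit time as `epoch` mirrors the SOURCE_DATE_EPOCH convention. The compressed bytes can still differ between zlib versions, so the toolchain itself needs pinning too.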
Example 4: The Code Generation Randomness
Scenario: Your build includes code generation from a schema:
# generate_api.py
import json
import uuid

def generate_client(schema_file):
    with open(schema_file) as f:
        schema = json.load(f)
    code = ['// Auto-generated API client\n']
    # ❌ NON-HERMETIC: Random IDs
    for endpoint in schema['endpoints']:
        endpoint_id = str(uuid.uuid4())
        code.append(f'const {endpoint["name"]}_ID = "{endpoint_id}";\n')
    # ❌ NON-HERMETIC: Output order follows key order in the schema file
    for name, params in schema['types'].items():
        code.append(f'interface {name} {{\n')
        for param in params:
            code.append(f'  {param["name"]}: {param["type"]};\n')
        code.append('}\n')
    return ''.join(code)
What's wrong?
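- uuid.uuid4() generates random UUIDs each time
- Dictionary iteration order is not guaranteed before Python 3.7; from 3.7 on it follows the key order in the schema file, so reordering keys changes the output even when the meaning is the same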
- Generated code differs on every run, even with identical input
Impact on build system:
Run 1:
generate_api.py → client.ts (hash: abc123)
compile client.ts → client.js
Run 2 (no changes!):
generate_api.py → client.ts (hash: def456) ← Changed!
compile client.ts → client.js ← Unnecessary rebuild!
Cascading effect:
- Everything depending on client.js rebuilds
- Incremental builds don't work
- Cache is useless
Hermetic solution: Deterministic generation
# generate_api.py - Hermetic version
import json
import hashlib

def generate_stable_id(name: str, schema_hash: str) -> str:
    """Generate deterministic ID from inputs."""
    # ✅ Same inputs always produce same ID
    content = f"{name}:{schema_hash}".encode('utf-8')
    return hashlib.sha256(content).hexdigest()[:16]

def generate_client(schema_file):
    with open(schema_file) as f:
        schema = json.load(f)
    # ✅ Compute schema hash for deterministic IDs
    schema_json = json.dumps(schema, sort_keys=True)
    schema_hash = hashlib.sha256(schema_json.encode()).hexdigest()
    code = ['// Auto-generated API client\n']
    code.append(f'// Schema hash: {schema_hash}\n\n')
    # ✅ Sort endpoints for deterministic order
    for endpoint in sorted(schema['endpoints'], key=lambda e: e['name']):
        endpoint_id = generate_stable_id(endpoint['name'], schema_hash)
        code.append(f'const {endpoint["name"]}_ID = "{endpoint_id}";\n')
    # ✅ Sort types and parameters
    for name in sorted(schema['types'].keys()):
        params = schema['types'][name]
        code.append(f'interface {name} {{\n')
        for param in sorted(params, key=lambda p: p['name']):
            code.append(f'  {param["name"]}: {param["type"]};\n')
        code.append('}\n\n')
    return ''.join(code)
Now generated code is perfectly reproducible:
- Same schema β same code, every time
- IDs are stable but unique per endpoint
- Build cache works correctly
- Can verify generated code in code review
Pro tip: For any code generation, include a hash of the input schema in the output as a comment. This makes it easy to verify determinism!
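When the generated identifiers must be real UUIDs, a name-based UUID is one deterministic substitute for uuid.uuid4(). The sketch below uses the standard library's uuid.uuid5; the namespace value and the name `endpoint_uuid` are assumptions made for illustration.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes
API_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "api.example.com")


def endpoint_uuid(endpoint_name: str) -> str:
    """Derive a stable UUID from the endpoint name instead of drawing a random one."""
    return str(uuid.uuid5(API_NAMESPACE, endpoint_name))
```

Unlike an ID that mixes in the schema hash, this ID stays the same when unrelated parts of the schema change. Which behavior is preferable depends on whether consumers store the IDs.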
Common Mistakes
Mistake 1: "It Works on My Machine" Syndrome
The problem: Developers test builds only on their own machines, missing environment dependencies.
# Seems fine locally...
import cv2  # OpenCV installed via brew

def process_image(path):
    img = cv2.imread(path)
    return cv2.resize(img, (640, 480))

# Works! Ship it!
What happens: CI fails because OpenCV isn't installed, or a different version exists.
Fix: Test in clean environments regularly:
# Use Docker to simulate clean environment
docker run --rm -v "$(pwd)":/src python:3.11 /src/build.sh

# Or use virtual environments
python -m venv clean_env
source clean_env/bin/activate
pip install -r requirements.txt  # Only declared deps!
./build.sh
Mistake 2: "Timestamp Doesn't Matter"
The problem: Assuming timestamps only affect metadata, not functionality.
# "Just metadata, right?"
from datetime import datetime

BUILD_INFO = f"Built on {datetime.now().isoformat()}"
print(BUILD_INFO)  # Goes to stdout
# Stdout becomes part of build log
# Build log checksum changes
# Cache invalidated!
What actually happens: Timestamps cascade through the build system, invalidating caches and breaking reproducibility verification.
Fix: Never embed timestamps unless explicitly required (and then use SOURCE_DATE_EPOCH).
Mistake 3: "Optional Dependencies Are Fine"
The problem: Making dependencies optional leads to inconsistent builds.
try:
    import numpy as np
    HAS_NUMPY = True
except ImportError:
    HAS_NUMPY = False

def process_data(data):
    if HAS_NUMPY:
        return np.array(data).mean()  # Fast path
    else:
        return sum(data) / len(data)  # Slow path
What's wrong: Two builds with identical source produce different behavior (and performance!) based on whether numpy is installed.
Fix: Make all dependencies explicit and required:
import numpy as np  # Hard requirement in requirements.txt

def process_data(data):
    return np.array(data).mean()
Mistake 4: "Build Scripts Don't Need Versioning"
The problem: Updating build scripts without versioning them.
#!/bin/bash
# build.sh (updated on June 1st)
gcc -O2 main.c -o app  # Changed from -O3!
What happens: Old commits can't be rebuilt with original flags, making historical releases unreproducible.
Fix: Version build scripts with source code:
# Commit build script changes
git add build.sh
git commit -m "build: reduce optimization to -O2 for faster builds"
# Now old commits checkout old build.sh automatically!
Mistake 5: "Caching Speeds Up Builds" (Incorrectly)
The problem: Aggressive caching that ignores hidden dependencies.
# cache.py
import os
import pickle

def get_cached_or_build(source_file, build_func):
    cache_file = f"{source_file}.cache"
    # ❌ Only checks source file timestamp!
    if os.path.exists(cache_file):
        if os.path.getmtime(cache_file) > os.path.getmtime(source_file):
            with open(cache_file, 'rb') as f:
                return pickle.load(f)
    result = build_func(source_file)
    with open(cache_file, 'wb') as f:
        pickle.dump(result, f)
    return result
What's wrong: Misses changes to:
- Tools used by build_func
- Configuration files
- Environment variables
- Dependencies of source_file
Fix: Use content-addressed caching:
import hashlib
import os
import pickle

def get_cached_or_build(source_file, build_func, dependencies):
    # ✅ Hash ALL inputs
    hasher = hashlib.sha256()
    # Source content
    with open(source_file, 'rb') as f:
        hasher.update(f.read())
    # All dependencies (sorted so the key does not depend on list order)
    for dep in sorted(dependencies):
        with open(dep, 'rb') as f:
            hasher.update(f.read())
    # Tool versions (get_tool_version is a helper assumed to be defined elsewhere)
    hasher.update(get_tool_version('gcc').encode())
    cache_key = hasher.hexdigest()
    cache_file = f".cache/{cache_key}"
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    result = build_func(source_file)
    os.makedirs('.cache', exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(result, f)
    return result
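Concatenating raw bytes into one hash can let different input sets collide, because the boundary between one file and the next is not recorded. A slightly more careful key derivation, sketched below under the assumption that tool versions and build parameters are passed in as plain dictionaries (the `cache_key` function and its arguments are hypothetical), length-prefixes every field and labels it:

```python
import hashlib


def cache_key(input_files, tool_versions, params):
    """Derive a cache key from file contents, tool versions, and build parameters."""
    hasher = hashlib.sha256()

    def feed(value: bytes):
        # Length-prefix each field so adjacent fields cannot blur into each other
        hasher.update(len(value).to_bytes(8, "big"))
        hasher.update(value)

    for path in sorted(input_files):  # order of the list must not matter
        feed(b"file")
        feed(path.encode())
        with open(path, "rb") as f:
            feed(f.read())
    for tool in sorted(tool_versions):
        feed(b"tool")
        feed(tool.encode())
        feed(tool_versions[tool].encode())
    for name in sorted(params):
        feed(b"param")
        feed(name.encode())
        feed(str(params[name]).encode())
    return hasher.hexdigest()
```

Because paths are part of the key, this sketch works best with paths relative to the project root; absolute paths would make keys differ between machines.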
Key Takeaways
Non-Hermetic Pattern Checklist
| Pattern | Detection | Fix |
|---|---|---|
| Hidden Dependencies | Build fails on clean system | Declare all dependencies explicitly |
| Non-Determinism | Different outputs from same inputs | Sort collections, fix seeds, avoid timestamps |
| Timestamp Coupling | Artifacts differ only by timestamp | Use SOURCE_DATE_EPOCH, content hashing |
| Network Access | Build fails offline | Pre-fetch to cache, use lockfiles |
| Environment Variables | Different behavior per developer | Use explicit arguments |
| Optional Dependencies | Inconsistent feature availability | Make all dependencies required |
The Hermetic Build Test
Your build is hermetic if:
- Clean environment test: Runs in fresh Docker container with only declared dependencies
- Bit-for-bit reproducibility: Two builds produce identical output (same checksum)
- Offline test: Works without network access (after initial dependency fetch)
- Time travel test: Can rebuild old commits with original tools (versioned toolchain)
- Platform independence: Produces same output on Linux, macOS, Windows (or explicitly platform-specific)
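The bit-for-bit criterion is straightforward to automate. As a rough sketch, assuming the build can be expressed as a callable that writes its outputs into a given directory (an assumption made for illustration, as are the function names), you can run it twice and compare digests of the output trees:

```python
import hashlib
import os
import tempfile


def tree_digest(root):
    """Hash relative paths and file contents under root in a fixed order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fixed traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode())
            digest.update(b"\0")
            with open(path, "rb") as f:
                digest.update(f.read())
            digest.update(b"\0")
    return digest.hexdigest()


def is_reproducible(build):
    """Run build(output_dir) twice in fresh directories and compare the results."""
    digests = []
    for _ in range(2):
        with tempfile.TemporaryDirectory() as out:
            build(out)
            digests.append(tree_digest(out))
    return digests[0] == digests[1]
```

A check like this only catches non-determinism that shows up between two runs on the same machine. Differences caused by tool versions or platforms need the clean environment and time travel tests.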
Quick Reference: Hermetic Alternatives
| Anti-Pattern | Hermetic Alternative |
|---|---|
| npm install | npm ci (with lockfile) |
| datetime.now() | os.environ['SOURCE_DATE_EPOCH'] |
| uuid.uuid4() | hashlib.sha256(content).hexdigest() |
| for key in dict: | for key in sorted(dict): |
| wget url | Pre-fetch to cache, verify checksum |
| os.getenv('FLAG') | argparse with required arguments |
| System Python packages | Virtual environment + requirements.txt |
| Tools from $PATH | Versioned tools in project |
The Golden Rule
If it's not in the source repository and not explicitly declared as a dependency with a version/hash, it doesn't exist.
Treat any other input as a bug waiting to happen.
Further Study
- Reproducible Builds Project - https://reproducible-builds.org/ - Comprehensive guide to achieving bit-for-bit reproducibility across different systems
- Bazel Build Encyclopedia - https://bazel.build/concepts/build-ref - Learn how Google's build system enforces hermeticity at scale
- Docker Multi-Stage Builds - https://docs.docker.com/build/building/multi-stage/ - Practical techniques for hermetic containerized builds with dependency isolation
Congratulations! You can now identify and fix non-hermetic patterns in build systems. Remember: hermetic builds aren't just theoretical purity; they save real debugging time and make teams more productive. The upfront investment in fixing these patterns pays dividends every single day.
Next up: We'll explore tools and frameworks that enforce hermeticity automatically, so you don't have to remember all these rules manually!