```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant BuildManager
    participant Compiler
    participant Filesystem
    User->>CLI: xorq build pipeline.py -e features
    CLI->>BuildManager: Load script
    BuildManager->>BuildManager: Extract expression
    BuildManager->>Compiler: Compile to YAML
    Compiler->>Filesystem: Write expr.yaml
    Compiler->>Filesystem: Write profiles.yaml
    Compiler->>Filesystem: Write metadata.json
    Note over Compiler,Filesystem: deferred_reads.yaml only with --debug
    Compiler->>BuildManager: Return hash
    BuildManager-->>User: builds/a3f5c9d2/
```
# Build system
A feature pipeline runs perfectly on your laptop but breaks in production because of different Python versions and missing dependencies. The build system prevents this by compiling your expressions into portable YAML artifacts that execute identically across any environment.
## What is the build system?
The build system compiles your Xorq expressions into executable artifacts. When you run xorq build, Xorq takes your Python code, extracts the expression graph, and generates YAML files that capture the computation logic.
These artifacts are self-contained. They include the expression definition, backend configurations, data source information, and metadata. You can share them, version them in Git, and execute them on any machine with Xorq installed.
```shell
# Build an expression
xorq build pipeline.py -e features

# Output: builds/a3f5c9d2/
# ├── expr.yaml
# ├── profiles.yaml
# └── metadata.json
# (deferred_reads.yaml generated only with --debug flag)
```

## Why the build system matters
Without builds, you distribute Python code and hope it runs the same way everywhere. Dependencies might differ, Python versions might vary, and execution behavior becomes unpredictable.
This creates three problems in production:
Environment dependency breaks deploys. Your code works on your laptop with Python 3.11 and pandas 2.1 but fails in production with Python 3.9 and pandas 1.5. Different package versions mean different behavior. You spend hours debugging environment mismatches that shouldn’t exist in the first place.
No separation of concerns slows iteration. Writing expressions and running them happen in the same step. You can’t validate expressions without executing them. You can’t execute without the development environment. Change one line? Redeploy the entire codebase.
Hard to version creates audit nightmares. Git versions your Python code, but not the computation itself. Two developers with the same code might produce different results if their environments differ. When regulators ask “Which version ran in March?”, you can’t answer confidently.
The build system solves these by creating a deterministic artifact that captures exactly what to compute, independent of the execution environment.
## How the build system works
Building operates in four stages:
1. Expression extraction: Xorq loads your Python script and extracts the specified expression variable. It validates that the variable contains a valid Xorq expression.
2. Graph compilation: Xorq walks the expression graph and compiles it to a canonical YAML representation. This includes operations, schemas, and dependencies.
3. Artifact generation: Xorq generates three core files: expr.yaml contains the expression, profiles.yaml contains backend configs, and metadata.json contains build info. When you run xorq build with the --debug flag, Xorq also generates deferred_reads.yaml containing data source information.
4. Hash computation: Xorq computes a content hash from the expression logic. This hash becomes the build directory name and serves as the unique identifier.

The sequence diagram at the top of this page traces these stages from the CLI through to the filesystem.
The build process is deterministic. Building the same expression twice produces the same hash and identical artifacts. This enables reproducible deployments. The transformation from expression to build artifacts looks like this:
```mermaid
graph TB
    A[Python Expression] --> B[Extract Graph]
    B --> C[Compute Content Hash]
    C --> D[Hash: a3f5c9d2]
    B --> E[Compile to YAML]
    E --> F[expr.yaml]
    E --> G[profiles.yaml]
    E --> H[metadata.json]
    E -.Debug mode.-> I[deferred_reads.yaml]
    D --> J[builds/a3f5c9d2/]
    F --> J
    G --> J
    H --> J
    I -.-> J
```
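The content-addressed naming idea can be sketched in plain Python (this is an illustration of the concept, not Xorq's actual hashing code): serialize the expression graph canonically, hash the bytes, and use a prefix of the digest as the build directory name.

```python
import hashlib
import json

def content_hash(expr_graph: dict) -> str:
    """Hash a canonical serialization of an expression graph.

    Canonical means sorted keys and no insignificant whitespace, so the
    same logical graph always produces the same bytes, and thus the same hash.
    """
    canonical = json.dumps(expr_graph, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Two graphs with the same logic but different key order hash identically.
g1 = {"op": "filter", "source": "customers", "predicate": "age > 18"}
g2 = {"predicate": "age > 18", "op": "filter", "source": "customers"}
assert content_hash(g1) == content_hash(g2)
print(f"builds/{content_hash(g1)}/")
```

Because the hash depends only on the serialized logic, renaming a local Python variable or rebuilding on a different day does not change the artifact's identity.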
Builds separate development from execution. You write expressions in Python (development), build them to YAML (compilation), then run them anywhere (execution). This separation supports testing, versioning, and deployment workflows.
## Build artifacts
Each build generates three core artifacts in a content-addressed directory, plus optional debug artifacts:
### expr.yaml
The complete expression definition in YAML format. This includes:
- All operations like filters, joins, and aggregations
- Column schemas for each operation
- Dependencies between operations
- UDF definitions and configurations
This is the core artifact that gets executed. It’s engine-independent and portable.
### profiles.yaml
Backend connection configurations. This specifies:
- Which backends the expression uses
- Connection parameters like host, port, and database
- Credential references to environment variables
Profiles are referenced by hash, allowing environment-specific configurations.
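As a purely hypothetical illustration of the idea (the real schema is defined by Xorq; consult a generated profiles.yaml), a profile entry might reference credentials through environment variables rather than embedding them in the file:

```yaml
# Hypothetical shape -- field names here are illustrative, not Xorq's schema
8f2e1c_postgres:
  backend: postgres
  host: ${POSTGRES_HOST}
  port: 5432
  database: analytics
  password: ${POSTGRES_PASSWORD}  # resolved from the environment at run time
```

Keeping only references in the artifact means the same build can connect to dev and prod databases without rebuilding, and no secrets land in Git.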
### deferred_reads.yaml
Information about data sources that load at execution time:
- File paths for Parquet/CSV files
- Table names for database reads
- SQL queries for deferred reads
This supports late binding: Data sources resolve at execution time, not build time. This artifact is generated only when you run xorq build with the --debug flag.
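Late binding can be sketched in plain Python (a conceptual illustration, not Xorq internals): the build records *where* to read, and the location resolves only when execution starts.

```python
import os

def make_deferred_read(path_template: str):
    """Record a read at build time without touching any data."""
    def resolve() -> str:
        # Environment variables expand at execution time, so the same
        # artifact can point at dev data locally and prod data on a server.
        return os.path.expandvars(path_template)
    return resolve

read = make_deferred_read("$DATA_DIR/customers.parquet")  # build time: no I/O
os.environ["DATA_DIR"] = "/srv/data"                      # execution environment
print(read())  # -> /srv/data/customers.parquet
```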
### metadata.json
Build metadata for reproducibility:
- Xorq version (`current_library_version`)
- Python version (`sys-version-info`)
- Git state (`git_state`) if Git is present
- Metadata version (`metadata_version`)
This supports debugging and ensures you can identify the build environment. Note that dependency versions are not included in regular builds; use xorq uv-build for dependency tracking.
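A sketch of how a build tool could gather these fields with the standard library (illustrative only; field names mirror the metadata.json keys above, but this is not Xorq's implementation):

```python
import json
import subprocess
import sys

def collect_metadata(library_version: str) -> dict:
    """Gather the environment facts a build record needs."""
    meta = {
        "current_library_version": library_version,
        "metadata_version": "0.0.0",
        "sys-version-info": list(sys.version_info),  # e.g. [3, 11, 5, "final", 0]
    }
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
        branch = subprocess.check_output(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
        meta["git_state"] = {"commit": commit, "branch": branch}
    except (subprocess.CalledProcessError, FileNotFoundError):
        meta["git_state"] = None  # no Git repo (or no git binary) present
    return meta

print(json.dumps(collect_metadata("0.3.4"), indent=2))
```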
## Build directory structure
Builds organize in a content-addressed directory structure:
```text
builds/
├── a3f5c9d2e1b4/   # Build hash
│   ├── expr.yaml
│   ├── profiles.yaml
│   └── metadata.json
│   # (deferred_reads.yaml only with --debug)
├── b7e3f1a8c5d9/   # Different expression
│   ├── expr.yaml
│   ├── profiles.yaml
│   └── metadata.json
```
The directory name is the content hash. Same expression = same hash = same directory. This provides automatic deduplication: If you build the same expression twice, you get the same directory.
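The deduplication property follows directly from content addressing, as this plain-Python sketch shows (a conceptual demo, not Xorq code): writing the same artifact twice lands in the same directory, so nothing is stored twice.

```python
import hashlib
import tempfile
from pathlib import Path

def write_build(builds_dir: Path, expr_yaml: str) -> Path:
    """Write an artifact into a directory named by its content hash."""
    build_hash = hashlib.sha256(expr_yaml.encode()).hexdigest()[:12]
    build_dir = builds_dir / build_hash
    build_dir.mkdir(parents=True, exist_ok=True)  # idempotent if it already exists
    (build_dir / "expr.yaml").write_text(expr_yaml)
    return build_dir

builds = Path(tempfile.mkdtemp())
first = write_build(builds, "op: filter\npredicate: age > 18\n")
second = write_build(builds, "op: filter\npredicate: age > 18\n")
assert first == second                   # same content, same directory
assert len(list(builds.iterdir())) == 1  # no duplicate build
```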
## How builds enable reproducibility
Builds provide three levels of reproducibility:
### Computation reproducibility
The expr.yaml captures computation logic exactly. Same expression = same computation, regardless of when or where you run it.
```shell
# Developer A builds on Monday
xorq build features.py -e customer_features
# Hash: a3f5c9d2

# Developer B builds on Tuesday
xorq build features.py -e customer_features
# Hash: a3f5c9d2 (same!)
# Identical computation guaranteed
```

### Environment reproducibility
The metadata.json captures the build environment. You can recreate the exact Python environment that produced the build.
```json
{
  "current_library_version": "0.3.4",
  "metadata_version": "0.0.0",
  "sys-version-info": [3, 11, 5, "final", 0],
  "git_state": {
    "commit": "a3f5c9d2e1b4...",
    "branch": "main"
  }
}
```

### Execution reproducibility
The build artifacts are self-contained. You can execute them on any machine with Xorq, without the original Python code.
```shell
# Build on laptop
xorq build pipeline.py -e features

# Copy builds/ directory to server
scp -r builds/ server:/app/

# Execute on server (no pipeline.py needed!)
xorq run builds/a3f5c9d2
```

## When to build versus run directly
Deciding when to build depends on your deployment patterns and reproducibility requirements.
Build when:
- You’re deploying to production and need reproducible artifacts for scheduled execution.
- You need reproducible artifacts for auditing because compliance requires exact computation versioning.
- You want to separate development from execution so data scientists write expressions while engineers deploy builds.
- Multiple environments need the same computation so dev, staging, and production run identical logic.
- You’re versioning computations in Git to track which computation version ran when.
- Execution happens on different machines than development so you build locally and deploy remotely.
Run directly without building when:
- You’re doing interactive development in notebooks and need to iterate quickly without build overhead.
- You’re prototyping and the build step slows experimentation.
- You’re doing one-off or exploratory analysis that never repeats, so you don’t need reproducibility guarantees, deployment, or versioning.
- Development and execution happen on the same machine so portability isn’t required.
If you’re on a team where data scientists develop locally but pipelines run on production servers, builds bridge this gap. Scientists write Python locally and builds deploy to production without environment mismatches or version conflicts.
## Build commands
Xorq provides two build commands:
### xorq build
Standard build for regular Python environments:
```shell
xorq build pipeline.py -e features --builds-dir builds
```

This uses your current Python environment and dependencies.
### xorq uv-build
Hermetic build with isolated Python environment:
```shell
xorq uv-build pipeline.py -e features --builds-dir builds
```

This uses uv to create an isolated environment with pinned dependencies. The build is fully reproducible across machines.
Use uv-build when reproducibility is critical for production deployments. Use regular build for development.
## Trade-offs
Benefits: Reproducibility, portability, versioning via Git, separation of Python development from YAML execution, build-time validation, and deduplication.
Costs: a required build step, storage overhead, an added layer of indirection, a learning curve, and failures now surface at two points: build time and execution time.
## Learning more
Expression format explains how builds generate expression manifests. Content-addressed hashing covers how build directories are named by content hash.
Compute catalog details how builds get registered in the catalog. Reproducible environments with uv discusses how uv-build creates hermetic builds.
Your first build tutorial provides hands-on practice with building. Build CLI reference covers complete build command documentation.