```mermaid
sequenceDiagram
    participant User
    participant Xorq
    participant Normalizer
    participant Hasher
    User->>Xorq: Execute expression
    Xorq->>Normalizer: Extract computation logic
    Normalizer->>Hasher: Canonical form
    Hasher->>Xorq: a3f5c9d2e1b4...
    Xorq->>User: Build at builds/a3f5c9d2/
```
# Content-addressed hashing
Two developers write identical feature engineering logic independently without knowing about each other’s work. Traditional versioning treats them as separate entities with different version numbers, timestamps, and build identifiers. Content-addressed hashing recognizes they’re computationally identical and assigns the same hash, which allows automatic reuse across your team.
## What is content-addressed hashing?
Content-addressed hashing identifies computations by the logic they perform, including operations, filters, and transformations. Xorq generates a unique hash from your expression’s structure without considering when it ran or who executed it.
Two expressions with identical logic receive identical hashes even if they run on different days or machines, so anyone on your team computing this expression gets cached results immediately without coordination or manual version management.
```python
import xorq.api as xo

# Developer A runs this on Monday
con = xo.connect()
data = con.read_parquet("data.parquet")
result_a = data.filter(xo._.amount > 100).group_by("category").agg(total=xo._.amount.sum())

# Developer B runs this on Tuesday
result_b = data.filter(xo._.amount > 100).group_by("category").agg(total=xo._.amount.sum())

# Both get the same hash: a3f5c9d2e1b4...
# Developer B reuses Developer A's cached results
```

## Content hashing versus traditional versioning
Understanding how content hashing differs from traditional approaches clarifies when to use each method for versioning.
| Aspect | Content hashing | Traditional versioning |
|---|---|---|
| Identifier | Hash of computation logic | Timestamp or version number |
| Stability | Same computation = same hash | Same computation = different versions |
| Reuse | Automatic via hash match | Manual via version comparison |
| Collision risk | Cryptographically unlikely | Common (`v1` can mean different things) |
| Human readability | Low (`a3f5c9d2…`) | High (`v1`, `v2`, `v3`) |
| Cache invalidation | Built-in when hash changes | Manual when version bumps |
Xorq combines both approaches by using content hashes for machine addressing and aliases for human readability.
```shell
# Machine-addressable by hash
xorq run builds/a3f5c9d2

# Human-readable by alias
xorq catalog add builds/a3f5c9d2 --alias customer-features
xorq run customer-features
```

## Why duplicate work happens without content addressing
Traditional versioning uses timestamps, version numbers, or commit hashes to identify computations. If you run the same computation twice, then it produces two different versions because metadata changed even though logic stayed identical. Teams can’t systematically detect duplicate work without manual inspection, coordination meetings, and centralized documentation.
Three symptoms show how the lack of content addressing creates costly problems.
### Duplicate work wastes compute resources
Every developer recomputes the same features because there’s no systematic way to identify computational equivalence automatically. If three people independently build customer segmentation features and all spend 20 minutes computing identical aggregations, then that’s an hour of wasted compute that content addressing eliminates.
### Version drift creates deployment confusion
Version numbers don’t convey what actually changed in the computation logic. If two teams independently create “customer_features_v3,” then nobody knows if they’re the same computation without reading code. Deploying the wrong v3 to production turns debugging into an archaeological investigation of version history.
### Cache invalidation becomes manual guesswork
If you invalidate too aggressively, then you waste compute rerunning unchanged work. If you invalidate too conservatively, then you serve stale results. Either way, humans make decisions the system should make automatically. Content-addressed hashing makes the hash the definitive source of truth since identical computation produces identical hashes.
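The invalidation rule can be sketched as a content-keyed cache, where the hash itself is the cache key, so results are recomputed exactly when the logic changes. The `content_hash` helper and the spec layout below are illustrative stand-ins for Xorq's internals (which hash expression graphs via Dask's tokenize), not the real API:

```python
import hashlib
import json

def content_hash(spec):
    # Canonical JSON + MD5, truncated to 12 chars like Xorq's identifiers.
    # A simplified stand-in for hashing a real expression graph.
    return hashlib.md5(json.dumps(spec, sort_keys=True).encode()).hexdigest()[:12]

cache = {}
runs = []

def run_cached(spec, compute):
    key = content_hash(spec)   # the hash IS the cache key
    if key not in cache:       # miss: this logic has never run, so compute once
        cache[key] = compute()
    return cache[key]          # hit: identical logic, reuse the stored result

spec = {"filter": ["amount", ">", 100], "group_by": "category"}
run_cached(spec, lambda: runs.append(1) or "result")
run_cached(spec, lambda: runs.append(1) or "result")  # same hash: no recompute
assert len(runs) == 1
```

Because the key is derived from the logic, there is no separate invalidation decision to make: a changed spec hashes to a new key and misses; an unchanged spec hits.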
## How Xorq generates content hashes
Xorq generates content hashes through three stages that transform expressions into stable identifiers.
1. **Expression normalization** walks your expression graph and extracts the computation logic while ignoring metadata like timestamps or usernames.
2. **Hash computation** serializes the normalized expression to canonical form and applies MD5 hashing via Dask's `tokenize` function.
3. **Hash assignment** uses the resulting hash as the identifier for build directories, cache keys, and catalog entries. Xorq truncates the hash to 12 characters by default for human readability.

The sequence diagram at the top of this page illustrates these stages.
The hash depends on what you're computing: the operations, predicates, and transformations applied to the data. Change a filter threshold from 100 to 101 and you get a different hash, because the logic changed. Run the same filter on different dates and you get the same hash, because the logic stayed constant. The hash computation pipeline looks like this:
```mermaid
graph TB
    A[Expression Graph] --> B[Normalize]
    B --> C[Extract Operations]
    C --> D[Serialize to Canonical Form]
    D --> E[Compute Hash]
    E --> F[a3f5c9d2e1b4...]
    F --> G[Build Directory]
    F --> H[Cache Key]
    F --> I[Catalog Entry]
```
Content hashes remain stable across time and space. The same computation produces identical hashes regardless of execution context.
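A minimal sketch of that stability, using canonical JSON plus MD5 as a stand-in for Dask's `tokenize` (the spec layout is hypothetical, not Xorq's internal representation):

```python
import hashlib
import json

def content_hash(spec):
    # Canonical JSON + MD5, truncated to 12 chars like Xorq's identifiers.
    # (A simplified stand-in for tokenizing a real expression graph.)
    return hashlib.md5(json.dumps(spec, sort_keys=True).encode()).hexdigest()[:12]

# Hypothetical spec for the filter/group-by/agg pipeline shown earlier
spec = {
    "source": "data.parquet",
    "filter": ["amount", ">", 100],
    "group_by": "category",
    "agg": {"total": "sum(amount)"},
}

# Same logic on Monday or Tuesday, laptop or cluster: identical hash,
# because nothing about the execution context enters the input
assert content_hash(spec) == content_hash(dict(spec))
```

Sorting the keys during serialization is what makes the form canonical: two structurally identical specs hash the same even if their fields were written in a different order.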
## What influences the hash
Xorq includes specific computational elements in the hash while excluding metadata that doesn’t affect logic.
### Included: Computation logic
- **Operations**: Filters, joins, aggregations, and transformations all influence the hash.
- **Predicates**: Filter conditions and join conditions affect the hash. Changing `amount > 100` to `amount > 101` produces a different hash.
- **Column references**: Which columns you select, group by, or aggregate.
- **Function calls**: UDFs and aggregation functions influence the hash. Changing `sum()` to `mean()` produces a different hash.
- **Operation order**: Filter-then-group differs computationally from group-then-filter, even with identical individual operations.
### Excluded: Execution context
- **Input data values**: The hash remains unchanged regardless of input data, so the same computation on different datasets produces identical hashes.
- **Execution metadata**: Timestamps, user names, and machine IDs don't influence the hash.
- **Backend choice**: Usually doesn't change the hash, though backend-specific operations might, depending on their semantics.
Running identical filters on different datasets produces identical hashes because Xorq hashes logic, not data values. Only changing the filter logic itself changes the hash, such as adjusting a threshold from 100 to 101. The same customer segmentation logic produces identical hashes whether you run it Monday or Friday on different data.
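A small sketch makes the included/excluded distinction concrete, using canonical JSON plus MD5 as an illustrative stand-in for Dask's `tokenize` (the spec layout is hypothetical):

```python
import hashlib
import json

def content_hash(spec):
    # Canonical JSON + MD5, a simplified stand-in for Dask's tokenize
    return hashlib.md5(json.dumps(spec, sort_keys=True).encode()).hexdigest()[:12]

base = {"filter": ["amount", ">", 100], "group_by": "category"}
bumped = {"filter": ["amount", ">", 101], "group_by": "category"}

# Changing the predicate changes the hash: the logic changed
assert content_hash(base) != content_hash(bumped)

# Execution metadata (timestamps, usernames, machine IDs) never enters
# the spec, so it cannot influence the hash -- only the logic does
assert content_hash(base) == content_hash(
    {"group_by": "category", "filter": ["amount", ">", 100]}
)
```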
## How content hashing enables automatic reuse
Content hashing provides three reuse patterns that eliminate duplicate work automatically.
### Automatic cache reuse
When you execute an expression, Xorq checks whether anyone has computed this hash before. On a cache hit, results return instantly; on a miss, Xorq executes the expression and stores the results for future reuse.
```python
# First developer runs expensive computation
result = expensive_pipeline.execute()  # Takes 10 minutes, caches with hash a3f5c9d2

# Second developer runs same computation
result = expensive_pipeline.execute()  # Returns instantly from cache
```

### Team-wide discovery through catalogs
The catalog tracks which hashes exist so you can search for computations and discover others’ work.
```shell
# Check if this computation exists
xorq catalog ls | grep a3f5c9d2
# Output: customer-features a3f5c9d2 r1
# Someone already built this!
```

### Deterministic builds for deployment
Building the same expression multiple times produces the same hash for reproducible builds.
```shell
# Build on Monday
xorq build pipeline.py -e features
# Output: builds/a3f5c9d2/

# Build on Tuesday with no code changes
xorq build pipeline.py -e features
# Output: builds/a3f5c9d2/ (same hash!)
```

## Hash collisions and security considerations
Xorq uses MD5 hashing via Dask’s tokenize function to generate content hashes, truncated to 12 hexadecimal characters by default. With 12 hex characters, you have 16^12 possible values, which equals approximately 281 trillion combinations.
The collision probability remains extremely low for typical workflows. Even with 100,000 expressions computed across your entire team, the collision probability is roughly 0.002%. Most teams compute thousands or tens of thousands of expressions, well below any meaningful collision risk.
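That figure follows from the standard birthday bound and can be checked directly:

```python
import math

# Birthday-bound estimate for the truncated 12-hex-char (48-bit) hash space
N = 16 ** 12                 # ~281 trillion possible truncated hashes
n = 100_000                  # expressions computed across a team
p = 1 - math.exp(-n * (n - 1) / (2 * N))
# p is about 1.8e-5, i.e. roughly 0.002%
```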
MD5 suits content addressing because the goal is deterministic identification of computational graphs, not cryptographic security. It provides fast, consistent hashing, so identical expressions reliably produce identical hashes across machines and time periods.
Content hashes are identifiers for addressing computations, not encryption for securing sensitive data. Don't rely on hash obscurity for security; use proper access controls and credential management instead.
Hashes are identifiers for addressing computations, not sequential versions for temporal ordering: you can't determine which computation came before or after by comparing hash values. For human-readable versions, use catalog aliases with revision numbers (`r1`, `r2`, `r3`). Use aliases for human workflows (e.g. `customer-features-r1`) and hashes for machine addressing (e.g. `builds/a3f5c9d2`).
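A hypothetical catalog layout illustrates the split between the two roles (the entries, and in particular the second hash, are made up for illustration):

```python
# Aliases map to an ordered list of revisions, each pinned to a content hash
catalog = {
    "customer-features": [("r1", "a3f5c9d2e1b4"), ("r2", "b7c1d0e9f2a3")],
}

# Comparing the hash strings says nothing about which build came first;
# the revision numbers carry the temporal ordering instead
latest_rev, latest_hash = max(
    catalog["customer-features"], key=lambda rv: int(rv[0][1:])
)
assert latest_rev == "r2"
```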
## When content addressing provides value
**Use content hashing when:** multiple people share features or models; the team is larger than a few people; computation is expensive enough that cache reuse helps; reproducibility or compliance matters.
**Skip content hashing when:** you work solo with no reuse; computations are very fast; you don't use caching or catalogs; you only run one-off analyses.
## Understanding trade-offs
**Benefits**: Automatic reuse, team discovery via the catalog, reproducibility guarantees, and deterministic deploys.
**Costs**: Hash opacity (mitigate with catalog aliases for human readability), computation and storage overhead, and a learning curve.
Hashes are generated automatically during builds; you never compute them manually.
## Learning more
- Expression format covers expression manifests.
- Build system discusses how builds generate and use content hashes.
- Compute catalog details catalog indexing.
- Intelligent caching system explains caching mechanisms built on content hashes.
- Your first build tutorial provides hands-on practice with content hashing.
- Input-addressed computation covers the broader concept.