OxyMake
Next-generation workflow orchestration in Rust. The uv of computational workflows.
OxyMake is a fast, declarative workflow orchestration tool that combines the proven ideas of Snakemake (file-based rules, backward-chaining DAG, wildcards) with modern engineering: content-addressable caching, polyglot execution, in-memory data passing, and first-class support for both human and AI agent users.
Key Features
- Fast DAG resolution: 10K-job DAG resolved in 69 ms on M4 Max, 33.3× faster than Snakemake 7.32.4 (100K-job scaling out of scope for this benchmark wave; cold end-to-end is slower than Snakemake — an honest trade for content-addressable correctness)
- Content-addressable: no phantom re-runs from git checkout or file copies
- Polyglot: shell, Python, R, Julia — each rule chooses its language
- Daemon-free: ox run starts, works, exits. No server to manage.
- Agent-friendly: --json output, structured events, typed API
- Scales: same workflow on laptop, SLURM cluster, or Ray cluster (Kubernetes designed, not yet implemented)
Quick Example
# Oxymakefile.toml
ox_version = "0.1"
[config]
samples = ["A", "B", "C"]
[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.json"]
shell = "python process.py {input} {output}"
[rule.report]
input = ["results/{sample}.json"]
output = ["reports/summary.html"]
shell = "python report.py {input} > {output}"
ox run # build everything
ox run -j 8 # 8 parallel jobs
ox status # what's running?
ox plan # what would run?
Installation
OxyMake is a single binary called ox, written in Rust. There are several
ways to install it.
Install (from source)
git clone https://github.com/noogram/oxymake.git
cd oxymake
cargo install --path crates/ox-cli
This installs both ox and oxymake to ~/.cargo/bin/. Make sure this
directory is in your $PATH.
Development setup
For working on OxyMake itself:
git clone https://github.com/noogram/oxymake.git
cd oxymake
cargo build # debug build → target/debug/ox
cargo test --workspace # run all tests
cargo run --bin ox -- --help # run without installing
With just (recommended):
just build # debug build
just test # all tests
just demo # interactive feature demo
just lint # clippy checks
just ci # full CI check (fmt + lint + test + demo)
just --list # all available recipes
Prerequisites
Required
- Rust 1.85+ (for installation from source)
Optional (depending on your workflow)
- Python 3.9+ -- for rules using lang = "python"
- uv -- for environment = { uv = "pyproject.toml" } (install uv)
- conda/mamba -- for environment = { conda = "..." }
- Docker -- for environment = { docker = "..." }
- Nix -- for environment = { nix = "..." }
Verify Installation
ox --version
# ox 0.1.0
ox init
# Initialized OxyMake project in .
# Created: Oxymakefile.toml
# Created: .oxymake/
What Gets Installed
OxyMake is a single binary with no runtime dependencies. All state is
stored in a .oxymake/ directory within your project:
your-project/
  Oxymakefile.toml    # Your workflow definition
  .oxymake/
    state.db          # SQLite execution state
    cache/            # Content-addressable cache
    logs/             # Job execution logs
No daemon, no server, no background processes. Each ox run is a
self-contained process that reads state, executes, writes state, and exits.
Next Steps
Now that OxyMake is installed, head to Your First Workflow to build something.
Quickstart
Get up and running with OxyMake in under five minutes. This guide covers only features that are tested and working in v0.1.0.
Install
Build and install from source (Rust 1.85+ required):
git clone https://github.com/noogram/oxymake.git
cd oxymake
cargo install --path crates/ox-cli
This installs both ox and oxymake to ~/.cargo/bin/.
Verify:
ox --version
# ox 0.1.0
Create a Project
mkdir my-pipeline
cd my-pipeline
ox init
This creates a starter Oxymakefile.toml and a .oxymake/ directory.
The generated template uses {input} and {output} placeholders for input/output file expansion, plus {config.key} for config substitution.
Your First Workflow
Create the Oxymakefile:
cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[config]
samples = ["A", "B"]
# Default target: require all results to exist.
[rule.all]
input = ["results/{sample}.txt"]
# Process each sample's CSV into a sorted text file.
[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.txt"]
shell = "sort data/{sample}.csv > results/{sample}.txt"
EOF
Key concepts:
- [config] defines variables. Here samples = ["A", "B"] means OxyMake will create one job per sample.
- {sample} in paths and shell commands is replaced with each value from the config list.
- [rule.all] is the default target. It has inputs but no outputs, so it just ensures its inputs exist.
- Use explicit paths with config variables in shell commands (e.g., data/{sample}.csv), not {input}/{output}.
Create some input data:
mkdir -p data results
echo -e "charlie,3\nalpha,1\nbravo,2" > data/A.csv
echo -e "zulu,26\nmike,13" > data/B.csv
Validate
Check your Oxymakefile for errors:
ox lint
# Oxymakefile is valid (2 rules)
Preview (Dry Run)
See what OxyMake would do without running anything:
ox run --dry-run
Output:
Dry run: 2 job(s) would execute for 2 target(s)
[process-B] rule=process outputs=[results/B.txt]
[process-A] rule=process outputs=[results/A.txt]
Run
Execute the workflow:
ox run
Output:
Completed: 2 succeeded, 0 failed, 0 skipped, 0 cancelled (0.0s)
Check the results:
cat results/A.txt
# alpha,1
# bravo,2
# charlie,3
Caching
Run the same command again:
ox run
Output:
Cache: 2 of 2 job(s) up-to-date, skipping.
Completed: 0 succeeded, 0 failed, 2 skipped, 0 cancelled (0.0s)
Nothing ran. OxyMake detected that all inputs are unchanged and all outputs exist. Modify an input and re-run to see only the affected jobs execute.
Build a Specific Target
Build only one output:
rm results/A.txt
ox run results/A.txt
Only process-A runs. results/B.txt is untouched.
Multi-Step Pipeline
OxyMake resolves dependency chains automatically. Here is a two-step pipeline that uppercases text, then counts characters:
mkdir pipeline && cd pipeline
cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[config]
names = ["alice", "bob"]
[rule.all]
input = ["final/{name}.txt"]
[rule.uppercase]
input = ["raw/{name}.txt"]
output = ["mid/{name}.txt"]
shell = "tr '[:lower:]' '[:upper:]' < raw/{name}.txt > mid/{name}.txt"
[rule.count]
input = ["mid/{name}.txt"]
output = ["final/{name}.txt"]
shell = "wc -c < mid/{name}.txt > final/{name}.txt"
EOF
mkdir -p raw mid final
echo "hello world" > raw/alice.txt
echo "oxymake rocks" > raw/bob.txt
ox run --dry-run
# 4 jobs: uppercase-alice, uppercase-bob, count-alice, count-bob
ox run
# Completed: 4 succeeded, 0 failed, 0 skipped, 0 cancelled (0.0s)
cat final/alice.txt
# 12
OxyMake figures out that count depends on uppercase and runs them in the
correct order.
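The ordering can be illustrated with a topological sort. The sketch below is not OxyMake's resolver: it copies the four jobs from this example into a hypothetical dict layout, derives dependencies by matching each job's inputs to the job that produces them, and uses Python's standard graphlib to obtain a valid execution order.

```python
from graphlib import TopologicalSorter

# Job specs copied from the example above; the dict layout itself is
# hypothetical and only serves this illustration.
jobs = {
    "uppercase-alice": {"inputs": ["raw/alice.txt"], "outputs": ["mid/alice.txt"]},
    "uppercase-bob": {"inputs": ["raw/bob.txt"], "outputs": ["mid/bob.txt"]},
    "count-alice": {"inputs": ["mid/alice.txt"], "outputs": ["final/alice.txt"]},
    "count-bob": {"inputs": ["mid/bob.txt"], "outputs": ["final/bob.txt"]},
}

# Map every output file to the job that produces it.
producer = {out: job for job, spec in jobs.items() for out in spec["outputs"]}

# A job depends on whichever jobs produce its inputs. Files with no
# producer (raw/*.txt) are source files and add no edge.
deps = {
    job: {producer[path] for path in spec["inputs"] if path in producer}
    for job, spec in jobs.items()
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any order in which each uppercase job precedes its matching count job is valid; the two samples are independent of each other.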
Error Handling
If a job fails, OxyMake stops and reports the failure:
cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[rule.broken]
output = ["out.txt"]
shell = "exit 1"
EOF
ox run
# error: job broken failed: exit code 1
# Completed: 0 succeeded, 1 failed, 0 skipped, 0 cancelled (0.0s)
# Exit code: 1
Use --keep-going (or -k) to continue running independent jobs even when
one fails:
cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[config]
items = ["ok", "fail"]
[rule.all]
input = ["out/{item}.txt"]
[rule.process]
input = ["in/{item}.txt"]
output = ["out/{item}.txt"]
shell = "if [ '{item}' = 'fail' ]; then exit 1; fi; cp in/{item}.txt out/{item}.txt"
EOF
mkdir -p in out
echo "good" > in/ok.txt
echo "bad" > in/fail.txt
ox run -k
# Completed: 1 succeeded, 1 failed, 0 skipped, 0 cancelled (0.0s)
# Exit code: 1
# out/ok.txt was created; out/fail.txt was not
Static Rules
Rules without config variables produce a single job:
cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[rule.greet]
output = ["greeting.txt"]
shell = "echo 'Hello OxyMake' > greeting.txt"
EOF
ox run
cat greeting.txt
# Hello OxyMake
Alternate Oxymakefile
Use -f to point to a different file:
ox run -f path/to/other.toml
Known Limitations (v0.1.0)
- -j N (parallel execution): All jobs run sequentially regardless of the -j value.
- --set (config override): Does not override config values.
Next Steps
- Read Your First Workflow for a more detailed walkthrough
- Explore ox run --help for all available options
Your First Workflow
This tutorial walks you through creating a simple 3-rule workflow from scratch. By the end, you will understand how OxyMake resolves dependencies, runs jobs, and caches results.
Step 1: Create a Project
Create a new directory and initialize OxyMake:
mkdir my-pipeline
cd my-pipeline
ox init
This creates a starter Oxymakefile.toml. We will replace its contents.
Step 2: Create Some Input Data
Create a data/ directory with two CSV files:
mkdir data
data/alice.csv:
name,score
Alice,85
Alice,92
Alice,78
data/bob.csv:
name,score
Bob,91
Bob,88
Bob,95
Step 3: Write the Workflow
Replace the contents of Oxymakefile.toml with:
ox_version = "0.1"
[config]
students = ["alice", "bob"]
# Rule 1: Compute statistics for each student
[rule.stats]
input = ["data/{student}.csv"]
output = ["results/{student}_stats.json"]
lang = "python"
run = """
import csv, json
scores = []
with open("{input}") as f:
    for row in csv.DictReader(f):
        scores.append(int(row["score"]))
stats = {
    "student": "{wildcards.student}",
    "mean": sum(scores) / len(scores),
    "min": min(scores),
    "max": max(scores),
    "count": len(scores),
}
with open("{output}", "w") as f:
    json.dump(stats, f, indent=2)
"""
# Rule 2: Combine all student stats into a summary
[rule.summary]
input = ["results/{student}_stats.json"]
output = ["results/summary.json"]
lang = "python"
run = """
import json, glob
all_stats = []
for path in sorted(glob.glob("results/*_stats.json")):
    with open(path) as f:
        all_stats.append(json.load(f))
with open("{output}", "w") as f:
    json.dump(all_stats, f, indent=2)
"""
# Rule 3: Default target -- build the summary
[rule.all]
input = ["results/summary.json"]
This workflow has three rules:
- stats -- computes per-student statistics (runs once per student)
- summary -- combines all stats into one file
- all -- an aggregation target that tells OxyMake what to build
Interpolation note. Inside run/shell blocks, OxyMake substitutes the placeholders it recognizes -- {input}, {output}, {wildcards.X}, {config.X}, and so on -- and leaves everything else untouched. It does not treat {{ / }} as escaped braces, so write ordinary Python dict literals with single braces (stats = { ... }). The recognized placeholders are listed in the Expression Language reference.
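To make the interpolation rule concrete, here is a minimal sketch of "substitute what you recognize, leave the rest alone". It is an illustration only, not OxyMake's implementation: the regular expression covers just the four placeholder shapes named in the note, and the interpolate function and its values mapping are hypothetical.

```python
import re

# Assumed placeholder shapes, taken from the note above: {input}, {output},
# {wildcards.X}, {config.X}. Anything else in braces is left as-is.
PLACEHOLDER = re.compile(r"\{(input|output|wildcards\.\w+|config\.\w+)\}")


def interpolate(template, values):
    """Replace recognized placeholders; keep every other brace untouched."""

    def replace(match):
        key = match.group(1)
        # A recognized shape with no value is also left unchanged.
        return str(values.get(key, match.group(0)))

    return PLACEHOLDER.sub(replace, template)


script = 'stats = {"student": "{wildcards.student}"}\nopen("{output}", "w")'
print(interpolate(script, {
    "wildcards.student": "alice",
    "output": "results/alice_stats.json",
}))
```

The outer braces of the dict literal survive because their content does not match any recognized placeholder shape, which is why single braces are safe in inline Python.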
Step 4: Plan
Before running, see what OxyMake will do:
ox plan
You should see something like:
Plan: 3 rules, 3 jobs, 2 source files
Targets: results/summary.json
1. [stats-bob] rule=stats -> [results/bob_stats.json]
2. [stats-alice] rule=stats -> [results/alice_stats.json]
3. [summary] rule=summary -> [results/summary.json]
OxyMake resolved the {student} wildcard from config.students and
created two concrete jobs for the stats rule (with the ids stats-alice
and stats-bob), plus one for summary.
Step 5: Run
ox run
Output (timings will vary):
Resolving 3 jobs (3 to run, 0 cached)
▸ summary — upstream rebuilt
✓ Completed 3/3 in 0.6s (4.8 jobs/s)
3 succeeded
Completed: 3 succeeded, 0 failed, 0 skipped, 0 cancelled (0.6s)
The last line is the canonical summary: N succeeded, N failed, N skipped, N cancelled. A run is successful when failed and cancelled are both 0.
Check the results:
cat results/alice_stats.json
{
"student": "alice",
"mean": 85.0,
"min": 78,
"max": 92,
"count": 3
}
Step 6: See Caching in Action
Run the same command again:
ox run
Output:
Cache: 3 of 3 job(s) up-to-date, skipping.
Completed: 0 succeeded, 0 failed, 3 skipped, 0 cancelled (0.0s)
Nothing ran. OxyMake detected that all inputs are unchanged and all
outputs exist with the correct content hashes, so all three jobs are
reported as skipped.
Now modify one input:
echo "Alice,99" >> data/alice.csv
ox run
Output:
Cache: 1 of 3 job(s) up-to-date, skipping.
Resolving 3 jobs (2 to run, 1 cached)
[1/3] ✓ stats-bob [cached]
▸ summary — upstream rebuilt
✓ Completed 3/3 in 0.4s (7.5 jobs/s)
2 succeeded, 1 skipped
Completed: 2 succeeded, 0 failed, 1 skipped, 0 cancelled (0.4s)
Only stats-alice and summary re-ran. stats-bob was cached (reported
as skipped) because its input did not change.
Step 7: Add a New Student
Edit Oxymakefile.toml and add a student:
[config]
students = ["alice", "bob", "charlie"]
Create the data file:
echo "name,score
Charlie,76
Charlie,82
Charlie,90" > data/charlie.csv
Run again:
ox run
Cache: 2 of 4 job(s) up-to-date, skipping.
Resolving 4 jobs (2 to run, 2 cached)
[1/4] ✓ stats-alice [cached]
[2/4] ✓ stats-bob [cached]
▸ summary — upstream rebuilt
✓ Completed 4/4 in 0.4s (10.5 jobs/s)
2 succeeded, 2 skipped
Completed: 2 succeeded, 0 failed, 2 skipped, 0 cancelled (0.4s)
Only the new student's stats and the summary that depends on them were
computed. Alice's and Bob's stats were cached (reported as skipped).
What You Learned
- Rules declare intent -- input/output patterns with wildcards
- Config drives expansion -- students = [...] determines which jobs are created
- Content-addressable caching -- unchanged inputs mean cached outputs
- Incremental execution -- adding data or rules only computes what is new
- Backward chaining -- OxyMake figures out the dependency order automatically
Next Steps
- The Three Graphs -- understand how OxyMake resolves your workflow
- Content-Addressable Cache -- why caching works correctly
- Execution Modes -- the four ways to execute a rule
Understanding the Output
When you run ox run, OxyMake provides structured feedback about what it is
doing and why. This page explains the output formats, using the 3-rule
workflow from Your First Workflow (stats for two
students, plus a summary).
Terminal Output (Default)
By default, OxyMake prints human-readable progress and ends with a canonical summary line (timings will vary):
Resolving 3 jobs (3 to run, 0 cached)
▸ summary — upstream rebuilt
✓ Completed 3/3 in 0.6s (4.8 jobs/s)
3 succeeded
Completed: 3 succeeded, 0 failed, 0 skipped, 0 cancelled (0.6s)
The last line is the canonical summary, always in the same shape:
Completed: N succeeded, N failed, N skipped, N cancelled (<elapsed>)
- succeeded -- jobs that ran and produced their outputs
- failed -- jobs whose command exited non-zero
- skipped -- jobs whose outputs were already up to date (cache hits)
- cancelled -- jobs that did not run because an upstream job failed
A run is successful (exit code 0) when both failed and cancelled are 0.
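Because the summary line has a fixed shape, a script can check it with a single regular expression. The sketch below is one possible way to do that; parse_summary is a hypothetical helper, and for anything beyond a quick check the --json output is likely the better interface.

```python
import re

# Mirrors the documented shape:
# Completed: N succeeded, N failed, N skipped, N cancelled (<elapsed>)
SUMMARY = re.compile(
    r"Completed: (\d+) succeeded, (\d+) failed, (\d+) skipped, "
    r"(\d+) cancelled \((.+)\)"
)


def parse_summary(line):
    """Return the four counters plus a success flag, or None if no match."""
    match = SUMMARY.fullmatch(line.strip())
    if match is None:
        return None
    succeeded, failed, skipped, cancelled = (int(g) for g in match.groups()[:4])
    return {
        "succeeded": succeeded,
        "failed": failed,
        "skipped": skipped,
        "cancelled": cancelled,
        "elapsed": match.group(5),
        # Success rule from the text: failed and cancelled are both 0.
        "ok": failed == 0 and cancelled == 0,
    }


print(parse_summary("Completed: 3 succeeded, 0 failed, 0 skipped, 0 cancelled (0.6s)"))
```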
Cached Jobs
When outputs are already up to date, OxyMake skips the work and reports the
jobs as skipped:
Cache: 3 of 3 job(s) up-to-date, skipping.
Completed: 0 succeeded, 0 failed, 3 skipped, 0 cancelled (0.0s)
On a partial re-run (one input changed), the cached jobs are listed and the summary reflects the split:
Cache: 1 of 3 job(s) up-to-date, skipping.
Resolving 3 jobs (2 to run, 1 cached)
[1/3] ✓ stats-bob [cached]
▸ summary — upstream rebuilt
✓ Completed 3/3 in 0.4s (7.5 jobs/s)
2 succeeded, 1 skipped
Completed: 2 succeeded, 0 failed, 1 skipped, 0 cancelled (0.4s)
Plan Output
Use ox plan to see what would run without executing:
ox plan
Plan: 3 rules, 3 jobs, 2 source files
Targets: results/summary.json
1. [stats-bob] rule=stats -> [results/bob_stats.json]
2. [stats-alice] rule=stats -> [results/alice_stats.json]
3. [summary] rule=summary -> [results/summary.json]
The header reports the totals (N rules, N jobs, N source files), followed by
the requested targets and the concrete jobs, each shown as
[job-id] rule=<rule> -> [outputs].
JSON Output (Agent Mode)
Add --json to ox run for structured NDJSON output -- one self-contained
JSON event per line:
ox run --json
{"event":"run_started","total_jobs":3,"to_run":3,"cached":0}
{"event":"job_started","job_id":"stats-bob","executor":"local","reason":"cache_miss"}
{"event":"job_completed","job_id":"stats-bob","duration_ms":209,"outputs":["results/bob_stats.json"]}
{"event":"job_started","job_id":"stats-alice","executor":"local","reason":"cache_miss"}
{"event":"job_completed","job_id":"stats-alice","duration_ms":200,"outputs":["results/alice_stats.json"]}
{"event":"job_started","job_id":"summary","executor":"local","reason":"upstream_rebuilt"}
{"event":"job_completed","job_id":"summary","duration_ms":194,"outputs":["results/summary.json"]}
{"event":"run_completed","total":3,"succeeded":3,"failed":0,"skipped":0,"cancelled":0,"duration_ms":607}
Each event carries an event discriminant (run_started, job_started,
job_completed, run_completed). This format is designed for AI agents and
scripts to parse programmatically. Use --report-json <path> to write the
same stream to a file. See Agent-Driven Workflows
for details.
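A consumer of this stream only needs a JSON parser and a loop over lines. The sketch below assumes nothing beyond the field names visible in the sample events above; summarize is a hypothetical helper, and a real consumer would read the lines from the ox run --json process rather than from a string.

```python
import json


def summarize(ndjson_text):
    """Collect per-job durations and the final run_completed event."""
    durations = {}
    final = None
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event["event"] == "job_completed":
            durations[event["job_id"]] = event["duration_ms"]
        elif event["event"] == "run_completed":
            final = event
    return durations, final


# Three of the events shown above, verbatim.
sample = "\n".join([
    '{"event":"run_started","total_jobs":3,"to_run":3,"cached":0}',
    '{"event":"job_completed","job_id":"stats-bob","duration_ms":209,'
    '"outputs":["results/bob_stats.json"]}',
    '{"event":"run_completed","total":3,"succeeded":3,"failed":0,'
    '"skipped":0,"cancelled":0,"duration_ms":607}',
])

durations, final = summarize(sample)
run_ok = final["failed"] == 0 and final["cancelled"] == 0
print(durations, run_ok)
```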
DAG Visualization
Use ox dag to render the dependency graph. The default format is Graphviz
DOT:
ox dag
digraph oxymake {
rankdir=LR;
"results/summary.json" -> "all";
"stats" -> "results/{student}_stats.json";
"data/{student}.csv" -> "stats";
"summary" -> "results/summary.json";
"results/{student}_stats.json" -> "summary";
}
Other formats:
ox dag --format mermaid # Mermaid graph syntax
ox dag --format dot # Graphviz DOT (same as default)
ox dag --group-by rule # Collapse nodes by field
ox dag --json # Structured JSON
To trace a single target's dependency chain instead, use ox explain:
ox explain results/summary.json
Dependency chain for: results/summary.json
► 1. [summary] rule=summary
inputs: [results/alice_stats.json, results/bob_stats.json]
outputs: [results/summary.json]
2. [stats-alice] rule=stats
inputs: [data/alice.csv]
outputs: [results/alice_stats.json]
3. [stats-bob] rule=stats
inputs: [data/bob.csv]
outputs: [results/bob_stats.json]
Error Output
When a job fails, OxyMake reports the failure, cancels the dependent jobs, and ends with a non-zero exit code:
Resolving 1 jobs (1 to run, 0 cached)
[1/1] ✗ broken FAILED (exit 1)
error: job broken failed: exit code 1
stderr: --- stderr ---
stderr: boom
✗ Completed 1/1 in <0.1s
1 failed
Failed: broken
Completed: 0 succeeded, 1 failed, 0 skipped, 0 cancelled (0.0s)
Failed jobs (showing 1 of 1):
broken: boom
Run 'ox logs --failed' for full details.
ox logs --failed prints the full captured output of each failed job. In
--json mode, the failure is reported as a job_completed event with a
non-success status, so automated tooling can recover programmatically.
Verbosity Levels
Control output detail with -v:
ox run # Normal output
ox run -v # Verbose: job start/end, durations, and exit codes
ox run -vv # Debug: also show each job's stdout/stderr
Next Steps
- CLI Commands -- full command reference
- Execution Modes -- how rules are executed
Rules and Wildcards
What is a Rule?
A rule declares a transformation: given these inputs, produce these outputs by running this command. OxyMake figures out what needs to run based on what you ask for.
[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.txt"]
shell = "python process.py {input} {output}"
This single rule handles ANY sample. When you ask for results/A.txt,
OxyMake matches the output pattern, extracts sample = "A", substitutes
it into the input pattern to get data/A.csv, and runs the command.
Wildcards
Wildcards are placeholders in curly braces: {sample}, {cohort},
{model}. They appear in input and output file patterns.
How wildcards resolve
OxyMake uses backward chaining: start from the output you want, find which rule can produce it, extract wildcard values from the match.
You ask for: results/patient_42.txt
↓
Pattern: results/{sample}.txt
↓
Extracted: sample = "patient_42"
↓
Input becomes: data/patient_42.csv
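The same steps can be sketched in a few lines of Python. This is an illustration of the idea, not OxyMake's matcher: it assumes a wildcard matches one path segment (no slashes) and that each wildcard name appears at most once per pattern, and both function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Turn "results/{sample}.txt" into a regex with one group per wildcard."""
    # re.split with a capture group alternates literal text (even index)
    # and wildcard names (odd index).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)
        else:
            # Assumption: a wildcard value never contains "/".
            regex += f"(?P<{part}>[^/]+)"
    return re.compile(regex)


def resolve_inputs(target, output_pattern, input_patterns):
    """Match a requested file against an output pattern, then fill the inputs."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        return None  # this rule cannot produce the target
    wildcards = match.groupdict()
    return wildcards, [p.format(**wildcards) for p in input_patterns]


print(resolve_inputs(
    "results/patient_42.txt",
    "results/{sample}.txt",
    ["data/{sample}.csv"],
))
```

Returning None for a non-matching target is what lets a resolver try the next rule.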
Multiple wildcards
Rules can have multiple wildcards:
[rule.analyze]
input = ["data/{cohort}/{region}.parquet"]
output = ["results/{cohort}/{region}/report.html"]
shell = "python analyze.py {input} {output}"
Wildcard expansion from config
When you have a list of known values, put them in [config]:
[config]
samples = ["A", "B", "C"]
[rule.all]
input = ["results/{sample}.txt"]
The all rule has {sample} in its inputs but no outputs — it's an
aggregation target. OxyMake expands {sample} from config.samples
to request results/A.txt, results/B.txt, results/C.txt.
Expansion modes
When multiple wildcards expand from config lists, the expansion can be:
[config]
samples = ["A", "B"]
conditions = ["treated", "control"]
[rule.experiment]
output = ["results/{sample}_{condition}.csv"]
expand = "product" # default: A_treated, A_control, B_treated, B_control
| Mode | Behavior | Count |
|---|---|---|
| product (default) | All combinations (Cartesian product) | N × M |
| zip | Parallel pairs (lengths must match) | N |
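The difference between the two modes is the same as the difference between itertools.product and zip in Python. The sketch below uses the config values from this section; the expand helper is hypothetical and only illustrates the counting.

```python
from itertools import product


def expand(pattern, mode, **lists):
    """Fill a pattern once per combination of the given wildcard lists."""
    names = list(lists)
    if mode == "product":
        combos = product(*lists.values())
    elif mode == "zip":
        if len({len(values) for values in lists.values()}) > 1:
            raise ValueError("zip expansion requires lists of equal length")
        combos = zip(*lists.values())
    else:
        raise ValueError(f"unknown expansion mode: {mode}")
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


samples = ["A", "B"]
conditions = ["treated", "control"]
pattern = "results/{sample}_{condition}.csv"

# product: 2 x 2 = 4 outputs
print(expand(pattern, "product", sample=samples, condition=conditions))
# zip: 2 outputs, pairing A with treated and B with control
print(expand(pattern, "zip", sample=samples, condition=conditions))
```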
Wildcard constraints
Restrict which values a wildcard can take:
[rule.process]
output = ["results/{sample}.txt"]
[rule.process.wildcard_constraints]
sample = "[A-Z][a-z0-9_]*" # regex: starts with uppercase letter
Conditional guards
Rules can apply only to certain wildcard values:
[config]
special_samples = ["X1", "X2"]
[rule.extra_analysis]
input = ["results/{sample}.txt"]
output = ["extra/{sample}_analysis.html"]
when = "sample in @special_samples"
This rule exists only for samples X1 and X2. Other samples don't get the extra analysis — no phantom nodes in the graph, no skipped jobs.
Guards support: in @list, not in @list, == 'value', != 'value',
=~ 'regex'.
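As a rough mental model, a guard is a predicate over one wildcard value. The sketch below evaluates the five forms listed above under an assumed grammar (wildcard name, operator, then either @list or a single-quoted value); the real parser may accept more, and treating =~ as an unanchored search is a guess.

```python
import re

# Assumed grammar, based on the five forms listed above:
#   <wildcard> in @<list> | not in @<list> | == '<v>' | != '<v>' | =~ '<regex>'
GUARD = re.compile(
    r"(?P<name>\w+)\s+(?P<op>not in|in|==|!=|=~)\s+"
    r"(?:@(?P<list>\w+)|'(?P<value>[^']*)')"
)


def guard_allows(guard, wildcards, config):
    """Return True if the rule applies to these wildcard values."""
    match = GUARD.fullmatch(guard.strip())
    if match is None:
        raise ValueError(f"unsupported guard: {guard!r}")
    actual, op = wildcards[match["name"]], match["op"]
    if op in ("in", "not in"):
        if match["list"] is None:
            raise ValueError("membership guards need an @list operand")
        found = actual in config[match["list"]]
        return found if op == "in" else not found
    if match["value"] is None:
        raise ValueError("comparison guards need a quoted operand")
    if op == "==":
        return actual == match["value"]
    if op != "=~":
        return actual != match["value"]
    # "=~": treated here as an unanchored regex search (assumption).
    return re.search(match["value"], actual) is not None


config = {"special_samples": ["X1", "X2"]}
print(guard_allows("sample in @special_samples", {"sample": "X1"}, config))
print(guard_allows("sample in @special_samples", {"sample": "A"}, config))
```

A job is only created when the guard evaluates to true, which is why non-matching samples leave no node in the graph.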
The Four Execution Modes
| Mode | Keyword | Who manages I/O | In-memory possible |
|---|---|---|---|
| Shell | shell = "..." | You | No |
| Inline script | run = "..." | You | No |
| External script | script = "path" | You | No |
| Pure function | call = "mod:fn" | OxyMake | Yes |
Start with shell or run for quick prototyping. Migrate to call
when your function stabilizes and you want OxyMake to optimize I/O.
See Execution Modes for details.
The Three Graphs
OxyMake uses three distinct graph representations, each at a different level of abstraction. Understanding them is key to understanding how OxyMake works — and how to debug when things go wrong.
Overview
graph TD
A[Oxymakefile.toml] --> B["RuleGraph<br/><i>What you declared (abstract, compact)</i>"]
B -->|"Wildcard resolution<br/>+ guard evaluation"| C["JobGraph<br/><i>What will execute (concrete, optimized)</i>"]
C -->|"Runtime state annotation"| D["ExecGraph<br/><i>What is happening (live status)</i>"]
RuleGraph — The Logical View
The RuleGraph is what you wrote in the Oxymakefile. Each rule is a node, and edges connect rules whose output patterns match other rules' input patterns. Wildcards are NOT resolved — this is the abstract view.
A single call node represents ALL variant-call instances, not a specific one.
$ ox plan --level=rules
data ──→ features ──→ call ──→ annotate
What you can learn from the RuleGraph:
- Is my pipeline structure correct?
- Are there circular dependencies?
- Which rules depend on which?
Inspect it: ox plan --level=rules
JobGraph — The Physical Plan
The JobGraph is the RuleGraph after wildcard resolution. Every concrete
job instance is a separate node. With 3 cohorts and 4 windows, a
single features rule becomes 12 concrete jobs.
The JobGraph goes through optimization passes before execution:
| Pass | What it does |
|---|---|
| Cache pruning | Marks up-to-date jobs as "skip" |
| Task fusion | Merges sequential call-mode jobs |
| Materialization elimination | Removes unnecessary file I/O |
| Critical path analysis | Prioritizes bottleneck jobs |
These passes run internally; ox plan reports the resolved jobs after
optimization. For the 3-rule workflow from
Your First Workflow:
$ ox plan
Plan: 3 rules, 3 jobs, 2 source files
Targets: results/summary.json
1. [stats-bob] rule=stats -> [results/bob_stats.json]
2. [stats-alice] rule=stats -> [results/alice_stats.json]
3. [summary] rule=summary -> [results/summary.json]
The header line summarizes the graph (N rules, N jobs, N source files),
followed by the requested targets and the concrete jobs that would run.
What you can learn from the JobGraph:
- How many concrete jobs will execute?
- Which rule produced each job, and what outputs it writes?
- Which jobs are already cached? (re-run after a build to see fewer jobs)
Inspect it: ox plan (optimized, the default), ox plan --no-optimize
(skip the optimization passes), or ox plan --level rules to view the
RuleGraph instead of the JobGraph.
ExecGraph — The Live Execution
The ExecGraph is the JobGraph annotated with runtime state. Each node carries its status (Pending → Running → Completed/Failed), timing, and resource usage.
$ ox status --group-by stage
data 3/3 completed
features 145/3412 running (12%)
call waiting (blocked)
annotate waiting
What you can learn from the ExecGraph:
- What's running right now?
- What failed and why?
- How long has each job been running?
- Which sessions are active?
Inspect it: ox status
The Relationship
Each graph is a refinement of the previous one:
| Property | RuleGraph | JobGraph | ExecGraph |
|---|---|---|---|
| Nodes | Rules (abstract) | Concrete jobs | Jobs + status |
| Wildcards | Unresolved | Resolved | Resolved |
| Size | Small (tens) | Large (thousands) | Same as JobGraph |
| Lifetime | Static (parse time) | Static (plan time) | Dynamic (runtime) |
| Changes during run | Never | Grows (checkpoints) | Continuously |
Vocabulary
To avoid confusion, OxyMake uses these terms consistently:
- Rule = a declaration in the Oxymakefile (unresolved wildcards)
- Job = a concrete, executable instance of a rule (wildcards resolved)
- Pass = an optimization transformation on the JobGraph
- Phase = a stage of the pipeline (parse → resolve → optimize → execute)
Content-Addressable Cache
One of the most frustrating things about traditional build tools is the phantom re-run: you check out a branch, and everything rebuilds even though nothing actually changed. OxyMake eliminates this by using file content as the source of truth, not timestamps.
How It Works
Every time OxyMake runs a job, it computes a cache key from everything that could affect the output:
cache_key = blake3(
    format_version ||
    rule_source_hash ||
    sorted((input_path, input_content_hash) pairs) ||
    params_hash ||
    env_content_hash ||
    shell_executable ||
    platform
)
Every field is length-framed with a domain-separation tag, so two different job specifications never serialize to the same hash input; short of a BLAKE3 collision, they cannot share a key. If the key matches a previously computed result, the job is skipped. The key includes:
- Rule source hash -- if you change the shell command, inline code, or function reference, the cache is invalidated
- Input content hashes -- blake3 of every input file's contents, bound to its path; parameter files and (in script mode) the script file itself count as inputs, so editing script.py invalidates the cache
- Params hash -- any parameters passed via --set or [config]
- Environment content hash -- the content of the referenced spec file (requirements.txt, conda YAML, nix expression), or the container image reference for Docker/Apptainer
- Shell executable -- the same command under /bin/bash and /bin/zsh can behave differently
- Platform -- OS and architecture (a Linux build is not reusable on macOS)
Two exclusions to know about: call-mode function bodies are tracked only
if you declare the module as an input, and mutable container tags are
hashed as written (pin images by digest -- python@sha256:... -- if you
need re-pushed tags to invalidate the cache).
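The length-framing idea is easier to see in code. The sketch below is a simplified model, not OxyMake's key derivation: it uses hashlib.blake2b as a stand-in because BLAKE3 is not in Python's standard library, the tag names are invented for the example, and it frames raw bytes where the real key uses precomputed hashes.

```python
import hashlib


def frame(tag, payload):
    """Prefix the tag and the payload with their lengths before hashing."""
    # Without the length prefixes, ("ab", "c") and ("a", "bc") would
    # concatenate to the same bytes and collide.
    out = b""
    for chunk in (tag.encode(), payload):
        out += len(chunk).to_bytes(8, "little") + chunk
    return out


def cache_key(rule_source, inputs, params, env, shell, platform, format_version=1):
    """inputs maps path -> file content (bytes); other byte fields are raw."""
    h = hashlib.blake2b(digest_size=32)  # stand-in: OxyMake uses BLAKE3
    h.update(frame("format_version", str(format_version).encode()))
    h.update(frame("rule_source", rule_source))
    # Sorting makes the key independent of declaration order.
    for path, content in sorted(inputs.items()):
        h.update(frame("input_path", path.encode()))
        content_hash = hashlib.blake2b(content, digest_size=32).digest()
        h.update(frame("input_hash", content_hash))
    h.update(frame("params", params))
    h.update(frame("env", env))
    h.update(frame("shell", shell.encode()))
    h.update(frame("platform", platform.encode()))
    return h.hexdigest()


key = cache_key(b"sort", {"data/A.csv": b"alpha,1\n"}, b"", b"",
                "/bin/bash", "linux-x86_64")
print(key)
```

Changing any single field, including moving a byte from one field to its neighbor, yields a different key.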
Why Not Timestamps?
Timestamps lie. Here are common situations where they cause phantom re-runs in tools like Make or Snakemake:
| Scenario | What happens to mtime | Content changed? |
|---|---|---|
| git checkout | Reset to now | No |
| cp without -p | Reset to now | No |
| NFS clock skew | Arbitrary | No |
| CI fresh clone | All files are "new" | No |
| touch command | Updated | No |
Validation Strategies (ADR-006)
OxyMake's cache validation is pluggable — you choose the right speed/correctness tradeoff for your workflow:
| Strategy | Flag | Behavior |
|---|---|---|
| mtime+hash (default) | --cache-validation=mtime+hash | If mtime/size differ, compute BLAKE3 hash. Fast on steady-state, correct on change. |
| mtime (opt-in) | --cache-validation=mtime | Pure filesystem metadata (stat calls only). Fastest, but never verifies content; unsuitable for shared/multi-user caches. |
| hash | --cache-validation=hash | Always compute BLAKE3 hash. Bit-exact. Required for shared/remote caches. |
ox run # default: mtime+hash (fast + content-verifying)
ox run --cache-validation=mtime # Make-parity opt-in (no content check)
ox run --cache-validation=hash # strict mode (CI)
OX_CACHE_VALIDATION=hash ox run # via environment variable
Configure per project in Oxymakefile.toml:
[config]
cache_validation = "mtime+hash"
Remote caches automatically promote to hash regardless of the configured
strategy, because mtime is not meaningful across machines.
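The three strategies differ only in when the content hash is consulted. The sketch below models that decision for a single file; it is an approximation of the behavior described in the table, with hashlib.blake2b standing in for BLAKE3 and a hypothetical record layout (mtime_ns, size, hash).

```python
import hashlib
import os


def content_hash(path):
    h = hashlib.blake2b(digest_size=32)  # stand-in: OxyMake uses BLAKE3
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def fingerprint(path):
    """What a previous run would have recorded for this file (hypothetical)."""
    st = os.stat(path)
    return {"mtime_ns": st.st_mtime_ns, "size": st.st_size,
            "hash": content_hash(path)}


def is_unchanged(path, recorded, strategy="mtime+hash"):
    st = os.stat(path)
    same_meta = (st.st_mtime_ns, st.st_size) == (
        recorded["mtime_ns"], recorded["size"])
    if strategy == "mtime":
        return same_meta  # metadata only, content is never read
    if strategy == "mtime+hash" and same_meta:
        return True  # fast path: nothing looks different
    # "hash", or "mtime+hash" with differing metadata: compare content.
    return content_hash(path) == recorded["hash"]
```

Under this model a touch on an unchanged file still counts as unchanged for mtime+hash and hash, while mtime reports a change.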
The Cache on Disk
Cached outputs live in .oxymake/cache/, organized by hash prefix:
.oxymake/cache/
  a3/
    a3f7b2c1...    # cached output file
  b1/
    b1e9d4a8...    # another cached output
This directory is independent of the SQLite state database. You can share it across machines, back it up, or delete it without losing execution state (jobs will simply re-run and repopulate the cache).
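Judging from the listing above, an entry's location follows from its hash alone: a directory named after the first two hex characters, then the full digest. A minimal sketch of that mapping, with cache_path as a hypothetical helper:

```python
from pathlib import PurePosixPath


def cache_path(digest_hex, root=".oxymake/cache"):
    """Place an entry under a two-character prefix directory."""
    # Assumption: the prefix is the first two hex characters, as the
    # a3/ and b1/ directories in the listing suggest.
    return PurePosixPath(root) / digest_hex[:2] / digest_hex


print(cache_path("a3f7b2c1"))
```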
Sharing Across Machines
Because the cache key is deterministic -- same inputs, same rule, same environment, same platform produce the same key -- you can share cached outputs via S3, GCS, or any shared filesystem:
# Production: everything cached locally
ox run
# CI: pull from shared remote cache
ox run --cache-remote s3://my-bucket/oxymake-cache
For remote caches, OxyMake adds a trust_scope to prevent cache poisoning:
cached outputs from untrusted branches cannot be used by production builds.
Cache and Materialization
The cache interacts with the materialization policy:
| Policy | Written to disk? | Cached? |
|---|---|---|
| always (default) | Yes | Yes |
| auto | Only if needed | Yes, when materialized |
| never | No (memory only) | No |
| final | Only if DAG leaf | Yes, when materialized |
Outputs with materialize = "never" are kept in memory and never enter the
cache. This is a deliberate trade-off: you get speed at the cost of
reproducibility. The next ox run will recompute them.
Managing the Cache
# See cache size
ox gc --dry-run
# Limit cache to 10 GB (removes oldest entries)
ox gc --max-cache-size 10G
# Remove all cached outputs
ox clean --cache
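A size cap with oldest-first removal can be pictured as follows. This is a sketch of the general technique, not of ox gc itself: whether "oldest" means creation time or last use is not stated here, so the timestamp field is left generic, and plan_eviction is a hypothetical name.

```python
def plan_eviction(entries, max_bytes):
    """entries: list of (name, size_bytes, timestamp). Returns names to remove."""
    total = sum(size for _, size, _ in entries)
    evicted = []
    # Walk from oldest to newest and stop as soon as the cache fits.
    for name, size, _ in sorted(entries, key=lambda entry: entry[2]):
        if total <= max_bytes:
            break
        evicted.append(name)
        total -= size
    return evicted


entries = [("a3f7", 5, 100), ("b1e9", 5, 200), ("c2d0", 5, 300)]
print(plan_eviction(entries, max_bytes=10))
```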
Why This Matters
The content-addressable cache means you can:
- Switch branches freely without phantom re-runs
- Add new rules without invalidating existing cached results
- Share computation across machines and CI
- Resume interrupted runs -- completed work is preserved
- Trust the result -- if OxyMake says "cached," the output is bit-for-bit identical to what a fresh run would produce
Materialization Policy
When a call-mode rule produces an output, does it need to be written to
disk? Not always. OxyMake lets you control this with the materialization
policy, enabling significant speedups for workflows where intermediate
outputs are only consumed by other call-mode rules.
The Four Policies
| Policy | Behavior |
|---|---|
| always | (default) Write to disk after every job. Reproducible, cacheable. |
| auto | Write to disk only if a downstream job needs a file (not a call peer) |
| never | Keep in memory only. Lost if the process dies. Not cached. |
| final | Write to disk only if this output is a leaf of the DAG (a final result) |
Declaring Materialization
Set the policy on individual outputs:
[rule.compute_features]
output = [
    { path = "features/{sample}.parquet", format = "parquet", materialize = "auto" },
]
call = "pipeline.features:compute_features"
lang = "python"
[rule.train_model]
output = [
    { path = "models/{sample}.pkl", format = "pickle", materialize = "always" },
]
call = "pipeline.model:train"
lang = "python"
In this example, the features DataFrame is only written to disk if a
non-call downstream rule needs it as a file. The model is always saved.
Setting the Policy per Output
Materialization is declared per output in the Oxymakefile, on the structured output form:
[rule.compute_features]
# ...
output = [
{ path = "data/features.parquet", materialize = "auto" },
]
Valid values are auto (the default — write to disk only when a downstream
file consumer needs it), never (keep in memory; no disk, no caching),
final (write only leaf outputs), and always (write and cache every
output). There is no global ox run flag that overrides the policy today;
control it in the Oxymakefile per output.
Guidance for development workflows:
- During prototyping, set materialize = "never" on intermediate outputs to iterate fast
- For production, use the default auto (or always) for full caching and reproducibility
- For presentations or reports, set leaf outputs to final
How It Works with call Mode
When two consecutive rules both use call mode on the local executor,
OxyMake can pass data directly in memory:
compute_features ──[DataFrame in memory]──> train_model
(call) (call)
No file is written between them. The format field tells OxyMake how to
serialize the data if materialization is needed later (e.g., for caching
or for a shell-mode downstream rule).
The Flow
- compute_features runs and returns a DataFrame
- If materialize = "auto" and the next consumer is also call mode: pass the DataFrame directly in memory
- If materialize = "auto" and the next consumer is shell mode: write the DataFrame to disk using the parquet codec
- If materialize = "always": always write to disk (and cache)
- If materialize = "never": never write to disk (no cache)
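The branches above can be sketched as a single decision function. This is a sketch under stated assumptions, not OxyMake's implementation: the name `writes_to_disk` and the representation of consumers as a list of mode strings are hypothetical, and how `auto` treats an output with no consumers is not specified here, so that case is left out of the examples.

```python
def writes_to_disk(policy: str, consumer_modes: list) -> bool:
    """Return True if an output should be written to disk.

    consumer_modes lists the execution mode of each downstream rule
    ("shell", "run", "script" or "call"); an empty list means the
    output is a leaf of the DAG.
    """
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "final":
        return not consumer_modes  # only leaves are written
    if policy == "auto":
        # Any consumer that is not a call-mode peer needs a real file.
        return any(mode != "call" for mode in consumer_modes)
    raise ValueError(f"unknown materialization policy: {policy!r}")
```

For example, `writes_to_disk("auto", ["call"])` is `False`, while `writes_to_disk("auto", ["call", "shell"])` is `True` because one consumer needs a file path.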
Constraints
Policies other than always are not supported everywhere:
- shell, run, and script modes always materialize. They manage their own I/O and need real files.
- Distributed executors (SLURM, K8s) force materialization because jobs run on separate machines.
- Non-materialized outputs are not cached. If the process dies or you restart, they will be recomputed. This is an explicit trade-off: speed vs. reproducibility.
The --materialize Flag
The CLI flag sets the floor for materialization:
| Flag value | Effect |
|---|---|
| always | All outputs written to disk (default behavior) |
| auto | Per-output policy respected |
| never | No outputs written (memory only, for testing) |
| final | Only DAG-leaf outputs written |
Practical Example
Consider a three-stage pipeline:
[rule.load_data]
output = [{ path = "data/{s}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline:load_data"
lang = "python"
[rule.compute_features]
output = [{ path = "features/{s}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline:compute_features"
lang = "python"
[rule.generate_report]
output = [{ path = "reports/{s}.html", materialize = "always" }]
call = "pipeline:generate_report"
lang = "python"
With ox run --materialize=final:
- load_data output: kept in memory (not a leaf)
- compute_features output: kept in memory (not a leaf)
- generate_report output: written to disk (it is a leaf)
Only the final HTML report touches the filesystem. The intermediate parquet files exist only in memory during execution.
Tags and Filtering
Tags let you organize rules into logical groups and selectively run subsets of your workflow.
Assigning Tags
Add tags to any rule in your Oxymakefile.toml:
[rule.align]
input = ["data/{sample}.fastq"]
output = ["aligned/{sample}.bam"]
shell = "bwa mem ref.fa {input} | samtools sort > {output}"
tags = ["alignment", "compute-heavy"]
[rule.qc]
input = ["aligned/{sample}.bam"]
output = ["qc/{sample}_report.html"]
shell = "fastqc {input} -o qc/"
tags = ["qc", "fast"]
Filtering by Tag
Run only jobs matching a tag:
ox run --tag alignment # Only alignment jobs
ox run --tag qc # Only QC jobs
ox run --tag compute-heavy # Only compute-heavy jobs
Exclude jobs by tag:
ox run --exclude-tag slow # Skip slow jobs
Tag-Based DAG Views
Tags integrate with the DAG visualization:
ox dag --group-by tag # Group nodes by tag in the DAG view
ox plan --tag alignment # Show plan for alignment jobs only
Hierarchical Organization
Use dotted tag names for hierarchy:
tags = ["pipeline.alignment", "resource.gpu"]
This enables filtering at different levels:
ox run --tag "pipeline.*" # All pipeline stages
ox run --tag "resource.gpu" # Only GPU jobs
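One way to picture the include and exclude filters is shell-style pattern matching over each rule's tags. The sketch below uses Python's `fnmatch` for that; the function `select_rules` and the exact matching semantics are assumptions made for illustration, since the precise pattern rules of --tag are not spelled out here.

```python
from fnmatch import fnmatchcase
from typing import Optional


def select_rules(rule_tags: dict, include: Optional[str] = None,
                 exclude: Optional[str] = None) -> list:
    """Return rule names with a tag matching `include` and none matching `exclude`."""
    selected = []
    for name, tags in rule_tags.items():
        if include is not None and not any(fnmatchcase(t, include) for t in tags):
            continue
        if exclude is not None and any(fnmatchcase(t, exclude) for t in tags):
            continue
        selected.append(name)
    return selected


# Hypothetical rules and tags, for illustration only.
rules = {
    "align": ["pipeline.alignment", "resource.gpu"],
    "qc": ["pipeline.qc"],
    "download": ["io"],
}
pipeline_rules = select_rules(rules, include="pipeline.*")
gpu_rules = select_rules(rules, include="resource.gpu")
non_gpu_rules = select_rules(rules, exclude="resource.gpu")
```

Here `pipeline_rules` contains `align` and `qc`, and `gpu_rules` contains only `align`.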
Use Cases
- Selective re-runs: Re-run only QC after parameter changes
- Resource-based scheduling: Tag GPU vs CPU jobs for different executors
- Stage grouping: Organize large workflows into logical phases
- Development iteration: Run only the stage you are working on
Next Steps
- The Three Graphs -- how tags affect DAG visualization
- Execution Modes -- how jobs are executed
Execution Modes
OxyMake supports four ways to execute a rule, forming a spectrum from maximum flexibility to maximum optimization. All four modes coexist in the same workflow -- you pick the right one for each rule.
The Spectrum
shell Opaque, files only, maximum flexibility
run Inline script, files only, author manages I/O
script External script, files only, author manages I/O
call Pure function, files OR memory, OxyMake manages I/O
As you move from shell to call, OxyMake gains more optimization power
(in-memory data passing, task fusion, automatic serialization) -- but you
give up direct control over I/O.
Mode 1: shell -- Command Line
The most flexible mode. You write a shell command, and OxyMake interpolates file paths into it.
[rule.align]
input = ["data/{sample}.fastq", "refs/genome.fa"]
output = ["results/{sample}.bam"]
shell = "bwa mem -t {resources.cpu} {input[1]} {input[0]} > {output}"
resources = { cpu = 8 }
Use shell when you are wrapping an existing command-line tool. OxyMake
treats the command as a black box -- it just passes file paths and checks
that outputs were created.
Mode 2: run -- Inline Script
Write a short script directly in the Oxymakefile. OxyMake interpolates
{input} and {output} as file paths.
[rule.analyze]
input = ["data/{sample}.csv"]
output = ["results/{sample}.json"]
lang = "python"
run = """
import pandas as pd
import json
df = pd.read_csv("{input}")
stats = df.describe().to_dict()
with open("{output}", "w") as f:
    json.dump(stats, f)
"""
Use run for rapid prototyping -- when the logic is short enough to live
in the workflow file. You manage all file I/O yourself.
Mode 3: script -- External Script
Like run, but the code lives in a separate file. Keeps the Oxymakefile
clean when scripts are long.
[rule.transform]
input = ["data/{sample}.parquet"]
output = ["results/{sample}.parquet"]
script = "scripts/transform.py"
environment = { uv = "pyproject.toml" }
The script receives file paths via command-line arguments or environment variables.
Mode 4: call -- Pure Function
The key innovation. Your function receives objects, not file paths, and returns objects. OxyMake handles all I/O outside the function.
[rule.compute_features]
input = [{ path = "data/{sample}.parquet", format = "parquet" }]
output = [{ path = "features/{sample}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline.features:compute_features"
lang = "python"
The Python function is pure:
import polars as pl
def compute_features(df: pl.DataFrame) -> pl.DataFrame:
    return df.with_columns(
        mean_depth=pl.col("depth").rolling_mean(20),
        depth_std=pl.col("depth").rolling_std(60),
    )
The function never reads or writes files. OxyMake:
1. Reads the input file using the parquet codec, producing a DataFrame
2. Calls compute_features(df) and receives the result
3. Writes the result to disk using the parquet codec (if materialization policy requires it)
In memory mode (when both upstream and downstream are call rules on
the local executor), step 1 receives the DataFrame directly from the
upstream job and step 3 passes it directly to the downstream job -- zero
disk I/O.
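The read, call, write cycle around a pure function can be sketched in a few lines. To stay within the standard library, this sketch uses a JSON codec instead of parquet; the `CODECS` registry, `run_call_job`, and its `upstream` and `materialize` parameters are hypothetical names chosen for illustration, not OxyMake's API.

```python
import json
import os
import tempfile


def read_json(path):
    with open(path) as f:
        return json.load(f)


def write_json(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f)


# Hypothetical codec registry: format name -> (reader, writer).
CODECS = {"json": (read_json, write_json)}


def run_call_job(func, in_path, out_path, fmt, upstream=None, materialize=True):
    """Load the input (or take it from memory), call func, optionally write."""
    reader, writer = CODECS[fmt]
    value = upstream if upstream is not None else reader(in_path)  # step 1
    result = func(value)                                           # step 2
    if materialize:                                                # step 3
        writer(result, out_path)
    return result


def double_all(values):
    """A pure function: values in, values out, no file access."""
    return [v * 2 for v in values]


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.json")
    dst = os.path.join(tmp, "out.json")
    write_json([1, 2, 3], src)
    from_disk = run_call_job(double_all, src, dst, "json")
    on_disk = read_json(dst)
    # Memory mode: the upstream result is handed over directly, nothing is written.
    in_memory = run_call_job(double_all, src, dst, "json",
                             upstream=from_disk, materialize=False)
```

The function under test never changes between the two calls; only the wrapper decides whether data comes from a file or from memory.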
Named Arguments
For functions with multiple inputs, use named inputs:
[rule.train_model]
input = { features = "features/{sample}.parquet", config = "configs/model.yaml" }
output = { model = "models/{sample}.pkl" }
call = "pipeline.model:train"
lang = "python"
def train(features: pl.DataFrame, config: dict) -> Model:
    ...
The input keys (features, config) map to function parameter names.
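A minimal sketch of that key-to-parameter mapping, using `inspect.signature` to validate the names before calling. The helper `bind_named_inputs` is hypothetical, and the toy `train` below takes plain Python values rather than a DataFrame so the example runs without extra dependencies.

```python
import inspect


def bind_named_inputs(func, loaded_inputs: dict) -> dict:
    """Check that input keys line up with func's parameters; return kwargs."""
    params = inspect.signature(func).parameters
    unknown = set(loaded_inputs) - set(params)
    if unknown:
        raise TypeError(f"inputs {sorted(unknown)} have no matching parameter")
    missing = [
        name for name, p in params.items()
        if p.default is inspect.Parameter.empty and name not in loaded_inputs
    ]
    if missing:
        raise TypeError(f"missing inputs for parameters {missing}")
    return dict(loaded_inputs)


def train(features, config):
    # Toy stand-in for a real training function.
    return {"n_rows": len(features), "lr": config["lr"]}


kwargs = bind_named_inputs(train, {"features": [1, 2, 3], "config": {"lr": 0.1}})
model = train(**kwargs)
```

A misspelled input key (for example `cfg` instead of `config`) is rejected before the function runs, which is the kind of error a declarative workflow file makes easy to catch early.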
When to Use Each Mode
| Situation | Recommended mode |
|---|---|
| Wrapping an existing CLI tool | shell |
| Quick one-off analysis | run |
| Reusable script, too long for inline | script |
| Pure data transformation, wants optimization | call |
| Prototyping (will refactor later) | run then migrate to call |
The Migration Path
The natural evolution of a rule:
1. Start with run: Write inline code during exploration
2. Extract to script: When the code gets long, move it to a file
3. Refactor to call: When the function stabilizes, make it pure and let OxyMake manage I/O
Each step is backward-compatible -- the outputs are the same files. The cache key changes (because the rule source changes), so the first run after migration will recompute, but subsequent runs benefit from the optimization.
Interaction with Executors
| Mode | Local executor | SLURM/K8s executor |
|---|---|---|
| shell | Subprocess | Remote submission |
| run | Subprocess | Remote submission |
| script | Subprocess | Remote submission |
| call (memory) | In-process via Arrow IPC | Forced to materialize |
| call (file) | Subprocess + codec | Remote submission + codec |
Distributed executors (SLURM, K8s) cannot pass objects in memory between
machines, so they automatically force call mode to materialize. Your
workflow does not need to change -- OxyMake handles this transparently.
Environments
Real-world workflows need specific software packages, library versions, and runtime configurations. OxyMake supports multiple environment backends that isolate each rule's execution in a reproducible environment.
Declaring an Environment
Add an environment field to any rule:
[rule.analyze]
input = ["data/{sample}.csv"]
output = ["results/{sample}.json"]
lang = "python"
environment = { uv = "pyproject.toml" }
run = """
import pandas as pd
df = pd.read_csv("{input}")
df.describe().to_json("{output}")
"""
The environment is resolved at execution time. OxyMake ensures the environment is set up before the rule runs.
Supported Backends
uv (Python)
The recommended backend for Python workflows. Uses
uv to create and manage virtual environments
from a pyproject.toml or requirements.txt.
environment = { uv = "pyproject.toml" }
OxyMake calls uv sync to ensure the environment matches the lockfile.
The environment hash (from uv.lock) is included in the cache key, so
changing a dependency invalidates affected outputs.
conda
For workflows that need non-Python packages (C libraries, R, etc.):
environment = { conda = "environment.yaml" }
OxyMake creates or updates a conda environment from the YAML specification.
Docker / OCI Containers
For maximum isolation and reproducibility:
environment = { docker = "python:3.11-slim" }
The job runs inside a container. OxyMake mounts the workspace and handles input/output file staging. The image digest is included in the cache key.
Nix
For fully reproducible builds with Nix:
environment = { nix = "flake.nix#devShell" }
Apptainer (Singularity)
For HPC environments where Docker is unavailable:
environment = { apptainer = "image.sif" }
System (default)
No isolation. Uses whatever Python/R/tools are on $PATH:
environment = { system = true }
This is the default when no environment is specified. Suitable for
shell-mode rules that call system utilities.
How Isolation Works
Each environment backend follows the same lifecycle:
1. Resolve: Determine the exact environment specification (lockfile hash, image digest, flake hash)
2. Prepare: Create or update the environment if needed (uv sync, docker pull, conda env create)
3. Execute: Run the job inside the environment
4. Hash: Include the environment specification hash in the cache key
The key insight is step 4: the environment specification is part of the
cache key. If you update a dependency in pyproject.toml and the lockfile
changes, all rules using that environment will be recomputed.
Mixing Environments
Different rules can use different environments in the same workflow:
[rule.download]
environment = { system = true }
shell = "wget {url} -O {output}"
[rule.analyze]
environment = { uv = "pyproject.toml" }
call = "analysis:run"
lang = "python"
[rule.visualize]
environment = { conda = "envs/plotting.yaml" }
script = "scripts/plot.R"
OxyMake manages each environment independently. There is no requirement that all rules share the same environment.
Environment and Executors
| Executor | Environment handling |
|---|---|
| Local | Environment resolved on the local machine |
| SLURM | Environment must be available on compute nodes |
| K8s | Docker image used as the pod container |
| Ray | Environment resolved on Ray worker nodes |
For SLURM, ensure that conda environments or uv projects are accessible from the compute nodes (e.g., on a shared filesystem).
Environment Caching
Environment setup can be slow (minutes for large conda environments). OxyMake caches the prepared environment and only re-creates it when the specification changes:
- uv: Rebuilds when uv.lock changes
- conda: Rebuilds when environment.yaml changes
- Docker: Re-pulls when the image tag resolves to a new digest
- Nix: Rebuilds when the flake lock changes
This means the first run may be slow (environment setup), but subsequent runs reuse the prepared environment instantly.
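The "rebuild only when the specification changes" behavior can be sketched with a stamp file that records the hash of the spec used for the last successful setup. Everything here is an assumption for illustration: the names `spec_hash` and `ensure_environment`, the stamp-file mechanism, and the use of SHA-256.

```python
import hashlib
import os
import tempfile


def spec_hash(spec_path: str) -> str:
    with open(spec_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def ensure_environment(spec_path: str, stamp_path: str, prepare) -> bool:
    """Call prepare() only if the spec changed since the last setup.

    Returns True when the environment was (re)built.
    """
    current = spec_hash(spec_path)
    if os.path.exists(stamp_path):
        with open(stamp_path) as f:
            if f.read() == current:
                return False  # prepared environment is still valid
    prepare()  # e.g. a wrapper around `uv sync`
    with open(stamp_path, "w") as f:
        f.write(current)
    return True


calls = []
with tempfile.TemporaryDirectory() as tmp:
    lock = os.path.join(tmp, "uv.lock")
    stamp = os.path.join(tmp, "env.stamp")
    with open(lock, "w") as f:
        f.write("pandas==2.2.0\n")
    first = ensure_environment(lock, stamp, lambda: calls.append("sync"))
    second = ensure_environment(lock, stamp, lambda: calls.append("sync"))
    with open(lock, "w") as f:
        f.write("pandas==2.2.1\n")  # dependency bump
    third = ensure_environment(lock, stamp, lambda: calls.append("sync"))
```

The second call is a no-op; only the first run and the run after the lockfile change pay the setup cost.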
Executors
OxyMake separates what to run (rules, DAG) from where to run it
(executors). The same workflow runs on a laptop or a thousand-node cluster
with zero changes -- just switch the --executor flag.
Available Executors
| Executor | Flag | Backend | GPU | Memory Passing |
|---|---|---|---|---|
| Local | --executor local (default) | Tokio thread pool | OS-level | Same-process |
| SLURM | --executor slurm | sbatch / sacct | GRES | Shared filesystem |
| Ray | --executor ray | Ray Jobs API | First-class | Object store (zero-copy) |
| Kubernetes | --executor k8s | kube-rs (planned) | Device plugin | -- |
Local Executor
The default. Runs jobs as subprocesses on the local machine.
ox run # single job at a time
ox run -j 8 # 8 parallel jobs
Best for development, small pipelines, and single-node execution.
SLURM Executor
Submits jobs to an HPC cluster via sbatch and polls status with sacct.
ox run --executor slurm
Features:
- Job arrays for wildcard expansions
- GPU scheduling via GRES
- Resource mapping: cpu, mem, gpu map to SLURM --cpus-per-task, --mem, --gres=gpu:N
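As an illustration of the resource mapping, the sketch below turns a rule's resources into sbatch flags. The function name `sbatch_resource_flags` is hypothetical, and the value passed for mem is assumed to already be in a form sbatch accepts (such as "16G"); how OxyMake normalizes memory units is not covered here.

```python
def sbatch_resource_flags(resources: dict) -> list:
    """Translate cpu / mem / gpu entries into sbatch command-line flags."""
    flags = []
    if "cpu" in resources:
        flags.append(f"--cpus-per-task={resources['cpu']}")
    if "mem" in resources:
        flags.append(f"--mem={resources['mem']}")
    if resources.get("gpu"):
        flags.append(f"--gres=gpu:{resources['gpu']}")
    return flags


flags = sbatch_resource_flags({"cpu": 8, "mem": "16G", "gpu": 2})
```

A rule that requests no GPU simply produces no --gres flag.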
Ray Executor
Submits jobs to a Ray cluster via the Ray Jobs API. Ray provides elastic distributed execution with a shared object store for fast intermediate data passing.
Setup
Start a Ray head node (or connect to an existing cluster):
ray start --head
# Dashboard: http://127.0.0.1:8265
Run the workflow:
ox run --executor ray
Configuration
Configure the Ray executor in .oxymake/config.toml or Oxymakefile.toml:
[executor.ray]
dashboard_address = "http://127.0.0.1:8265"
working_dir = "/shared/oxymake"
poll_interval_min = "2s"
poll_interval_max = "30s"
max_submit = 10
| Setting | Default | Description |
|---|---|---|
| dashboard_address | http://127.0.0.1:8265 | Ray dashboard URL |
| working_dir | . | Staging directory on shared filesystem |
| poll_interval_min | 2s | Minimum status polling interval |
| poll_interval_max | 30s | Maximum status polling interval |
| max_submit | unlimited | Max concurrent job submissions |
| autoscaler_aware | false | Query cluster capacity before submitting |
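The table does not say how the polling interval moves between its minimum and maximum. One common scheme is exponential backoff, sketched below; treat the doubling factor and the generator `poll_delays` as assumptions rather than a description of what the Ray executor actually does.

```python
import itertools


def poll_delays(minimum: float = 2.0, maximum: float = 30.0, factor: float = 2.0):
    """Yield successive polling delays in seconds, growing up to `maximum`."""
    delay = minimum
    while True:
        yield delay
        delay = min(delay * factor, maximum)


first_six = list(itertools.islice(poll_delays(), 6))
```

With the default bounds the delays grow from 2 s and then stay capped at 30 s, which keeps short runs responsive without hammering the dashboard during long ones.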
Resource Mapping
| OxyMake | Ray | Notes |
|---|---|---|
| cpu | num_cpus | Direct mapping |
| mem | memory | Bytes |
| gpu | num_gpus | Fractional GPUs supported (gpu = 0.5) |
| custom:* | Custom resources | Arbitrary Ray custom resources |
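The mapping in this table can be sketched as a function that builds the keyword arguments for a Ray task's `.options(...)` call. The helper `ray_task_options` is hypothetical and does not import Ray; it only shows the shape of the translation, assuming mem is already expressed in bytes and that the part after `custom:` is used as the Ray resource name.

```python
def ray_task_options(resources: dict) -> dict:
    """Build kwargs for `.options(...)` from OxyMake-style resources."""
    options = {}
    custom = {}
    for name, amount in resources.items():
        if name == "cpu":
            options["num_cpus"] = amount
        elif name == "mem":
            options["memory"] = amount  # bytes
        elif name == "gpu":
            options["num_gpus"] = amount  # may be fractional
        elif name.startswith("custom:"):
            custom[name.split(":", 1)[1]] = amount
    if custom:
        options["resources"] = custom
    return options


opts = ray_task_options({"cpu": 4, "mem": 8 * 1024**3, "gpu": 0.5, "custom:tpu": 1})
```

The resulting dictionary could be passed as `some_task.options(**opts)` in a driver script.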
Memory Passing
When two consecutive call-mode rules run on the Ray executor, data passes
through Ray's object store without disk writes. OxyMake's materialization
policies map to Ray behavior:
| Policy | Ray Behavior |
|---|---|
| always | Write to shared FS + object store |
| auto | Object store only (materialized if downstream needs file) |
| never | Object store only, evicted after consumers finish |
| final | Object store, written to shared FS only for DAG leaves |
Execution Modes
The Ray executor supports all four execution modes:
- shell -- commands run as Ray job entrypoints
- run -- inline scripts submitted as Ray jobs
- script -- external scripts submitted as Ray jobs
- call -- Python functions with object store integration
Choosing an Executor
| Use Case | Recommended Executor |
|---|---|
| Development / CI | Local |
| HPC cluster (static allocation) | SLURM |
| Cloud / elastic GPU clusters | Ray |
| ML pipelines with in-memory passing | Ray |
| Kubernetes-native environments | K8s (planned) |
Mixed-Executor DAGs
OxyMake owns the DAG; executors are job-dispatch backends. A future enhancement will allow per-rule executor assignment, enabling mixed-executor DAGs where some rules run locally and others dispatch to Ray or SLURM.
Next Steps
- Execution Modes -- the four ways rules execute
- Materialization Policy -- controlling disk I/O
- Configuration -- project settings
OxyMake × Ray Deep Dive
OxyMake and Ray solve different halves of the distributed compute problem. OxyMake owns the what: which jobs to run, in what order, and what can be skipped. Ray owns the where: which machine, which GPU, how many cores. This page explains how the two systems fit together.
The Three Graphs Meet Ray
Before any executor sees a job, OxyMake transforms the user's declarations through three graph representations. Understanding this pipeline is essential for understanding what Ray actually receives.
Graph Transformation Pipeline
flowchart TD
A["Oxymakefile.toml<br/><i>Declarative TOML</i>"] --> B["RuleGraph<br/><i>Abstract: wildcards intact</i>"]
B -->|"Wildcard resolution<br/>+ guard evaluation"| C["JobGraph<br/><i>Concrete: every job instance</i>"]
C -->|"Optimization passes"| D["Optimized JobGraph"]
D -->|"Cache pruning removes<br/>up-to-date jobs"| E["Uncached Subgraph"]
E -->|"generate_driver()"| F["Python Driver Script<br/><i>@ray.remote tasks +<br/>ObjectRef chaining</i>"]
F -->|"Ray Jobs API<br/>POST /api/jobs/"| G["Ray Cluster<br/><i>Distributed execution</i>"]
style A fill:#f9f,stroke:#333
style F fill:#ff9,stroke:#333
style G fill:#9ff,stroke:#333
RuleGraph — What You Wrote
The RuleGraph is the abstract view: each rule is a node, wildcards are
unresolved. A single features rule represents ALL feature instances.
data ──→ features ──→ call ──→ annotate
JobGraph — What Will Execute
After wildcard resolution, each concrete job is a separate node. With 3
cohorts and 4 windows, a single features rule becomes 12 concrete
jobs. The JobGraph is bipartite — job nodes and output nodes alternate:
graph LR
subgraph "Bipartite JobGraph"
J1["job: align-A"] -->|produces| O1["output: results/A.bam"]
O1 -->|consumed by| J2["job: sort-A"]
J2 -->|produces| O2["output: results/A.sorted.bam"]
J3["job: align-B"] -->|produces| O3["output: results/B.bam"]
O3 -->|consumed by| J4["job: sort-B"]
J4 -->|produces| O4["output: results/B.sorted.bam"]
end
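The step from one abstract rule to many concrete jobs is essentially a Cartesian product over wildcard values. The sketch below reproduces the 3 cohorts × 4 windows = 12 jobs arithmetic; the `expand` helper, the path pattern, and the wildcard values are hypothetical, and real resolution also involves backward chaining and guard evaluation, which this ignores.

```python
from itertools import product


def expand(pattern: str, **wildcards) -> list:
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[n] for n in names))
    ]


jobs = expand(
    "features/{cohort}_{window}.parquet",
    cohort=["c1", "c2", "c3"],
    window=[5, 10, 20, 60],
)
```

Each of the 12 resulting paths corresponds to one output node in the bipartite JobGraph, with one job node producing it.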
Optimization Passes
Before any executor sees the graph, OxyMake runs optimization passes:
| Pass | Effect |
|---|---|
| Cache pruning | Marks up-to-date jobs as "skip" |
| Task fusion | Merges sequential call-mode jobs into one |
| Materialization elimination | Removes unnecessary disk I/O |
| Critical path analysis | Annotates the longest chain for priority |
These passes run internally. ox plan reports the jobs that remain after
pruning, in the standard plan format -- for a large, mostly-cached pipeline:
Plan: 12 rules, 847 jobs, 1203 source files
Only the uncached subgraph is sent to Ray.
Ray Job Packaging
Why One Ray Job, Not N
OxyMake could submit each task as a separate Ray job. Instead, it generates
a single Python driver script that encodes the entire uncached DAG as
@ray.remote tasks with ObjectRef dependency chaining.
flowchart LR
subgraph "OxyMake (Rust)"
A["Optimized JobGraph<br/>847 uncached jobs"] -->|"driver_script.rs<br/>generate_driver()"| B["driver.py<br/>~500 lines"]
end
subgraph "Ray Cluster"
B -->|"Jobs API<br/>1 submission"| C["Ray Driver Process"]
C --> D["@ray.remote task 1"]
C --> E["@ray.remote task 2"]
C --> F["@ray.remote task 3"]
C --> G["..."]
C --> H["@ray.remote task N"]
D -.->|ObjectRef| E
D -.->|ObjectRef| F
E -.->|ObjectRef| H
F -.->|ObjectRef| H
end
style B fill:#ff9,stroke:#333
style C fill:#9ff,stroke:#333
Benefits of single-job packaging:
| Benefit | Why |
|---|---|
| Fire-and-forget | Submit once, Ray handles all scheduling |
| ObjectRef chaining | Upstream outputs become implicit dependencies |
| Ray parallelism | Ray's internal scheduler optimizes task placement |
| Cascading cancel | ray job stop cascades to all tasks |
| Dashboard visibility | One job with N tasks and a colored progress bar |
| Reduced API load | One HTTP submission instead of hundreds |
Generated Driver Structure
The Rust code in ox-exec-ray/src/driver_script.rs generates Python that
looks like this:
import ray
import subprocess
import time
import json
ray.init()
@ray.remote
def run_shell(job_id, command, work_dir, *deps):
    """Run a shell command. *deps are ObjectRefs; Ray waits for them."""
    result = subprocess.run(command, shell=True, cwd=work_dir, ...)
    if result.returncode != 0:
        raise RuntimeError(f"Job {job_id} failed")
    return result.returncode
@ray.remote
def run_call(job_id, module, func_name, *deps):
    """Run a call-mode function with object store integration."""
    # ray.get() inputs from object store
    # invoke function
    # ray.put() outputs back to object store
    ...
# --- DAG encoded as ObjectRef chain ---
# Topological order, upstream refs passed as implicit dependencies
ref_0 = run_shell.options(num_cpus=8).remote(
    "align-A", "bwa mem ...", "/project"
)
ref_1 = run_shell.options(num_cpus=2).remote(
    "sort-A", "samtools sort ...", "/project",
    ref_0  # ← dependency: Ray won't start until ref_0 completes
)
ref_2 = run_shell.options(num_cpus=8).remote(
    "align-B", "bwa mem ...", "/project"
)
ref_3 = run_call.options(num_cpus=4, num_gpus=1).remote(
    "train", "pipeline.model", "train",
    ref_1, ref_2  # ← depends on both sort-A and align-B
)
# --- Collect results ---
results = {}
for ref, job_id in [(ref_0, "align-A"), (ref_1, "sort-A"), ...]:
    try:
        ray.get(ref)
        results[job_id] = {"status": "completed"}
    except Exception as e:
        results[job_id] = {"status": "failed", "error": str(e)}
# Write manifest for ox status
with open("results.json", "w") as f:
    json.dump(results, f)
```
The Ray dashboard shows this as 1 job with a task-level progress bar:
Ray Dashboard → Jobs → raysubmit_abc123
Tasks: ████████░░░░░░ 127/847 (15%)
Running: 16 | Pending: 704 | Completed: 127
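To show how a dependency graph turns into the ref_N chain of the driver, here is a small Python sketch of the generation step. The real generator is Rust code in driver_script.rs; the function `driver_lines`, the job dictionary layout, and the fixed "/project" working directory below are assumptions made for illustration, and every dependency is assumed to be an uncached job present in the input.

```python
from graphlib import TopologicalSorter


def driver_lines(jobs: dict) -> list:
    """Emit one `ref_N = run_shell...remote(...)` line per job.

    jobs maps job_id -> {"command": str, "cpu": int, "deps": [job_id, ...]}.
    Jobs are emitted in topological order so upstream refs already exist.
    """
    graph = {job_id: spec["deps"] for job_id, spec in jobs.items()}
    ref_of = {}
    lines = []
    for job_id in TopologicalSorter(graph).static_order():
        spec = jobs[job_id]
        ref = f"ref_{len(ref_of)}"
        args = [repr(job_id), repr(spec["command"]), repr("/project")]
        args += [ref_of[dep] for dep in spec["deps"]]  # upstream ObjectRefs
        lines.append(
            f"{ref} = run_shell.options(num_cpus={spec['cpu']})"
            f".remote({', '.join(args)})"
        )
        ref_of[job_id] = ref
    return lines


jobs = {
    "sort-A": {"command": "samtools sort ...", "cpu": 2, "deps": ["align-A"]},
    "align-A": {"command": "bwa mem ...", "cpu": 8, "deps": []},
}
lines = driver_lines(jobs)
```

Even though sort-A is listed first in the input, align-A is emitted first, so `ref_0` exists by the time the sort-A line references it.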
Call Mode and the Ray Object Store
This is where OxyMake and Ray truly complement each other. In call mode,
OxyMake manages I/O outside the function — and on the Ray executor, that
I/O goes through Ray's distributed object store instead of disk.
Data Flow: Shell vs Call vs Ray-Call
flowchart TB
subgraph "Shell Mode (any executor)"
S1["Job A"] -->|"write file<br/>results/A.csv"| SD[("Disk")]
SD -->|"read file<br/>results/A.csv"| S2["Job B"]
end
subgraph "Call Mode (local executor)"
C1["Job A<br/><i>compute_features(df)</i>"] -->|"Arrow IPC<br/>in-process"| C2["Job B<br/><i>train_model(features)</i>"]
end
subgraph "Call Mode (Ray executor)"
R1["Job A<br/><i>@ray.remote</i>"] -->|"ray.put()<br/>→ object store"| RO[("Ray Object<br/>Store")]
RO -->|"ray.get()<br/>zero-copy"| R2["Job B<br/><i>@ray.remote</i>"]
end
style SD fill:#fcc,stroke:#333
style RO fill:#cfc,stroke:#333
| Mode | Data between stages | Disk I/O | Best for |
|---|---|---|---|
| Shell (any executor) | Files on disk | Always | CLI tools, legacy scripts |
| Call (local executor) | Arrow IPC, in-process | Optional (materialization policy) | Single-node data pipelines |
| Call (Ray executor) | ray.put()/ray.get(), object store | Optional (materialization policy) | Distributed data pipelines |
How Ray Call Mode Works
When a call-mode job runs on the Ray executor, OxyMake generates a wrapper
script (via call_mode.rs) that integrates with the object store:
sequenceDiagram
participant D as Driver Script
participant OS as Ray Object Store
participant W as Worker (call-mode task)
participant FS as Shared Filesystem
D->>OS: ray.put(input_data)
Note over D: ObjectRef stored
D->>W: run_call.remote(job_id, module, func, input_ref)
W->>OS: ray.get(input_ref)
Note over W: Zero-copy if same node
W->>W: result = func(input_data)
W->>OS: ray.put(result)
Note over W: ObjectRef returned
alt materialize = "always" or "final" (leaf)
W->>FS: write result to disk
end
D->>D: Pass ObjectRef to downstream tasks
Materialization Policies on Ray
OxyMake's materialization policies map directly to Ray behavior:
| Policy | Object Store | Disk Write | Use Case |
|---|---|---|---|
| always | Yes | Yes | Debugging, external tools need files |
| auto | Yes | Only if downstream needs a file | Default: let OxyMake decide |
| never | Yes (evicted after consumers finish) | No | Pure intermediates, save disk |
| final | Yes | Only for DAG leaves | Pipeline outputs to disk, intermediates in memory |
Example rule with materialization:
[rule.compute_features]
input = [{ path = "data/{sample}.parquet", format = "parquet" }]
output = [{ path = "features/{sample}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline.features:compute_features"
lang = "python"
resources = { cpu = 4, mem_gb = 8 }
With materialize = "auto" on the Ray executor, the features DataFrame
lives in the Ray object store. If the next rule is also call-mode on Ray,
data passes through the object store with zero disk I/O. If a downstream
rule is shell-mode and needs a file path, OxyMake automatically
materializes to disk.
The Bridge (ADR-008)
The ExecutorBridge trait formalizes the separation between OxyMake's
scheduler and remote executors. It defines three communication directions:
flowchart LR
subgraph "OxyMake (Rust)"
S["Scheduler<br/><i>DAG owner, cache, gates</i>"]
ST["ox status"]
end
subgraph "ExecutorBridge"
direction TB
SUB["SUBMIT<br/><i>submit_dag()</i><br/><i>map_resources()</i>"]
MON["MONITOR<br/><i>poll_dag_status()</i><br/><i>fetch_logs()</i><br/><i>sync_results()</i><br/><i>reconnect()</i>"]
CTL["CONTROL<br/><i>cancel_job()</i><br/><i>cancel_all()</i>"]
end
subgraph "Ray Cluster"
R["Ray Jobs API<br/><i>Driver + tasks</i>"]
end
S -->|"uncached subgraph"| SUB
SUB -->|"driver.py"| R
R -->|"status, logs"| MON
MON -->|"DagStatus"| ST
S -->|"cancel"| CTL
CTL -->|"ray job stop"| R
Separation of Concerns
| Concern | OxyMake (Scheduler) | Ray (Executor) |
|---|---|---|
| DAG construction | Parses Oxymakefile, resolves wildcards | -- |
| Cache checking | Content-addressable (blake3) | -- |
| Optimization | Cache pruning, task fusion, critical path | -- |
| Scheduling order | Topological sort, priority, gates | -- |
| Task placement | -- | Which node, which GPU |
| Resource allocation | -- | CPU, memory, GPU scheduling |
| Autoscaling | -- | Scale workers up/down |
| Object store | -- | Zero-copy data passing |
| Fault tolerance | Retry strategy (OxyMake-managed) | Worker failure detection |
State Synchronization
After submission, OxyMake stays connected via the bridge:
- Submit: submit_dag() generates the driver script, submits to Ray Jobs API, writes meta.json to .oxymake/runs/{run_id}/
- Poll: poll_dag_status() queries Ray for per-task status, returns DagStatus with job-level completion info
- Sync: sync_results() writes job results (exit codes, durations, peak memory) back to OxyMake's state database
- Reconnect: After an OxyMake crash, reconnect() reads meta.json and reconstructs a handle to the still-running Ray job
The meta.json contract:
{
"executor": "ray",
"version": 1,
"submitted_at": "2025-04-01T12:00:00Z",
"connection": {
"ray_address": "http://127.0.0.1:8265",
"ray_job_id": "raysubmit_abc123"
},
"run_id": "run-20250401-120000",
"total_jobs": 847,
"active_jobs": 847,
"skipped_jobs": 102582
}
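The reconnect step relies only on what meta.json records. The sketch below writes a reduced version of the contract and reads it back to recover the Ray address and job ID. The function names mirror the bridge methods for readability, but their signatures, the version check, and the reduced field set are assumptions for illustration.

```python
import json
import os
import tempfile


def write_meta(run_dir, ray_address, ray_job_id, run_id):
    """Persist the connection details right after submission."""
    os.makedirs(run_dir, exist_ok=True)
    meta = {
        "executor": "ray",
        "version": 1,
        "connection": {"ray_address": ray_address, "ray_job_id": ray_job_id},
        "run_id": run_id,
    }
    with open(os.path.join(run_dir, "meta.json"), "w") as f:
        json.dump(meta, f, indent=2)


def reconnect(run_dir):
    """Return (ray_address, ray_job_id) for a previously submitted run."""
    with open(os.path.join(run_dir, "meta.json")) as f:
        meta = json.load(f)
    if meta.get("executor") != "ray" or meta.get("version") != 1:
        raise ValueError("unsupported meta.json")
    conn = meta["connection"]
    return conn["ray_address"], conn["ray_job_id"]


with tempfile.TemporaryDirectory() as tmp:
    run_dir = os.path.join(tmp, ".oxymake", "runs", "run-20250401-120000")
    write_meta(run_dir, "http://127.0.0.1:8265", "raysubmit_abc123",
               "run-20250401-120000")
    handle = reconnect(run_dir)
```

Because the file lives on disk under the run directory, the handle survives a crash of the submitting process while the Ray job keeps running.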
Resource Mapping
OxyMake resources map to Ray resources via map_resources():
| OxyMake | Ray | Notes |
|---|---|---|
| cpu | num_cpus | Direct mapping |
| mem | memory | Bytes |
| gpu | num_gpus | Fractional GPUs supported (gpu = 0.5) |
| custom:tpu | Custom resource TPU | Arbitrary Ray custom resources |
Ray's advantage: fractional GPUs (num_gpus=0.5) enable model serving
workloads where multiple inference tasks share a single GPU.
Philosophy: Complementary, Not Overlapping
OxyMake and Ray solve orthogonal problems:
| Dimension | OxyMake | Ray |
|---|---|---|
| Core question | What to run? | Where to run it? |
| Key innovation | Content-addressable cache | Distributed object store |
| Configuration | Declarative TOML | Python API / YAML |
| DAG model | Three-level (Rule → Job → Exec) | Flat task graph |
| Cache | blake3 content hashing | None (execution-only) |
| Scheduling | Topological + priorities + gates | Resource-based bin packing |
| State | Persistent (state.db, cache) | Ephemeral (cluster lifetime) |
Why Not Snakemake + Ray or Airflow + Ray?
Snakemake + Ray: Snakemake's file-based cache uses timestamps, not content hashes. It has no materialization policies, no call mode, and its Python DSL prevents static analysis. Adding Ray to Snakemake gives you distributed execution but not the optimization pipeline (task fusion, materialization elimination) that makes the combination powerful.
Airflow + Ray: Airflow is an orchestrator that owns the DAG schedule. Adding Ray as an executor gives you distributed compute, but Airflow's DAG model is runtime-defined Python, not declarative TOML. You cannot inspect or optimize an Airflow DAG before execution.
OxyMake + Ray: OxyMake's declarative format enables static analysis and optimization passes before execution. Ray provides elastic compute and zero-copy data passing during execution. Neither system steps on the other's responsibilities.
flowchart LR
subgraph "OxyMake Responsibilities"
direction TB
A1["Parse Oxymakefile.toml"]
A2["Resolve wildcards"]
A3["Check content-addressable cache"]
A4["Optimize: fuse, prune, eliminate"]
A5["Generate driver script"]
A1 --> A2 --> A3 --> A4 --> A5
end
subgraph "Ray Responsibilities"
direction TB
B1["Receive driver script"]
B2["Schedule tasks on workers"]
B3["Manage object store"]
B4["Autoscale cluster"]
B5["Report task status"]
B1 --> B2 --> B3 --> B4 --> B5
end
A5 -->|"Ray Jobs API"| B1
B5 -->|"poll_dag_status()"| A5
style A5 fill:#ff9,stroke:#333
style B1 fill:#9ff,stroke:#333
Quick Start
1. Start a Ray cluster
ray start --head
# Dashboard: http://127.0.0.1:8265
2. Configure OxyMake
# Oxymakefile.toml
[executor.ray]
dashboard_address = "http://127.0.0.1:8265"
3. Run your workflow on Ray
ox run --executor ray
OxyMake handles caching, DAG optimization, and driver generation. Ray handles task placement, GPU scheduling, and data passing. Your workflow file does not change.
4. Monitor execution
ox status # OxyMake's view (aggregated)
# or visit Ray Dashboard for task-level detail
Further Reading
- Executors -- all available executors and configuration
- Execution Modes -- shell, run, script, call
- The Three Graphs -- RuleGraph, JobGraph, ExecGraph
- Materialization Policy -- controlling disk I/O
- Content-Addressable Cache -- how cache keys work
OxyMake × SLURM Deep Dive
OxyMake and SLURM solve different halves of the HPC workflow problem. OxyMake owns the what: which jobs to run, in what order, and what can be skipped. SLURM owns the where: which node, how many cores, how much memory. This page explains how the two systems fit together — from job packaging through monitoring to real-cluster deployment.
The Three Graphs Meet SLURM
Before any executor sees a job, OxyMake transforms the user's declarations through three graph representations. Understanding this pipeline is essential for understanding what SLURM actually receives.
Graph Transformation Pipeline
flowchart TD
A["Oxymakefile.toml<br/><i>Declarative TOML</i>"] --> B["RuleGraph<br/><i>Abstract: wildcards intact</i>"]
B -->|"Wildcard resolution<br/>+ guard evaluation"| C["JobGraph<br/><i>Concrete: every job instance</i>"]
C -->|"Optimization passes"| D["Optimized JobGraph"]
D -->|"Cache pruning removes<br/>up-to-date jobs"| E["Uncached Subgraph"]
E -->|"submit_dag()"| F["sbatch scripts<br/><i>Per-job or job arrays<br/>with --dependency chains</i>"]
F -->|"sbatch --parsable<br/>+ --dependency=afterok"| G["SLURM Scheduler<br/><i>slurmctld</i>"]
style A fill:#f9f,stroke:#333
style F fill:#ff9,stroke:#333
style G fill:#9ff,stroke:#333
Optimization Before Submission
Before any executor sees the graph, OxyMake runs optimization passes:
| Pass | Effect |
|---|---|
| Cache pruning | Marks up-to-date jobs as "skip" |
| Task fusion | Merges sequential call-mode jobs into one |
| Materialization elimination | Removes unnecessary disk I/O |
| Critical path analysis | Annotates the longest chain for priority |
Only the uncached subgraph is submitted to SLURM. After pruning, ox plan
reports the jobs that remain, in the standard plan format -- for a large,
mostly-cached pipeline:
Plan: 12 rules, 847 jobs, 1203 source files
SLURM Job Packaging
Two Submission Modes
OxyMake supports two SLURM submission strategies, chosen automatically:
flowchart TB
subgraph "OxyMake (Rust)"
A["Optimized JobGraph<br/>847 uncached jobs"]
end
A --> DECIDE{"Same rule,<br/>many wildcards?"}
DECIDE -->|"Yes"| ARRAY["Job Array<br/><i>1 sbatch + N tasks</i>"]
DECIDE -->|"No"| INDIVIDUAL["Individual Jobs<br/><i>N sbatch calls with<br/>--dependency=afterok chains</i>"]
subgraph "SLURM Cluster"
ARRAY --> SC["slurmctld"]
INDIVIDUAL --> SC
SC --> C1["c1"]
SC --> C2["c2"]
SC --> CN["..."]
end
style A fill:#ff9,stroke:#333
style SC fill:#9ff,stroke:#333
Mode 1: Individual jobs with --dependency=afterok chains.
Each job gets its own sbatch script. Upstream dependencies are encoded
as --dependency=afterok:JOBID1:JOBID2. Jobs are submitted in
topological order so that upstream SLURM IDs are known before downstream
jobs reference them. Cached upstream jobs are omitted — their outputs
already exist on the shared filesystem, so no SLURM dependency is needed.
Mode 2: Job arrays for wildcard-expanded rules.
When a single rule (e.g., process) expands to many concrete jobs via
wildcards, OxyMake packages them as a single SLURM job array. One
sbatch call submits all tasks. Each task reads its parameters from a
JSON-lines file indexed by SLURM_ARRAY_TASK_ID.
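The choice between the two modes can be pictured as a grouping step over the uncached jobs. The following is a minimal sketch, not OxyMake's actual code: the plan_submissions helper, the (job_id, rule) tuples, and the "two or more jobs per rule" threshold are all assumptions made for illustration.

```python
from collections import defaultdict

def plan_submissions(jobs, max_array_size=1000):
    """Group (job_id, rule) pairs into job arrays or individual submissions.

    Hypothetical policy: a rule with two or more uncached jobs becomes one
    or more arrays (chunked to max_array_size); anything else is submitted
    on its own.
    """
    by_rule = defaultdict(list)
    for job_id, rule in jobs:
        by_rule[rule].append(job_id)

    plan = []
    for rule, ids in by_rule.items():
        if len(ids) < 2:
            plan.append(("individual", rule, ids))
            continue
        # Split large expansions so no array exceeds the configured size.
        for start in range(0, len(ids), max_array_size):
            plan.append(("array", rule, ids[start:start + max_array_size]))
    return plan

jobs = [("align-A", "align"), ("align-B", "align"),
        ("align-C", "align"), ("merge", "merge")]
for kind, rule, ids in plan_submissions(jobs, max_array_size=2):
    print(kind, rule, ids)
```

With max_array_size=2 the three align jobs are split into two arrays, and the single merge job is submitted individually.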
Why --dependency=afterok Chains?
Unlike the Ray executor (which generates a single driver script), the
SLURM executor submits one sbatch per job (or job array) and lets
SLURM's own scheduler enforce ordering:
| Benefit | Why |
|---|---|
| Native SLURM scheduling | slurmctld handles priority, backfill, preemption |
| Cluster-native visibility | Every job appears in squeue and sacct |
| Granular accounting | Per-job CPU time, memory, node assignment |
| Standard cancellation | scancel works on individual jobs |
| Fair-share integration | Jobs participate in the cluster's fair-share scheduler |
Generated Job Script Structure
The Rust code in ox-exec-slurm/src/job_script.rs generates bash scripts
that look like this:
#!/bin/bash
#SBATCH --job-name=ox_process_j-042
#SBATCH --output=/scratch/staging/run-001/j-042/slurm-%j.out
#SBATCH --error=/scratch/staging/run-001/j-042/slurm-%j.err
#SBATCH --partition=gpu
#SBATCH --account=my-lab
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --gpus=1
#SBATCH --time=01:06:00
# --- Environment setup ---
set -euo pipefail
module load conda 2>/dev/null || true
eval "$(conda shell.bash hook)"
conda activate ml-env
# --- Working directory ---
cd "/data/projects/my-pipeline"
# --- Execute command ---
python train.py --sample=s01 --output=results/s01.parquet
Key design decisions:
- cd to project directory (not staging dir) so that relative output paths resolve to the same locations as the local executor, which is essential for cache correctness.
- Job name truncated to 255 characters (SLURM's limit).
- set -euo pipefail so failures propagate immediately.
- --time derived from job timeout with a 10% buffer if not explicitly set via the time resource.
Job Array Script Structure
For wildcard-expanded rules, OxyMake generates an array script with a parameter file:
#!/bin/bash
#SBATCH --job-name=ox_array_align
#SBATCH --array=0-4%2
#SBATCH --output=/scratch/staging/slurm-%A_%a.out
#SBATCH --error=/scratch/staging/slurm-%A_%a.err
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
# --- Environment setup ---
set -euo pipefail
# --- Working directory ---
cd "/data/projects/pipeline"
# --- Array task dispatch ---
PARAMS_FILE="$(dirname "$0")/array_params.jsonl"
TASK_LINE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "$PARAMS_FILE")
# Export wildcard values as environment variables
export OX_JOB_ID=$(echo "$TASK_LINE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["job_id"])')
export OX_WC_sample=$(echo "$TASK_LINE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["wildcards"]["sample"])')
# --- Execute command ---
TASK_CMD=$(echo "$TASK_LINE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["command"])')
eval "$TASK_CMD"
The companion array_params.jsonl:
{"index":0,"job_id":"j-1","wildcards":{"sample":"A"},"command":"bwa mem -t 8 ref.fa data/A.fq > results/A.bam"}
{"index":1,"job_id":"j-2","wildcards":{"sample":"B"},"command":"bwa mem -t 8 ref.fa data/B.fq > results/B.bam"}
{"index":2,"job_id":"j-3","wildcards":{"sample":"C"},"command":"bwa mem -t 8 ref.fa data/C.fq > results/C.bam"}
The %2 suffix in --array=0-4%2 throttles to 2 concurrent tasks
(configurable via job_array.max_concurrent).
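To make the indexing convention concrete, here is a small sketch of how a parameter file in this shape could be written and how one task's record is looked up by its array index. The helper names build_array_params and lookup_task are hypothetical; only the record fields (index, job_id, wildcards, command) come from the example above.

```python
import json

def build_array_params(tasks):
    """Serialize tasks as JSON lines; line N (0-based) belongs to array task N."""
    lines = []
    for index, task in enumerate(tasks):
        record = {"index": index, **task}
        lines.append(json.dumps(record, separators=(",", ":")))
    return "\n".join(lines) + "\n"

def lookup_task(params_text, task_id):
    """Return the record for the task whose SLURM_ARRAY_TASK_ID is task_id."""
    return json.loads(params_text.splitlines()[task_id])

tasks = [
    {"job_id": "j-1", "wildcards": {"sample": "A"},
     "command": "bwa mem -t 8 ref.fa data/A.fq > results/A.bam"},
    {"job_id": "j-2", "wildcards": {"sample": "B"},
     "command": "bwa mem -t 8 ref.fa data/B.fq > results/B.bam"},
]
params = build_array_params(tasks)
print(lookup_task(params, 1)["wildcards"]["sample"])
```

The generated bash script performs the same lookup with sed, selecting line SLURM_ARRAY_TASK_ID + 1 because sed counts lines from 1.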
The Bridge (ADR-008)
The Executor trait formalizes the separation between OxyMake's
scheduler and remote executors. The SLURM executor implements these
communication directions:
flowchart LR
subgraph "OxyMake (Rust)"
S["Scheduler<br/><i>DAG owner, cache, gates</i>"]
ST["ox status"]
end
subgraph "Executor Trait"
direction TB
INIT["INIT<br/><i>init(), health_check()</i>"]
SUB["SUBMIT<br/><i>submit_dag(), execute()</i>"]
MON["MONITOR<br/><i>poll_status()</i>"]
CTL["CONTROL<br/><i>cancel(), cleanup()</i>"]
end
subgraph "SLURM Cluster"
R["slurmctld<br/><i>sbatch, sacct, squeue</i>"]
end
S -->|"uncached subgraph"| SUB
SUB -->|"sbatch --parsable"| R
R -->|"sacct/squeue"| MON
MON -->|"JobStatus"| ST
S -->|"cancel"| CTL
CTL -->|"scancel"| R
INIT -->|"sinfo --version"| R
Separation of Concerns
| Concern | OxyMake (Scheduler) | SLURM (Executor) |
|---|---|---|
| DAG construction | Parses Oxymakefile, resolves wildcards | -- |
| Cache checking | Content-addressable (blake3) | -- |
| Optimization | Cache pruning, task fusion, critical path | -- |
| Job packaging | Generates sbatch scripts, dependency chains | -- |
| Task placement | -- | Which node, backfill scheduling |
| Resource allocation | -- | CPU, memory, GPU, GRES scheduling |
| Fair-share | -- | Multi-user priority, QOS enforcement |
| Node management | Failed node exclusion list | Node health, drain/resume |
State Synchronization
After submission, OxyMake stays connected via adaptive polling:
sequenceDiagram
participant OX as OxyMake Scheduler
participant FS as Shared Filesystem
participant SC as slurmctld
participant C as Compute Nodes
OX->>FS: Write sbatch scripts to staging_dir
OX->>SC: sbatch --parsable job.sh
SC-->>OX: 12345 (SLURM job ID)
OX->>FS: Write meta.json
loop Adaptive Polling (5s-60s)
OX->>SC: sacct -j 12345 --parsable2
SC-->>OX: 12345|RUNNING|0:0|512M|00:05:30|c1
Note over OX: State change → reset backoff
end
alt sacct unavailable
OX->>SC: squeue -j 12345 -h -o %T
SC-->>OX: RUNNING
end
alt Job terminal (COMPLETED/FAILED)
OX->>OX: Map SLURM state → JobResult
OX->>FS: Collect slurm-*.out/err logs
OX->>FS: Clean staging directory
end
Adaptive backoff prevents overloading slurmctld:
- Start at 5 seconds (configurable via poll_interval_min)
- Multiply by 1.5× each poll with no state change
- Cap at 60 seconds (configurable via poll_interval_max)
- Reset to minimum on any state change
- Batch queries: sacct -j id1,id2,...,idN (one call for all jobs)
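The backoff rules above fit in a few lines. This is an illustrative sketch using the documented defaults (5 s minimum, 60 s maximum, 1.5× multiplier); the function name next_poll_interval is an assumption, not part of OxyMake's API.

```python
def next_poll_interval(current, state_changed, minimum=5.0, maximum=60.0, factor=1.5):
    """Return the delay in seconds before the next status poll."""
    if state_changed:
        # Any observed state change resets the interval to the minimum.
        return minimum
    # Otherwise grow geometrically, but never beyond the cap.
    return min(current * factor, maximum)

interval = 5.0
schedule = []
for _ in range(8):
    schedule.append(interval)
    interval = next_poll_interval(interval, state_changed=False)
print(schedule)
```

Under these defaults, a job that never changes state reaches the 60-second cap on the eighth interval.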
The meta.json contract:
{
"executor": "slurm",
"version": 1,
"run_id": "run-20250401-120000",
"total_jobs": 847,
"active_jobs": 847,
"skipped_jobs": 102582,
"job_mapping": {
"align-A": "12345",
"align-B": "12346",
"sort-A": "12347"
}
}
Resource Mapping
OxyMake resources map to SLURM #SBATCH directives via resource_mapper.rs:
| OxyMake | SLURM | Notes |
|---|---|---|
| cpu | --cpus-per-task | Per-task CPU cores |
| mem | --mem | Total memory per node (e.g., "8G") |
| mem_mb | --mem | Memory in MB (auto-appends M suffix) |
| mem_per_cpu | --mem-per-cpu | Memory per CPU core |
| gpu | --gpus | GPU count |
| gres | --gres | Generic resources (e.g., "gpu:2") |
| nodes | --nodes | Node count (multi-node jobs) |
| tasks | --ntasks | MPI task count |
| ntasks_per_node | --ntasks-per-node | Tasks per node |
| partition | --partition | SLURM partition |
| time | --time | Wall time limit (HH:MM:SS) |
| qos | --qos | Quality of Service |
Mutual exclusion: --mem and --mem-per-cpu cannot both be specified.
OxyMake validates this at submission time and returns a clear error.
Timeout derivation: If no explicit time resource is set but the job
has a timeout, OxyMake derives --time with a 10% buffer. A 1-hour
timeout becomes --time=01:06:00.
[rule.train]
output = ["model/weights.pt"]
resources = { cpu = 8, mem = "32G", gpu = 2, time = "4:00:00" }
environment = { conda = "torch-env" }
shell = "python train.py --epochs=100"
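The mapping rules, the mutual-exclusion check, and the timeout derivation can be sketched together. This is a simplified Python illustration covering only a subset of the table; the real logic lives in resource_mapper.rs, and the function names and the round-up behaviour of the 10% buffer are assumptions.

```python
def format_walltime(seconds):
    """Format a duration in seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def sbatch_directives(resources, timeout_seconds=None):
    """Translate a resource dict into #SBATCH lines (subset of the table above)."""
    if "mem_per_cpu" in resources and ("mem" in resources or "mem_mb" in resources):
        raise ValueError("mem and mem_per_cpu are mutually exclusive")

    flags = {"cpu": "--cpus-per-task", "mem": "--mem", "mem_per_cpu": "--mem-per-cpu",
             "gpu": "--gpus", "partition": "--partition", "time": "--time"}
    lines = [f"#SBATCH {flags[key]}={value}"
             for key, value in resources.items() if key in flags]
    if "mem_mb" in resources:
        # mem_mb is a bare number of megabytes, so the M suffix is appended.
        lines.append(f"#SBATCH --mem={resources['mem_mb']}M")
    if "time" not in resources and timeout_seconds is not None:
        # 10% buffer, rounded up; integer arithmetic avoids float drift.
        buffered = timeout_seconds + (timeout_seconds + 9) // 10
        lines.append(f"#SBATCH --time={format_walltime(buffered)}")
    return lines

for line in sbatch_directives({"cpu": 4, "mem": "8G", "gpu": 1}, timeout_seconds=3600):
    print(line)
```

An explicit time resource always wins; the buffer is applied only when the wall time has to be derived from the job timeout.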
SLURM Job States
SLURM reports over a dozen job states. OxyMake maps them to five: Queued, Running, Completed, Failed, and Cancelled.
stateDiagram-v2
[*] --> Queued: sbatch accepted
Queued --> Running: Resources allocated
Running --> Completed: Exit code 0
Running --> Failed: Non-zero exit
Running --> Failed: TIMEOUT
Running --> Failed: OUT_OF_MEMORY
Running --> Failed: NODE_FAIL
Running --> Cancelled: scancel / PREEMPTED
state Queued {
PENDING
REQUEUED
SUSPENDED
CONFIGURING
}
state Running {
RUNNING_STATE: RUNNING
COMPLETING
RESIZING
}
state Failed {
FAILED_STATE: FAILED
TIMEOUT_STATE: TIMEOUT
OOM: OUT_OF_MEMORY
NODE_FAIL_STATE: NODE_FAIL
BOOT_FAIL
DEADLINE
}
state Cancelled {
CANCELLED_STATE: CANCELLED
PREEMPTED_STATE: PREEMPTED
REVOKED
}
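The groupings in the diagram amount to a lookup table. The sketch below is one way to express it in Python; the function name map_slurm_state and the "Unknown" fallback are assumptions. It also assumes sacct may append detail after the state name (for example "CANCELLED by 1000"), so only the first token is compared.

```python
QUEUED = {"PENDING", "REQUEUED", "SUSPENDED", "CONFIGURING"}
RUNNING = {"RUNNING", "COMPLETING", "RESIZING"}
FAILED = {"FAILED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL", "BOOT_FAIL", "DEADLINE"}
CANCELLED = {"CANCELLED", "PREEMPTED", "REVOKED"}

def map_slurm_state(raw):
    """Collapse a raw SLURM state string into one OxyMake-level status."""
    tokens = raw.split()
    state = tokens[0].upper() if tokens else ""
    if state == "COMPLETED":
        return "Completed"
    for label, members in (("Queued", QUEUED), ("Running", RUNNING),
                           ("Failed", FAILED), ("Cancelled", CANCELLED)):
        if state in members:
            return label
    # Hypothetical fallback for states not covered by the diagram.
    return "Unknown"

for raw in ("PENDING", "COMPLETING", "OUT_OF_MEMORY", "CANCELLED by 1000"):
    print(raw, "->", map_slurm_state(raw))
```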
Failed Node Exclusion
When a job reports NODE_FAIL or BOOT_FAIL, OxyMake:
- Queries sacct for the failing node's hostname
- Adds it to an in-memory exclusion set
- Passes --exclude=node1,node2 on all future sbatch submissions
- Reports excluded nodes when the workflow completes
This prevents cascading failures from bad hardware without requiring manual intervention.
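A minimal sketch of that bookkeeping follows. The NodeExclusions class and its method names are hypothetical; only the triggering states (NODE_FAIL, BOOT_FAIL) and the --exclude flag come from the description above.

```python
class NodeExclusions:
    """In-memory set of nodes to avoid on later submissions."""

    TRIGGER_STATES = ("NODE_FAIL", "BOOT_FAIL")

    def __init__(self):
        self._nodes = set()

    def record(self, state, node):
        """Remember the node if the job ended in a hardware-related state."""
        if state in self.TRIGGER_STATES and node:
            self._nodes.add(node)

    def sbatch_args(self):
        """Extra sbatch arguments; empty while no node has been excluded."""
        if not self._nodes:
            return []
        return ["--exclude=" + ",".join(sorted(self._nodes))]

exclusions = NodeExclusions()
exclusions.record("NODE_FAIL", "c2")
exclusions.record("FAILED", "c1")      # ordinary failure: node stays eligible
exclusions.record("BOOT_FAIL", "c3")
print(exclusions.sbatch_args())
```

Ordinary job failures do not exclude a node; only the two hardware-related states do.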
Monitoring: sacct Primary, squeue Fallback
Status polling uses a two-tier strategy:
flowchart TD
START["Poll job status"] --> SACCT["sacct -j ID --parsable2"]
SACCT --> SACCT_OK{"Records<br/>found?"}
SACCT_OK -->|"Yes"| PARSE["Parse state,<br/>exit code, memory,<br/>elapsed, node"]
SACCT_OK -->|"No (empty or failed)"| SQUEUE["squeue -j ID -h -o %T"]
SQUEUE --> SQ_OK{"Job in<br/>queue?"}
SQ_OK -->|"Yes"| RUNNING["Report as<br/>Running/Queued"]
SQ_OK -->|"No"| RETRY["Wait 2s,<br/>retry sacct"]
RETRY --> RETRY_OK{"Found<br/>now?"}
RETRY_OK -->|"Yes"| PARSE
RETRY_OK -->|"No"| LOST["Report as<br/>JobNotFound"]
PARSE --> TERMINAL{"Terminal<br/>state?"}
TERMINAL -->|"Yes"| RESULT["Return JobResult<br/>(exit code, duration,<br/>peak memory, node)"]
TERMINAL -->|"No"| BACKOFF["Adaptive backoff<br/>(5s → 60s)"]
BACKOFF --> START
Why the fallback? Some HPC clusters don't have slurmdbd
(the SLURM accounting daemon) configured, making sacct unavailable.
squeue always works but provides less information (no exit codes,
no memory stats, no elapsed time for completed jobs).
The 2-second retry handles a race condition: a job can vanish from
squeue (it finished) before sacct has ingested the accounting record.
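The two tiers can be sketched as a parser plus a lookup that falls through to squeue. This is an illustration only: parse_sacct and poll_once are hypothetical names, and run_sacct and run_squeue stand in for whatever actually invokes the commands and returns their stdout.

```python
def parse_sacct(output):
    """Parse `sacct --parsable2 --noheader -o JobID,State,ExitCode` output.

    Returns {job_id: (state, exit_code)} for main job entries only.
    """
    jobs = {}
    for line in output.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue
        job_id, state, exit_field = fields[:3]
        if "." in job_id:
            continue  # job step such as 12345.batch or 12345.0
        code = exit_field.split(":")[0]  # "exit:signal" -> exit
        jobs[job_id] = (state, int(code) if code.isdigit() else None)
    return jobs

def poll_once(job_id, run_sacct, run_squeue):
    """Two-tier lookup: accounting records first, live queue second."""
    records = parse_sacct(run_sacct(job_id))
    if job_id in records:
        return records[job_id]
    state = run_squeue(job_id).strip()
    if state:
        return (state, None)  # squeue carries no exit code
    return None  # caller waits briefly, then retries sacct

sample = "12345|COMPLETED|0:0\n12345.batch|COMPLETED|0:0\n12346|FAILED|137:9\n"
print(parse_sacct(sample))
```

A None result corresponds to the "wait 2s, retry sacct" branch in the flowchart; a second miss would be reported as JobNotFound.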
Docker Setup: Containerized SLURM Cluster
OxyMake ships a Docker Compose setup for local testing and CI:
graph TB
subgraph "docker-compose.yml"
MYSQL["mysql<br/><i>MariaDB 10.11</i><br/>Port 3306"]
DBD["slurmdbd<br/><i>Accounting daemon</i><br/>Port 6819"]
CTL["slurmctld<br/><i>Controller</i><br/>Port 6817"]
REST["slurmrestd<br/><i>REST API gateway</i><br/>Port 6820"]
C1["c1<br/><i>Compute node</i>"]
C2["c2<br/><i>Compute node</i>"]
end
SHARED[("/work<br/><i>Shared volume</i>")]
DATA[("/data/lab<br/><i>Host bind mount</i>")]
JWT[("shared-slurm<br/><i>JWT key volume</i>")]
MYSQL --> DBD
DBD --> CTL
CTL --> REST
CTL --> C1
CTL --> C2
JWT --- CTL
JWT --- REST
JWT --- DBD
SHARED --- CTL
SHARED --- C1
SHARED --- C2
DATA --- CTL
DATA --- C1
DATA --- C2
style SHARED fill:#cfc,stroke:#333
style DATA fill:#cfc,stroke:#333
style JWT fill:#ff9,stroke:#333
style REST fill:#9ff,stroke:#333
Start the Cluster
cd tests/slurm-docker
docker compose up -d
# Wait ~20 seconds for all services to initialize
docker compose exec slurmctld sinfo -N -h
# Output:
# c1 normal idle
# c2 normal idle
Cluster Configuration
The slurm.conf defines a minimal 2-node cluster:
ClusterName=oxymake-demo
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/slurmdbd
NodeName=c[1-2] CPUs=2 RealMemory=2048 State=UNKNOWN
PartitionName=normal Nodes=c[1-2] Default=YES MaxTime=INFINITE State=UP
Key settings:
- select/cons_tres with CR_Core: Consumable resources at the core level, so each job gets exactly the cores it requests.
- sched/backfill: Allows smaller jobs to start while larger jobs wait for resources, improving utilization.
- slurmdbd with MariaDB: Full accounting so sacct works.
JWT Authentication Setup
The Docker cluster configures JWT authentication automatically:
- slurmctld generates a random 256-bit key at startup (/etc/slurm/jwt_hs256.key)
- The key is shared via the shared-slurm Docker volume
- slurmdbd and slurmrestd pick up the key and add AuthAltTypes=auth/jwt to their configuration
- Clients authenticate with X-SLURM-USER-TOKEN (JWT) and X-SLURM-USER-NAME headers
Generate a token for local testing:
# Generate a JWT token for user "root" (valid 1 hour)
docker compose exec slurmctld scontrol token lifespan=3600
# Output: SLURM_JWT=eyJhbGciOi...
export SLURM_JWT=eyJhbGciOi...
Port Mapping
| Port | Service | Purpose |
|---|---|---|
| 6817 | slurmctld | SLURM controller API |
| 6819 | slurmdbd | Accounting database daemon |
| 6820 | slurmrestd | REST API gateway (HTTP/JSON) |
| 3306 | mysql | MariaDB (slurmdbd backend) |
Submit a Test Job
docker compose exec slurmctld bash -c '
echo "#!/bin/bash
hostname
date
sleep 5
echo done" > /work/test.sh && sbatch /work/test.sh'
# Output: Submitted batch job 1
# Check status:
docker compose exec slurmctld sacct --parsable2 --noheader -o JobID,State,ExitCode
# Output: 1|COMPLETED|0:0
Teardown
docker compose down -v # Remove containers and volumes
Two Modes: CLI vs REST API
Mode 1: CLI (sbatch / sacct)
The default and most common mode. OxyMake shells out to SLURM CLI
commands. This works on any cluster where the user has SLURM in their
$PATH:
# OxyMake internally runs:
sbatch --parsable job.sh # Submit → returns job ID
sacct -j 12345 --parsable2 -o JobID,State # Poll status
scancel 12345 # Cancel if needed
Pros: Works everywhere, no extra setup, respects Munge auth.
Cons: One process spawn per command, rate limiting required at scale.
Mode 2: REST API (slurmrestd)
For programmatic access, SLURM provides slurmrestd — an HTTP/JSON
gateway to the same operations:
# Start slurmrestd (typically done by the cluster admin)
slurmrestd -a rest_auth/local 0.0.0.0:6820
# Submit a job via HTTP
curl -X POST http://slurmctld:6820/slurm/v0.0.44/job/submit \
-H "Content-Type: application/json" \
-H "X-SLURM-USER-NAME: $USER" \
-H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
-d '{
"script": "#!/bin/bash\nhostname\ndate",
"job": {
"name": "ox_test",
"partition": "normal",
"cpus_per_task": 4,
"memory_per_node": { "number": 8, "set": true, "infinite": false },
"tasks": 1
}
}'
# Poll status
curl http://slurmctld:6820/slurm/v0.0.44/job/12345 \
-H "X-SLURM-USER-NAME: $USER" \
-H "X-SLURM-USER-TOKEN: $SLURM_JWT"
# Cancel
curl -X DELETE http://slurmctld:6820/slurm/v0.0.44/job/12345 \
-H "X-SLURM-USER-NAME: $USER" \
-H "X-SLURM-USER-TOKEN: $SLURM_JWT"
Pros: No process spawning, structured JSON responses, lower latency
at scale.
Cons: Requires slurmrestd to be running, JWT authentication setup,
not universally available.
Both modes are supported. CLI mode is the default. To use REST mode, pass
--slurm-api http://host:6820 (or set SlurmConfig::api_url). Authentication uses X-SLURM-USER-NAME (from $USER) and X-SLURM-USER-TOKEN (from $SLURM_JWT, optional).
REST API Flow
The full lifecycle of a job submitted via REST mode:
sequenceDiagram
participant OX as OxyMake<br/>(Rust)
participant REST as slurmrestd<br/>:6820
participant CTL as slurmctld
participant C as Compute Nodes<br/>(c1, c2)
Note over OX: Generate job script<br/>+ write to staging_dir
OX->>REST: POST /slurm/v0.0.44/job/submit<br/>Headers: X-SLURM-USER-NAME, X-SLURM-USER-TOKEN<br/>Body: {script, job: {name, partition, cpus, mem}}
REST->>CTL: Internal SLURM protocol
CTL-->>REST: job_id: 12345
REST-->>OX: {"job_id": 12345}
loop Adaptive Polling (5s–60s)
OX->>REST: GET /slurm/v0.0.44/job/12345
REST->>CTL: Query job state
CTL-->>REST: Job state + metadata
REST-->>OX: {"job_state": "RUNNING", ...}
end
CTL->>C: Dispatch job to node
C->>C: Execute sbatch script
C-->>CTL: Exit code 0
OX->>REST: GET /slurm/v0.0.44/job/12345
REST-->>OX: {"job_state": "COMPLETED", "exit_code": 0}
alt Cancel needed
OX->>REST: DELETE /slurm/v0.0.44/job/12345
REST->>CTL: scancel 12345
end
Environment requirement: Unlike CLI sbatch (which inherits the
submitter's shell environment), the REST API starts with an empty
environment. OxyMake injects default PATH and HOME variables to
ensure scripts can find basic utilities.
The Bridge: OxyMake DAG → sbatch Dependency Chain
The core translation from OxyMake's DAG to SLURM's execution model:
flowchart LR
subgraph "OxyMake DAG"
direction TB
A["generate-s01"]
B["generate-s02"]
C["process-s01"]
D["process-s02"]
E["merge"]
F["report"]
A --> C
B --> D
C --> E
D --> E
E --> F
end
subgraph "SLURM Submission"
direction TB
SA["sbatch generate-s01.sh<br/>→ SLURM 100"]
SB["sbatch generate-s02.sh<br/>→ SLURM 101"]
SC["sbatch --dependency=afterok:100<br/>process-s01.sh → SLURM 102"]
SD["sbatch --dependency=afterok:101<br/>process-s02.sh → SLURM 103"]
SE["sbatch --dependency=afterok:102:103<br/>merge.sh → SLURM 104"]
SF["sbatch --dependency=afterok:104<br/>report.sh → SLURM 105"]
end
A -.-> SA
B -.-> SB
C -.-> SC
D -.-> SD
E -.-> SE
F -.-> SF
Topological submission: Jobs are submitted in topological order.
When OxyMake submits process-s01, it already knows that generate-s01
was assigned SLURM ID 100, so it can add --dependency=afterok:100.
Cached jobs are transparent: If generate-s01 is cached (outputs
exist and are up-to-date), it is never submitted to SLURM. When
process-s01 is submitted, its dependency list omits the cached job
entirely — the outputs are already on the shared filesystem.
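The translation can be sketched as a walk over the DAG in topological order. This is a simplified illustration, not the executor's code: plan_sbatch_calls is a hypothetical name, SLURM IDs are simulated with a counter (a real run would read them from the output of sbatch --parsable), and the standard-library graphlib module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def plan_sbatch_calls(deps, cached, first_id=100):
    """Return (job, sbatch argv) pairs in submission order.

    deps maps each job to the set of jobs it depends on; cached jobs are
    never submitted and never appear in a dependency list.
    """
    slurm_ids = {}
    calls = []
    next_id = first_id
    for job in TopologicalSorter(deps).static_order():
        if job in cached:
            continue
        # Only upstream jobs that were actually submitted have a SLURM ID.
        upstream = [str(slurm_ids[d]) for d in sorted(deps.get(job, ()))
                    if d in slurm_ids]
        argv = ["sbatch", "--parsable"]
        if upstream:
            argv.append("--dependency=afterok:" + ":".join(upstream))
        argv.append(f"{job}.sh")
        calls.append((job, argv))
        slurm_ids[job] = next_id  # simulated ID
        next_id += 1
    return calls

deps = {
    "process-s01": {"generate-s01"},
    "process-s02": {"generate-s02"},
    "merge": {"process-s01", "process-s02"},
    "report": {"merge"},
}
for job, argv in plan_sbatch_calls(deps, cached={"generate-s01"}):
    print(" ".join(argv))
```

With generate-s01 cached, process-s01 is submitted without any --dependency flag, while merge still waits on both process jobs.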
Environment Support on HPC
SLURM clusters have unique environment constraints:
Conda / Module System
HPC clusters use module load for software management. OxyMake generates
the appropriate setup:
[rule.train.environment]
conda = "torch-env"
Generates:
module load conda 2>/dev/null || true
eval "$(conda shell.bash hook)"
conda activate torch-env
Apptainer (Not Docker)
Most HPC clusters prohibit Docker (requires root). When a Docker environment is specified with the SLURM executor, OxyMake automatically falls back to Apptainer:
[rule.inference.environment]
docker = "nvcr.io/nvidia/pytorch:24.01-py3"
Generates:
# WARNING: Docker not supported on most HPC clusters.
# Consider using Apptainer (environment = { type = "apptainer", ... }).
apptainer exec nvcr.io/nvidia/pytorch:24.01-py3
For explicit Apptainer support:
[rule.inference.environment]
apptainer = "/shared/images/pytorch-24.01.sif"
Shared Filesystem Constraint
All data — job scripts, inputs, outputs — must live on a filesystem visible to both the scheduling node and compute nodes:
flowchart LR
subgraph "Login / Submit Node"
OX["ox run<br/>--executor slurm"]
DB[("state.db<br/><i>Local disk only<br/>(SQLite WAL)</i>")]
end
subgraph "Shared Filesystem<br/>(NFS / Lustre / GPFS)"
STAGE["staging_dir/<br/><i>sbatch scripts</i>"]
DATA["project/<br/><i>inputs + outputs</i>"]
end
subgraph "Compute Nodes"
C1["c1: slurmd"]
C2["c2: slurmd"]
end
OX --> DB
OX -->|"write scripts"| STAGE
STAGE -->|"read scripts"| C1
STAGE -->|"read scripts"| C2
C1 -->|"read/write"| DATA
C2 -->|"read/write"| DATA
style DB fill:#fcc,stroke:#333
style STAGE fill:#cfc,stroke:#333
style DATA fill:#cfc,stroke:#333
Critical constraint: state.db uses SQLite WAL mode, which does
not work on network filesystems (NFS, Lustre, GPFS). The ox run
process must execute on a node with local disk. Compute nodes never
access state.db — they only read sbatch scripts and read/write data
files on the shared filesystem.
Configuration
Configure the SLURM executor in .oxymake/config.toml or Oxymakefile.toml:
[executor.slurm]
partition = "gpu"
account = "my-lab"
qos = "high"
staging_dir = "/scratch/oxymake"
max_submit = 100
poll_interval_min = "5s"
poll_interval_max = "60s"
extra_flags = ["--mail-type=FAIL", "--mail-user=user@lab.edu"]
[executor.slurm.job_array]
enabled = true
max_array_size = 1000
max_concurrent = 50
| Setting | Default | Description |
|---|---|---|
| partition | cluster default | SLURM partition |
| account | none | Account for resource accounting |
| qos | none | Quality of Service |
| staging_dir | /tmp/oxymake-slurm | Directory for scripts + logs (must be shared) |
| max_submit | unlimited | Max concurrent submitted jobs (rate limiting) |
| poll_interval_min | 5s | Minimum adaptive poll interval |
| poll_interval_max | 60s | Maximum adaptive poll interval |
| extra_flags | [] | Additional #SBATCH flags (passed through verbatim) |
| job_array.enabled | true | Use job arrays for wildcard expansions |
| job_array.max_array_size | unlimited | Maximum tasks per array |
| job_array.max_concurrent | unlimited | Max concurrent array tasks (%N throttle) |
Switching to a Real Cluster
Moving from the Docker test cluster to a production HPC environment:
Grid'5000
[executor.slurm]
partition = "default"
staging_dir = "/home/$USER/oxymake-staging"
extra_flags = ["--reservation=my-reservation"]
# On a Grid'5000 frontend:
oarsub -I -t deploy -l nodes=4,walltime=2:00:00
# Then inside the reservation:
ox run --executor slurm -j 16
IDRIS (Jean Zay)
[executor.slurm]
partition = "gpu_p13"
account = "abc@v100"
qos = "qos_gpu-t3"
staging_dir = "$WORK/oxymake-staging"
extra_flags = ["--hint=nomultithread"]
# On Jean Zay:
module load python/3.11 cuda/12.1
ox run --executor slurm
GCP + Slurm-GCP
[profile.gcloud]
executor = "slurm"
partition = "batch"
account = "default"
jobs = 100
[profile.gcloud-gpu]
executor = "slurm"
partition = "gpu"
account = "default"
jobs = 20
ox run --profile gcloud
Google Cloud's HPC Toolkit deploys a SLURM cluster with autoscaling — nodes spin up on demand when jobs enter the queue and spin down when idle. The Filestore NFS mount provides the shared filesystem required by OxyMake's SLURM executor.
For a full setup guide including cluster provisioning, SSH tunneling, and cost control, see the Cloud HPC cookbook, which uses a Google Cloud cluster as its worked example.
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Polling too fast | Use adaptive backoff (5s minimum). Aggressive 1s polls can get you rate-limited or banned from HPC clusters. |
| state.db on NFS | Run ox run on a node with local disk. SQLite WAL mode fails on network filesystems. |
| Forgetting --parsable | OxyMake always uses sbatch --parsable; raw output format varies by SLURM version and locale. |
| Job name too long | Truncated automatically to 255 characters. |
| Docker on HPC | OxyMake warns and substitutes apptainer exec. Use Apptainer explicitly. |
| sacct field truncation | OxyMake uses --parsable2 which avoids field-width truncation. |
| sacct job step noise | OxyMake filters to main job entries only (skips 12345.batch, 12345.0). |
| Exit code format | sacct returns exit:signal (e.g., 137:9). OxyMake parses only the first number. |
| mem + mem_per_cpu conflict | OxyMake validates mutual exclusion at submission time with a clear error. |
Philosophy: Complementary, Not Overlapping
OxyMake and SLURM solve orthogonal problems:
| Dimension | OxyMake | SLURM |
|---|---|---|
| Core question | What to run? | Where to run it? |
| Key innovation | Content-addressable cache | Fair-share batch scheduler |
| Configuration | Declarative TOML | slurm.conf + sbatch flags |
| DAG model | Three-level (Rule → Job → Exec) | Flat job queue + dependencies |
| Cache | blake3 content hashing | None (execution-only) |
| Scheduling | Topological + priorities + gates | Backfill + fair-share + QOS |
| State | Persistent (state.db, cache) | Transient (job lifetime) |
| Data model | Shared filesystem + optional object store | Shared filesystem only |
SLURM vs Ray: When to Use Which
| Dimension | SLURM | Ray |
|---|---|---|
| Target | HPC clusters (static allocation) | Cloud/elastic clusters |
| Submission | sbatch (CLI) or slurmrestd (REST) | Ray Jobs API (HTTP) |
| Scheduling | Fair-share + backfill + QOS | First-come + autoscaler |
| GPU support | GRES (--gres=gpu:2) | First-class (num_gpus=0.5) |
| Data passing | Shared filesystem only | Object store (zero-copy) |
| Job arrays | Native (--array=0-N) | N/A (individual tasks) |
| Latency | 1–5s per submission | ~100ms per submission |
| Scaling model | Fixed cluster, admin-managed | Elastic, autoscaler |
| Multi-user | Fair-share, preemption, QOS | Single-tenant by default |
| Best for | Batch HPC, multi-user clusters, GPU scheduling | ML pipelines, interactive, cloud-native |
Rule of thumb: Use SLURM when you have a shared HPC cluster with existing SLURM infrastructure. Use Ray when you need elastic scaling, fast job turnaround, or in-memory data passing between tasks.
Why Not Snakemake + SLURM?
Snakemake also integrates with SLURM, but with important differences:
Snakemake: Manages the DAG from a long-running process. Submits jobs one at a time as dependencies complete. Uses file timestamps for caching. Cannot do task fusion or materialization elimination.
OxyMake: Submits the entire dependency chain up front via
--dependency=afterok. SLURM sees the full picture and can backfill
more aggressively. Uses content hashes (not timestamps) for caching.
Optimization passes (fusion, materialization elimination) reduce the
number of jobs before submission.
Quick Start
1. Configure the executor
# Oxymakefile.toml
[executor.slurm]
partition = "normal"
staging_dir = "/scratch/$USER/oxymake"
2. Run your workflow on SLURM
ox run --executor slurm
OxyMake handles caching, DAG optimization, and sbatch generation. SLURM handles task placement, resource allocation, and scheduling. Your workflow file does not change.
3. Monitor execution
ox status # OxyMake's view (aggregated)
squeue -u $USER # SLURM's view (per-job)
sacct -j <id> --format=... # Detailed job accounting
4. Run the demo (Docker)
# Build OxyMake
cargo build --bin ox
# Start the test cluster and run the full demo
just demo-slurm
# Or manually:
bash tests/slurm-docker/run-demo.sh
Further Reading
- Executors -- all available executors and configuration
- Execution Modes -- shell, run, script, call
- The Three Graphs -- RuleGraph, JobGraph, ExecGraph
- OxyMake × Ray Deep Dive -- the Ray executor
- Content-Addressable Cache -- how cache keys work
Idempotent Execution
If you have used Terraform, you already understand OxyMake's execution
model. ox run does not mean "launch these jobs." It means "ensure
these outputs exist."
This is a fundamental design choice that affects everything from how you think about running workflows to how multiple people can work on the same pipeline simultaneously.
The Convergent Model
When you run ox run, OxyMake looks at each job in the requested subgraph
and makes a decision:
| Current state | What OxyMake does |
|---|---|
| Output exists and inputs haven't changed | Skip -- nothing to do |
| Job is already running (another session) | Attach -- wait for it, don't re-launch |
| Job is pending and unclaimed | Claim and execute |
| Job failed in a previous run | Re-execute |
The result: running the same command twice does nothing extra. Running it while another instance is already working cooperates instead of conflicting.
ox run --rule '/human/' # Launches the human-cohort jobs
ox run --rule '/human/' # all skipped (cached), nothing re-runs
ox run --rule '/human/' # (while first is running) attaches to running jobs
ox run # Launches yeast+mouse, attaches to human
The Terraform Analogy
If you are familiar with infrastructure-as-code tools, the mapping is direct:
| Terraform | OxyMake | Meaning |
|---|---|---|
| terraform plan | ox plan | Show what would happen |
| terraform apply | ox run | Make it so |
| terraform destroy | ox invalidate | Undo it |
Just as terraform apply creates only the resources that don't already
exist, ox run executes only the jobs whose outputs are missing or stale.
Cooperative Sessions
The most powerful consequence of idempotent execution is that multiple
ox run processes can work on the same project simultaneously, without
conflicts.
How It Works
OxyMake uses SQLite (WAL mode) as a coordination layer. When a session wants to execute a job, it claims it atomically:
UPDATE jobs SET status = 'running', session_id = ?, locked_by = ?
WHERE id = ? AND status = 'pending';
If another session already claimed the job (0 rows affected), the current session either waits for it (if it needs the output) or moves on to other work.
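The claim can be tried out with Python's built-in sqlite3 module. This is a self-contained sketch: the jobs table layout beyond the four columns named in the UPDATE statement, the try_claim helper, and the owner strings are assumptions, and an in-memory database stands in for state.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT, "
    "session_id INTEGER, locked_by TEXT)"
)
conn.execute("INSERT INTO jobs VALUES ('align-A', 'pending', NULL, NULL)")
conn.commit()

def try_claim(conn, job_id, session_id, owner):
    """Return True if this session won the claim, False if the job was taken."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'running', session_id = ?, locked_by = ? "
        "WHERE id = ? AND status = 'pending'",
        (session_id, owner, job_id),
    )
    conn.commit()
    # Exactly one updated row means the job was still pending when we arrived.
    return cur.rowcount == 1

first = try_claim(conn, "align-A", 1, "host-a:4242")
second = try_claim(conn, "align-A", 2, "host-b:5151")
print(first, second)
```

The status = 'pending' condition in the WHERE clause is what makes the claim safe: the check and the update happen in a single statement, so two sessions cannot both see the job as unclaimed.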
Example: Two Terminals
# Terminal 1: start the human pipeline
ox run --rule '/human/'
# Session 1: 2,100 jobs to run
# Terminal 2 (while T1 is running): start the mouse pipeline
ox run --rule '/mouse/'
# Session 2: 3,423 jobs to run. 0 conflicts with session 1.
# Terminal 3: run everything
ox run
# Session 3: 10,247 total jobs
# 2,100 running (human, session 1) — attaching
# 3,423 running (mouse, session 2) — attaching
# 1,312 cached (completed by sessions 1+2) — skipping
# 3,412 to run (yeast + remaining) — executing
Session 3 does not duplicate work. It attaches to what sessions 1 and 2 are already doing, skips what they have finished, and picks up the rest.
Stale Session Recovery
If a session crashes (power failure, OOM kill), its jobs are not stuck
forever. Each session sends a heartbeat every few seconds. If the heartbeat
is older than 2 minutes, the session is considered dead, and its running
jobs are reset to pending for other sessions to claim.
No manual cleanup required.
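The recovery rule can be sketched with plain data structures. The function names, the dictionary layout, and the use of seconds-since-epoch timestamps are assumptions for illustration; only the 2-minute threshold and the reset to pending come from the description above.

```python
def stale_sessions(heartbeats, now, timeout=120.0):
    """Return IDs of sessions whose last heartbeat is older than `timeout` seconds."""
    return sorted(sid for sid, last in heartbeats.items() if now - last > timeout)

def recover(jobs, dead):
    """Reset running jobs owned by dead sessions to pending; return how many changed."""
    count = 0
    for job in jobs:
        if job["status"] == "running" and job["session_id"] in dead:
            job["status"] = "pending"
            job["session_id"] = None
            count += 1
    return count

heartbeats = {1: 1000.0, 2: 1115.0}   # session -> last heartbeat timestamp
jobs = [
    {"id": "align-A", "status": "running", "session_id": 1},
    {"id": "align-B", "status": "running", "session_id": 2},
    {"id": "align-C", "status": "done", "session_id": 1},
]
dead = stale_sessions(heartbeats, now=1125.0)
print(dead, recover(jobs, dead))
```

Completed jobs are left untouched, so work finished by a crashed session is not repeated.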
The Lifecycle Commands
The convergent model needs symmetric operations. OxyMake provides five commands that form a complete algebra of workflow control:
| Command | Meaning | Analogy |
|---|---|---|
| ox run | Ensure outputs exist | terraform apply |
| ox cancel | Stop pursuing outputs | Ctrl+C with precision |
| ox invalidate | Forget outputs exist | make clean with precision |
| ox plan | Show what would happen | terraform plan |
| ox status | Show what is happening | kubectl get pods |
Cancel
ox cancel --where cohort=human # Stop human jobs
ox cancel --rule call # Stop all variant calls
ox cancel --session 2 # Stop everything session 2 is doing
ox cancel # Stop everything
Canceled jobs have their partial outputs deleted and their status reset
to pending. The next ox run will re-execute them.
Invalidate
ox invalidate --rule call # Delete variant-call outputs + cache entries
ox invalidate --rule call --cascade # + all downstream outputs
ox invalidate --since "2026-03-22" # Everything computed after this date
ox invalidate --run 3 # Everything from run #3
The --cascade flag is important: invalidating a feature rule without
cascade leaves stale calls that depend on the old feature values.
With --cascade, OxyMake traverses the DAG forward and invalidates
everything downstream.
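The forward traversal is an ordinary breadth-first search over dependent edges. The sketch below uses a hypothetical rule graph and a hypothetical downstream helper to show which rules a cascade would reach.

```python
from collections import deque

def downstream(dependents, roots):
    """Return every job reachable from `roots` by following dependent edges."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        job = queue.popleft()
        for child in dependents.get(job, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical graph: each key lists the rules that consume its outputs.
dependents = {
    "align": ["call"],
    "call": ["annotate", "stats"],
    "annotate": ["report"],
    "stats": ["report"],
}
print(sorted(downstream(dependents, ["call"])))
```

Invalidating call with cascade reaches annotate, stats, and report but leaves the upstream align outputs in place.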
Why This Matters
The idempotent execution model means:
- No accidental double-execution. Two people running the same command cooperate instead of conflicting.
- Fearless re-running. You can always run ox run again. If everything is up to date, it finishes instantly.
- Incremental by nature. Add new rules, change parameters, re-run. Only the affected subgraph recomputes.
- Crash-resilient. Completed work survives process death. Just re-run.
- Observable. ox status shows exactly what is happening across all sessions.
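One way cooperating sessions can avoid double-execution is a conditional update on a shared state table: a job is claimed only if it is still pending. The sketch below uses Python's `sqlite3` with a hypothetical `jobs` schema. It illustrates the pattern, and is not a description of how OxyMake's state database is actually laid out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT, session INTEGER)")
conn.execute("INSERT INTO jobs VALUES ('align-A', 'pending', NULL)")
conn.commit()


def try_claim(conn, job_id, session):
    """Claim a job only if it is still pending; True when this session won."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'running', session = ? "
        "WHERE id = ? AND status = 'pending'",
        (session, job_id),
    )
    conn.commit()
    return cur.rowcount == 1


first = try_claim(conn, "align-A", session=1)
second = try_claim(conn, "align-A", session=2)
owner = conn.execute("SELECT session FROM jobs WHERE id = 'align-A'").fetchone()[0]
print(first, second, owner)  # True False 1
```

The second session's update matches zero rows, so it moves on to other work instead of running the same job again.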
Crate Graph — How OxyMake Fits Together
A first-time contributor clones two dozen ox-* crates and needs a mental model before reading any code. This page is that model: which crate does what, which depends on which, and the one rule that keeps the whole thing legible.
If you only remember one sentence, remember this:
ox-core takes no ox-* dependency. Every other crate points inward, toward ox-core. Nothing points back out.
That is the textbook hexagonal
(ports-and-adapters) shape. ox-core is the domain. The crates around it are
either supporting domain libraries, driven adapters (things the engine
calls — executors, storage, reports), driving adapters (things that call the
engine — the CLI, the MCP server), or the composition layer that wires them
together (ox-api).
This is not the same picture as the three-graph data pipeline
(RuleGraph → JobGraph → ExecGraph) in The Three Graphs.
That describes how a workflow is resolved at runtime. This page describes how
the code is layered. Newcomers routinely conflate the two — they are
orthogonal.
The shape
graph TB
subgraph driving["Driving adapters — entrypoints (call the engine)"]
cli["ox-cli<br/>the ox binary"]
mcp["ox-mcp<br/>MCP server for agents"]
end
subgraph app["Composition layer (wires the engine together)"]
api["ox-api<br/>embeddable Rust facade"]
end
subgraph support["Supporting domain libraries (depend only on ox-core)"]
format["ox-format"]
state["ox-state"]
cache["ox-cache"]
plan["ox-plan"]
codec["ox-codec-core"]
lock["ox-lock"]
end
core(["ox-core<br/>domain core — ZERO ox-* deps"])
subgraph driven["Driven adapters (the engine calls them)"]
execlocal["ox-exec-local"]
execray["ox-exec-ray"]
execslurm["ox-exec-slurm"]
envsys["ox-env-system"]
envuv["ox-env-uv"]
storage["ox-storage-local"]
repjson["ox-report-json"]
repterm["ox-report-term"]
render["ox-render"]
translate["ox-translate"]
dashboard["ox-dashboard"]
tui["ox-monitor-tui"]
end
cli --> api
cli -->|"+ every driven adapter (see table)"| driven
mcp --> core
mcp --> format
mcp --> state
mcp --> cache
mcp --> plan
api --> core
api --> format
api --> state
api --> cache
api --> plan
format --> core
state --> core
cache --> core
plan --> core
codec --> core
lock --> core
execlocal --> codec
execlocal --> core
execray --> codec
execray --> core
execslurm --> core
envsys --> core
envuv --> core
storage --> core
repjson --> core
repterm --> render
repterm --> core
translate --> format
translate --> core
dashboard --> state
tui --> state
classDef hub fill:#1f6feb,color:#fff,stroke:#0b3a8c,stroke-width:2px;
class core hub;
Every arrow A --> B means "crate A depends on crate B". They all flow
inward. ox-core has no outgoing ox-* arrow — that is the load-bearing
invariant. (ox-render also has no ox-* dependency: it is a pure
terminal-styling leaf that ox-report-term builds on, not a second hub.)
Roles, one line each
ox-core is the hub; the rest are grouped by their architectural role.
The hub
| Crate | Role |
|---|---|
| ox-core | Core types, the DAG, the scheduler, and the traits (Storage, Executor, FormatCodec, …) every adapter implements. Zero ox-* dependencies. |
Supporting domain libraries (depend only on ox-core)
| Crate | Role |
|---|---|
| ox-format | Parse and serialize the Oxymakefile.toml surface. |
| ox-state | Run-state persistence (the SQLite state.db). |
| ox-cache | Content-addressable output cache. |
| ox-plan | Optimization passes on the JobGraph: pruning, merging, scheduling hints. |
| ox-codec-core | The FormatCodec trait and built-in codecs (JSON, CSV, Parquet) for in-memory data passing between jobs. |
| ox-lock | The reproducibility lockfile (ox.lock); captures exact workflow state for drift detection. |
Composition layer
| Crate | Role |
|---|---|
| ox-api | The public, embeddable Rust facade. Composes ox-core + ox-format + ox-state + ox-cache + ox-plan into the engine. The single entry point for embedding OxyMake. |
Driving adapters (entrypoints — they call the engine)
| Crate | Role |
|---|---|
| ox-cli | The ox binary. Depends on 21 of the 24 ox-* crates (ox-api plus every supporting library and driven adapter): everything except itself and the two not-yet-wired crates below. It is the shell that assembles the whole engine. |
| ox-mcp | Model Context Protocol server for AI agents. Composes the same inner crates as ox-api (it does not go through ox-api). |
Driven adapters (the engine calls them — each implements an ox-core trait)
| Crate | Role |
|---|---|
| ox-exec-local | Local-process executor. |
| ox-exec-ray | Ray-cluster executor (uses ox-codec-core for data passing). |
| ox-exec-slurm | SLURM executor. |
| ox-env-system | System/host environment provider. |
| ox-env-uv | uv-managed per-rule Python virtualenvs. |
| ox-storage-local | Local-filesystem Storage implementation. |
| ox-report-json | JSON run reports. |
| ox-report-term | Terminal run reports (builds on ox-render). |
| ox-render | Semantic color roles and terminal styling. No ox-* deps. |
| ox-translate | Translate foreign formats (Snakemake, WDL) ↔ Oxymakefile.toml (uses ox-format). |
| ox-dashboard | Web dashboard backend (reads ox-state). |
| ox-monitor-tui | TUI live monitor (reads ox-state). |
Not yet wired into the ox binary
These crates compile and depend only inward, but no entrypoint consumes them yet. They are staged for a future release, not dead code.
| Crate | Role |
|---|---|
| ox-metrics | Prometheus metrics export over ox-state. |
| ox-cache-remote | Remote cache backends (S3, GCS, local directory) for sharing artifacts across machines. |
Outside the engine graph
| Crate | Role |
|---|---|
| oxymake | Name-reservation crate on crates.io. It is the one publishable crate; the real engine ships as the ox binary via GitHub Releases. Not part of the dependency graph. |
The exact edges (verified against cargo tree)
The table below is the authoritative ox-* → ox-* edge list. It is generated
from each crate's [dependencies] and matches
cargo tree -e no-dev --workspace. The diagram above shows the shape; this
table is the ground truth. If you change an inter-crate dependency, update
this table (and re-confirm the inward-pointing rule).
| Crate | Depends on (ox-* only) |
|---|---|
| ox-core | (none; the hub) |
| ox-render | (none) |
| ox-format | ox-core |
| ox-state | ox-core |
| ox-cache | ox-core |
| ox-cache-remote | ox-core |
| ox-plan | ox-core |
| ox-codec-core | ox-core |
| ox-lock | ox-core |
| ox-env-system | ox-core |
| ox-env-uv | ox-core |
| ox-exec-slurm | ox-core |
| ox-storage-local | ox-core |
| ox-report-json | ox-core |
| ox-exec-local | ox-codec-core, ox-core |
| ox-exec-ray | ox-codec-core, ox-core |
| ox-report-term | ox-core, ox-render |
| ox-translate | ox-core, ox-format |
| ox-dashboard | ox-core, ox-state |
| ox-monitor-tui | ox-core, ox-state |
| ox-metrics | ox-core, ox-state |
| ox-api | ox-core, ox-format, ox-state, ox-cache, ox-plan |
| ox-mcp | ox-core, ox-format, ox-state, ox-cache, ox-plan |
| ox-cli | ox-api + 20 others = 21 of the 24 ox-* crates (all except itself, ox-cache-remote, ox-metrics) |
| oxymake | (name reservation; no ox-* deps) |
To regenerate this view locally:
cargo tree -e no-dev --workspace # full dependency tree
cargo tree -e no-dev -i ox-core # invert: who depends on ox-core (≈ everyone)
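The inward-pointing rule is easy to check mechanically once the edge list is in hand. The sketch below copies a subset of the edge table into a Python dict and verifies two properties: ox-core has no ox-* dependency, and the graph has no cycle. The `check_inward` helper is hypothetical and not part of the repository. A real check would parse `cargo tree` or `cargo metadata` output rather than a hand-written dict.

```python
from graphlib import CycleError, TopologicalSorter

# Subset of the edge table: crate -> ox-* crates it depends on.
deps = {
    "ox-core": set(),
    "ox-render": set(),
    "ox-format": {"ox-core"},
    "ox-state": {"ox-core"},
    "ox-cache": {"ox-core"},
    "ox-plan": {"ox-core"},
    "ox-codec-core": {"ox-core"},
    "ox-exec-local": {"ox-codec-core", "ox-core"},
    "ox-report-term": {"ox-core", "ox-render"},
    "ox-translate": {"ox-core", "ox-format"},
    "ox-api": {"ox-core", "ox-format", "ox-state", "ox-cache", "ox-plan"},
}


def check_inward(deps):
    """Return a list of violations of the inward-pointing rule (empty if none)."""
    problems = []
    if deps.get("ox-core"):
        problems.append("ox-core must not depend on any ox-* crate")
    try:
        # static_order() raises CycleError if any dependency cycle exists.
        list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        problems.append(f"dependency cycle: {err.args[1]}")
    return problems


print(check_inward(deps))  # []

# Simulate a regression: ox-core starts depending on ox-format.
broken = {**deps, "ox-core": {"ox-format"}}
print(len(check_inward(broken)))  # 2
```

The simulated regression trips both checks at once, because an outward edge from the hub necessarily closes a cycle with any crate that already depends on it.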
Why this matters
The inward-pointing rule is what lets you add a new executor, a new storage
backend, or a new report format without touching ox-core — you implement
the relevant ox-core trait in a new ox-exec-* / ox-storage-* /
ox-report-* crate and register it in ox-cli (or ox-api). The domain never
learns about its adapters. That is the whole point of the hexagon, and it is
the project's single best legibility asset.
For the formal boundary between what OxyMake proves and what it assumes of the substrate, see the Boundary — Substrate Axioms note in the repository.
Bioinformatics Pipeline
This cookbook walks through a multi-sample FASTQ-to-BAM-to-VCF variant calling
pipeline in OxyMake. The workflow uses sort, grep, and wc as stand-ins
for real bioinformatics tools (BWA, samtools, GATK), so you can run it on any
machine without installing anything.
The concepts transfer directly to a production pipeline: just swap the shell commands for real tool invocations.
What You Will Learn
- Wildcard-driven sample processing across multiple samples
- Named inputs for rules with multiple input files
- Tags for organizing pipeline stages
- Target-based filtering to run a subset of samples
- --rule filtering to run a subset of stages
The Complete Oxymakefile
Create a directory and save this as Oxymakefile.toml:
ox_version = "0.1"
[config]
samples = ["NA12878", "NA12891", "NA12892"]
chromosomes = ["chr1", "chr2", "chr3"]
# ── Default target ──────────────────────────────────────────────
[rule.all]
input = ["results/cohort_report.txt"]
# ── Stage 1: Generate mock FASTQ reads ─────────────────────────
[rule.simulate_reads]
output = ["fastq/{sample}_R1.fastq", "fastq/{sample}_R2.fastq"]
tags = ["stage.simulate", "fast"]
shell = """
mkdir -p fastq
for i in $(seq 1 50); do
echo "@{sample}_read$i/1 chr$((i % 3 + 1)):$((i * 100))" >> {output[0]}
echo "ACGTACGTACGTACGT" >> {output[0]}
echo "+" >> {output[0]}
echo "IIIIIIIIIIIIIIII" >> {output[0]}
echo "@{sample}_read$i/2 chr$((i % 3 + 1)):$((i * 100))" >> {output[1]}
echo "TGCATGCATGCATGCA" >> {output[1]}
echo "+" >> {output[1]}
echo "IIIIIIIIIIIIIIII" >> {output[1]}
done
"""
# ── Stage 2: Align reads → sorted BAM ──────────────────────────
# Stand-in: sort the FASTQ by read name to simulate alignment + sorting.
[rule.align]
input = { r1 = "fastq/{sample}_R1.fastq", r2 = "fastq/{sample}_R2.fastq" }
output = ["aligned/{sample}.bam"]
tags = ["stage.align", "compute-heavy"]
resources = { cpu = 4, mem = "8G" }
shell = """
mkdir -p aligned
echo "## BAM for {sample}" > {output}
echo "## Aligned from {input.r1} and {input.r2}" >> {output}
cat {input.r1} {input.r2} | grep "^@" | sort >> {output}
echo "## EOF" >> {output}
"""
# ── Stage 3: Call variants per chromosome ───────────────────────
# Stand-in: grep reads matching the chromosome, count them as "variants."
[rule.call_variants]
input = { bam = "aligned/{sample}.bam" }
output = ["vcf/{sample}_{chrom}.vcf"]
tags = ["stage.call", "compute-heavy"]
resources = { cpu = 2, mem = "4G" }
shell = """
mkdir -p vcf
echo "##fileformat=VCFv4.2" > {output}
echo "##source=oxymake-cookbook" >> {output}
echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> {output}
grep "{chrom}" {input.bam} | awk '{{
split($2, a, ":");
printf "%s\t%s\t.\tA\tG\t30\tPASS\tDP=20\\n", a[1], a[2]
}}' >> {output}
"""
# ── Stage 4: Merge per-chromosome VCFs into one per sample ─────
[rule.merge_vcf]
input = ["vcf/{sample}_{chrom}.vcf"]
output = ["vcf/{sample}_merged.vcf"]
tags = ["stage.merge"]
shell = """
echo "##fileformat=VCFv4.2" > {output}
echo "##source=oxymake-merge" >> {output}
echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> {output}
for f in {input}; do
grep -v "^#" "$f" >> {output}
done
sort -k1,1 -k2,2n -o {output} {output}
"""
# ── Stage 5: Per-sample QC report ──────────────────────────────
[rule.qc]
input = { bam = "aligned/{sample}.bam", vcf = "vcf/{sample}_merged.vcf" }
output = ["qc/{sample}_report.txt"]
tags = ["stage.qc", "fast"]
shell = """
mkdir -p qc
echo "=== QC Report: {sample} ===" > {output}
echo "Total reads: $(grep -c "^@" {input.bam})" >> {output}
echo "Variants called: $(grep -vc "^#" {input.vcf})" >> {output}
echo "Chromosomes: $(grep -v "^#" {input.vcf} | cut -f1 | sort -u | tr '\n' ' ')" >> {output}
"""
# ── Stage 6: Cohort report ─────────────────────────────────────
[rule.cohort_report]
input = ["qc/{sample}_report.txt"]
output = ["results/cohort_report.txt"]
tags = ["stage.report"]
shell = """
mkdir -p results
echo "=============================" > {output}
echo " Variant Calling Cohort Report" >> {output}
echo "=============================" >> {output}
echo "" >> {output}
for f in {input}; do
cat "$f" >> {output}
echo "" >> {output}
done
echo "--- Summary ---" >> {output}
echo "Samples processed: $(echo {input} | wc -w | tr -d ' ')" >> {output}
"""
Create the Project
mkdir bioinfo-pipeline && cd bioinfo-pipeline
# Save the Oxymakefile.toml above
ox init # if you want the .oxymake directory pre-created
No input data files are needed -- the simulate_reads rule generates
everything from scratch.
Run the Full Pipeline
ox plan
Plan: 5 rules, 15 jobs, 3 source files
Targets: results/cohort_report.txt
1. [simulate_reads-NA12878] rule=simulate_reads -> [fastq/NA12878_R1.fastq, fastq/NA12878_R2.fastq]
2. [simulate_reads-NA12891] rule=simulate_reads -> [fastq/NA12891_R1.fastq, fastq/NA12891_R2.fastq]
3. [align-NA12878] rule=align -> [aligned/NA12878.bam]
4. [call_variants-NA12878-chr1] rule=call_variants -> [vcf/NA12878_chr1.vcf]
...
15. [cohort_report] rule=cohort_report -> [results/cohort_report.txt]
ox run -j 4
OxyMake runs up to 4 jobs in parallel. The simulate_reads jobs run first
(no dependencies), then align, then call_variants fans out across
samples and chromosomes, and finally everything converges into the cohort
report.
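The fan-out can be pictured as a Cartesian product over the wildcard values. The Python sketch below expands the `call_variants` output pattern over the sample and chromosome lists from `[config]`. The `expand` helper is hypothetical, and the pairing of `{sample}` with `samples` and `{chrom}` with `chromosomes` is an assumption made for illustration.

```python
from itertools import product

config = {
    "samples": ["NA12878", "NA12891", "NA12892"],
    "chromosomes": ["chr1", "chr2", "chr3"],
}


def expand(pattern, **wildcards):
    """Substitute every combination of wildcard values into the pattern."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[n] for n in names))
    ]


vcfs = expand("vcf/{sample}_{chrom}.vcf",
              sample=config["samples"], chrom=config["chromosomes"])
print(len(vcfs))   # 9
print(vcfs[0])     # vcf/NA12878_chr1.vcf
print(vcfs[-1])    # vcf/NA12892_chr3.vcf
```

Three samples times three chromosomes gives nine per-chromosome VCF targets, which is the widest stage of this pipeline.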
Filter by Sample
Run only one sample during development by requesting its leaf target (wildcards in the target select the matching jobs):
ox run "qc/NA12878_report.txt"
This builds the pipeline for NA12878 only, skipping NA12891 and NA12892. Combined with caching, this lets you iterate on pipeline logic without waiting for all samples.
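Requesting a concrete path works because a target can be matched against a rule's output pattern to recover the wildcard values. The following is a rough Python sketch of that matching step. The `match_target` helper is hypothetical, and OxyMake's real matcher may treat separators and ambiguous patterns differently.

```python
import re


def match_target(pattern, target):
    """Bind wildcards by matching a concrete path against an output pattern."""
    regex = ""
    # re.split with a capture group alternates literal text and wildcard names.
    for i, part in enumerate(re.split(r"\{(\w+)\}", pattern)):
        regex += f"(?P<{part}>[^/]+)" if i % 2 else re.escape(part)
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


print(match_target("qc/{sample}_report.txt", "qc/NA12878_report.txt"))
# {'sample': 'NA12878'}
print(match_target("qc/{sample}_report.txt", "vcf/NA12878_merged.vcf"))
# None
```

Once `sample` is bound to `NA12878`, the same binding propagates to the rule's inputs, so only that sample's upstream jobs are needed.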
Later, run the full cohort:
ox run
NA12878 is cached. Only NA12891 and NA12892 are computed.
Filter by Rule
Run only the QC stage with --rule (exact name or /regex/; assumes
upstream outputs exist):
ox run --rule qc
View the DAG grouped by stage:
ox dag --group-by tag
Named Inputs
Several rules use named inputs for clarity. Compare:
# Positional (works but cryptic with multiple inputs)
input = ["aligned/{sample}.bam", "vcf/{sample}_merged.vcf"]
shell = "check {input[0]} {input[1]}"
# Named (self-documenting)
input = { bam = "aligned/{sample}.bam", vcf = "vcf/{sample}_merged.vcf" }
shell = "check {input.bam} {input.vcf}"
Named inputs make your workflow readable as it grows.
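Python's `str.format` happens to accept the same two spellings, which makes the difference easy to try out. This is only an analogy for how the placeholders read, not a claim about how OxyMake renders shell templates.

```python
from types import SimpleNamespace

# Positional inputs: the template indexes into a list.
positional = ["aligned/NA12878.bam", "vcf/NA12878_merged.vcf"]
cmd_positional = "check {input[0]} {input[1]}".format(input=positional)

# Named inputs: the template reads attributes by name.
named = SimpleNamespace(bam="aligned/NA12878.bam", vcf="vcf/NA12878_merged.vcf")
cmd_named = "check {input.bam} {input.vcf}".format(input=named)

print(cmd_positional)  # check aligned/NA12878.bam vcf/NA12878_merged.vcf
print(cmd_named)       # check aligned/NA12878.bam vcf/NA12878_merged.vcf
```

Both forms render the same command. The named form keeps working if the inputs are reordered, while the positional form silently swaps its arguments.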
Adding a New Sample
Edit Oxymakefile.toml:
[config]
samples = ["NA12878", "NA12891", "NA12892", "NA12893"]
Run again:
ox run -j 4
Only the NA12893 jobs run. Everything else is cached.
Adapting to Real Tools
Replace the stand-in commands with real bioinformatics tools:
[rule.align]
input = { r1 = "fastq/{sample}_R1.fastq", r2 = "fastq/{sample}_R2.fastq" }
output = ["aligned/{sample}.bam"]
tags = ["stage.align", "compute-heavy"]
resources = { cpu = 8, mem = "32G" }
shell = """
bwa mem -t {resources.cpu} reference.fa {input.r1} {input.r2} \
| samtools sort -@ 4 -o {output}
samtools index {output}
"""
The workflow structure stays the same. Only the shell commands change.
Next Steps
- Rules and Wildcards -- wildcard expansion and constraints
- Tags and Filtering -- organizing large workflows
- Idempotent Execution -- cooperative multi-session runs
Climate Time-Series Pipeline
This cookbook builds a multi-station climate analysis pipeline in OxyMake. It covers feature engineering, index generation, and regional aggregation across a network of weather stations -- all driven by wildcards, snapshots, and execution history. Mock data (random readings) keeps the example self-contained.
What You Will Learn
- Config-driven station network and parameter sweeps
- Wildcard expansion across stations, features, and rolling windows
- Named inputs for multi-file rules
- Snapshots to compare analysis milestones
- Execution history as a lightweight lab notebook
- Tag-based filtering for fast iteration
The Complete Oxymakefile
Create a directory and save this as Oxymakefile.toml:
ox_version = "0.1"
[config]
stations = ["BOS", "DEN", "SEA", "AUS", "PDX"]
windows = [5, 10, 20, 60]
metric = ["trend", "anomaly"]
# ── Default target ──────────────────────────────────────────────
[rule.all]
input = ["reports/network_summary.txt"]
# ── Stage 1: Generate mock temperature readings ────────────────
[rule.mock_readings]
output = ["data/readings/{station}.csv"]
tags = { stage = "data", speed = "fast" }
shell = """
mkdir -p data/readings
echo "date,temp" > {output}
temp=15
for day in $(seq 1 252); do
# Random daily temperature delta between -3 and +3 degrees
seed=$((day * 17 + $(echo {station} | cksum | cut -d' ' -f1)))
d=$(awk -v seed="$seed" 'BEGIN {{srand(seed); printf "%.4f", (rand() - 0.5) * 6}}')
temp=$(awk -v t="$temp" -v d="$d" 'BEGIN {{printf "%.2f", t + d}}')
printf "2025-%03d,%s\\n" "$day" "$temp" >> {output}
done
"""
# ── Stage 2: Compute features ─────────────────────────────────
[rule.features]
input = { readings = "data/readings/{station}.csv" }
output = ["data/features/{station}_{window}d.csv"]
tags = { stage = "features", speed = "fast" }
shell = """
mkdir -p data/features
echo "date,{station}_trend_{window}d,{station}_var_{window}d" > {output}
tail -n +2 {input.readings} | awk -F, -v lb={window} '
BEGIN {{ OFS="," }}
{{
temps[NR] = $2
if (NR >= lb) {{
trend = (temps[NR] - temps[NR - lb + 1]) / lb
sum = 0; sq = 0
for (i = NR - lb + 1; i <= NR; i++) {{
r = temps[i] - temps[i-1]
sum += r; sq += r * r
}}
var = (sq - sum*sum/lb) / (lb - 1)
printf "%s,%.6f,%.6f\\n", $1, trend, var
}}
}}
' >> {output}
"""
# ── Stage 3: Generate indices ─────────────────────────────────
[rule.indices]
input = ["data/features/{station}_{window}d.csv"]
output = ["data/indices/{station}_{metric}.csv"]
tags = { stage = "indices" }
shell = """
mkdir -p data/indices
echo "date,{station}_{metric}" > {output}
if [ "{metric}" = "trend" ]; then
# Average trend across windows → warming vs cooling stations
paste -d, data/features/{station}_*d.csv \
| tail -n +2 \
| awk -F, '{{ sum=0; n=0; for(i=2;i<=NF;i+=2){{ sum+=$i; n++ }}; if(n>0) printf "%s,%.6f\\n",$1,sum/n }}' \
>> {output}
else
# Anomaly: deviation from the mean trend
paste -d, data/features/{station}_*d.csv \
| tail -n +2 \
| awk -F, '{{ sum=0; n=0; for(i=2;i<=NF;i+=2){{ sum+=$i; n++ }}; if(n>0) printf "%s,%.6f\\n",$1,-sum/n }}' \
>> {output}
fi
"""
# ── Stage 4: Cross-station composite index ────────────────────
[rule.composite]
input = ["data/indices/{station}_{metric}.csv"]
output = ["data/composite/{metric}_index.csv"]
tags = { stage = "composite" }
shell = """
mkdir -p data/composite
echo "date,station,weight" > {output}
# Rank-based regional index: center station values cross-sectionally to zero
paste -d, data/indices/*_{metric}.csv \
| tail -n +2 \
| awk -F, '
BEGIN {{ split("{station}", stations, " ") }}
{{
n = 0; sum = 0
for (i = 2; i <= NF; i += 2) {{ vals[++n] = $i; sum += $i }}
mean = sum / n
wsum = 0
for (i = 1; i <= n; i++) {{ w[i] = vals[i] - mean; wsum += (w[i]>0?w[i]:-w[i]) }}
if (wsum > 0) for (i = 1; i <= n; i++) w[i] /= wsum
for (i = 1; i <= n; i++) printf "%s,%s,%.6f\\n", $1, stations[i], w[i]
}}
' >> {output}
"""
# ── Stage 5: Cumulative index score ──────────────────────────
[rule.score]
input = { weights = "data/composite/{metric}_index.csv", readings = "data/readings/{station}.csv" }
output = ["data/score/{metric}_score.csv"]
tags = { stage = "score" }
shell = """
mkdir -p data/score
echo "date,daily_index,cumulative_index" > {output}
# Simple: weight * daily reading, summed across stations
awk -F, '
NR == FNR && FNR > 1 {{ weights[$1,$2] = $3; next }}
FNR > 1 {{ readings[$1] = $2 }}
' {input.weights} data/readings/*.csv
# Simplified: accumulate a weighted daily index
tail -n +2 {input.weights} | awk -F, '
{{ idx[$1] += $3 * (rand() - 0.48) * 0.02 }}
END {{
cum = 0
n = asorti(idx, dates)
for (i = 1; i <= n; i++) {{
cum += idx[dates[i]]
printf "%s,%.6f,%.6f\\n", dates[i], idx[dates[i]], cum
}}
}}
' >> {output}
"""
# ── Stage 6: Summary report ──────────────────────────────────
[rule.report]
input = ["data/score/{metric}_score.csv"]
output = ["reports/network_summary.txt"]
tags = { stage = "report", speed = "fast" }
shell = """
mkdir -p reports
echo "======================================" > {output}
echo " Climate Network Pipeline — Summary" >> {output}
echo "======================================" >> {output}
echo "" >> {output}
echo "Network: {station}" >> {output}
echo "Windows: {window}" >> {output}
echo "Metrics: {metric}" >> {output}
echo "" >> {output}
for f in {input}; do
index=$(basename "$f" _score.csv)
lines=$(tail -n +2 "$f" | wc -l | tr -d ' ')
final=$(tail -1 "$f" | cut -d, -f3)
echo "Index: $index" >> {output}
echo " Observation days: $lines" >> {output}
echo " Final cumulative index: $final" >> {output}
echo "" >> {output}
done
echo "--- Pipeline complete ---" >> {output}
"""
Create the Project
mkdir climate-pipeline && cd climate-pipeline
# Save the Oxymakefile.toml above
No input data files are needed -- mock_readings generates synthetic data.
Explore the DAG
ox plan
Plan: 6 rules, 42 jobs, 5 source files
Targets: reports/network_summary.txt
1. [mock_readings-BOS] rule=mock_readings -> [data/readings/BOS.csv]
2. [features-BOS-5d] rule=features -> [data/features/BOS_5d.csv]
3. [features-DEN-5d] rule=features -> [data/features/DEN_5d.csv]
...
40. [composite-trend] rule=composite -> [data/composite/trend_index.csv]
41. [score-trend] rule=score -> [data/score/trend_score.csv]
42. [report] rule=report -> [reports/network_summary.txt]
The DAG fans out across stations and windows, then converges through indices and regional aggregation into a single report.
Run the Full Pipeline
ox run -j 4
OxyMake runs up to 4 jobs in parallel. The mock_readings jobs run first (no
dependencies), then features fans out across stations x windows, and
everything converges into the network report.
Iterate on a Single Station
During development, focus on one station by requesting its leaf target (wildcards in the target select the matching jobs):
ox run "data/indices/BOS_*.csv"
This builds the pipeline for BOS only. Later, run the full network:
ox run
BOS is cached. Only the remaining stations are computed.
Filter by Rule
Run only the feature computation stage with --rule (exact name or
/regex/):
ox run --rule features
Snapshots: Compare Analysis Milestones
After a successful run, save a snapshot:
ox snapshot create baseline --message "5-station trend + anomaly"
Now add a new window (120 days) and a new metric. Edit the config:
[config]
windows = [5, 10, 20, 60, 120]
metric = ["trend", "anomaly", "seasonal"]
Run again and save another snapshot:
ox run -j 4
ox snapshot create v2 --message "Added 120d window + seasonal metric"
Compare the two milestones:
ox snapshot diff baseline v2
Workflow hash changed (config modified)
Added: 15 jobs (features/*_120d, indices/*_seasonal, ...)
Changed: 2 jobs (composite, report — new inputs)
Unchanged: 40 jobs
This tells you exactly what changed between analysis iterations without manually tracking file modifications.
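Conceptually, a snapshot diff compares two maps from job identity to a content hash. The sketch below shows that classification in Python with made-up job names and hash values. The `diff_snapshots` helper and the snapshot layout are assumptions for illustration, not OxyMake's on-disk format.

```python
def diff_snapshots(old, new):
    """Classify jobs by comparing per-job content hashes in two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(j for j in old.keys() & new.keys() if old[j] != new[j]),
        "unchanged": sorted(j for j in old.keys() & new.keys() if old[j] == new[j]),
    }


baseline = {"features-BOS-5d": "a1", "composite-trend": "b2", "report": "c3"}
v2 = {"features-BOS-5d": "a1", "features-BOS-120d": "d4",
      "composite-trend": "e5", "report": "f6"}

result = diff_snapshots(baseline, v2)
print(result["added"])      # ['features-BOS-120d']
print(result["changed"])    # ['composite-trend', 'report']
print(result["unchanged"])  # ['features-BOS-5d']
```

A job counts as changed when it exists in both snapshots but its hash differs, for example because it gained new inputs.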
Execution History as a Lab Notebook
Each ox run is recorded with timing, job counts, and optional notes:
ox run -j 4 --note "Baseline: 5 stations, 4 windows"
# ... iterate ...
ox run -j 4 --note "Added seasonal metric, 120d window"
Review your analysis timeline:
ox history
RUN STARTED DURATION OK FAIL SKIP NOTE
run-a1b2c3 2025-01-15 09:12 12.3s 42 0 0 Baseline: 5 stations, 4 windows
run-d4e5f6 2025-01-15 09:45 4.1s 15 0 40 Added seasonal metric, 120d window
Drill into a specific run:
ox history --run-id run-a1b2c3
This shows per-job wall time, peak memory, and exit codes -- useful for identifying bottlenecks as your network grows.
Scaling the Network
Add more stations by editing [config]:
[config]
stations = ["BOS", "DEN", "SEA", "AUS", "PDX", "ORD", "ATL", "LAX", "JFK", "MIA"]
Run again:
ox run -j 8
Only the new stations are computed. Everything else is cached. As the network grows from 5 to 50 to 500 stations, the same Oxymakefile works -- OxyMake expands the wildcards and parallelizes automatically.
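The reason existing stations are skipped is that a cache lookup depends on content, not on timestamps. A minimal Python sketch of a content-derived key follows. The exact ingredients of OxyMake's cache key are not specified here, so the `cache_key` function, the choice of SHA-256, and the placeholder command `compute-features` are all assumptions.

```python
import hashlib


def cache_key(shell, input_contents):
    """Hash the rendered command plus the bytes of every input, in a fixed order."""
    h = hashlib.sha256()
    h.update(shell.encode())
    for name in sorted(input_contents):
        h.update(b"\0" + name.encode() + b"\0")
        h.update(hashlib.sha256(input_contents[name]).digest())
    return h.hexdigest()


cmd = "compute-features data/readings/BOS.csv"
readings = {"data/readings/BOS.csv": b"date,temp\n2025-001,15.20\n"}

k1 = cache_key(cmd, readings)
k2 = cache_key(cmd, dict(readings))  # identical bytes, e.g. after a file copy
k3 = cache_key(cmd, {"data/readings/BOS.csv": b"date,temp\n2025-001,16.00\n"})
print(k1 == k2, k1 == k3)  # True False
```

Adding stations to `[config]` creates jobs with new keys, while the keys of existing jobs stay the same, so their cached outputs are reused.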
Next Steps
- Growing a Workflow Organically -- evolving from 3 rules to 300+ over weeks of analysis
- Agent-Driven Workflows -- automating the pipeline with LLM agents and NDJSON event streams
- Rules and Wildcards -- wildcard expansion and constraints
- Snapshots -- saving and comparing workflow state
ML Training Pipeline
Coming soon.
This page will show a machine learning training pipeline in OxyMake, covering:
- Data preparation: feature extraction, train/test splitting, and normalization
- Hyperparameter sweeps: wildcard-driven grid search across learning rates, architectures, and regularization parameters
- GPU resource management: declaring GPU requirements per rule for SLURM/Kubernetes scheduling
- Model evaluation: automated metric collection and comparison
- In-memory passing: using call mode with Arrow IPC to pass DataFrames between feature computation and training without disk I/O
Growing a Workflow Organically
Coming soon.
This page will illustrate how real research workflows evolve over time, covering:
- Starting small: a 3-rule exploratory workflow
- Adding complexity incrementally: new rules never invalidate existing cached results thanks to content-addressable caching
- Workflow composition: splitting large workflows across files with include directives
- Snapshot milestones: saving and comparing workflow states as research progresses
- Run annotations: using ox run --note to create a lightweight research lab notebook from the execution history
Agent-Driven Workflows
Coming soon.
This page will demonstrate how AI agents can drive OxyMake pipelines programmatically, covering:
- Structured NDJSON events: parsing --json output for typed event streams
- Programmatic gate approval: agents evaluating metrics and approving quality checkpoints via ox gate approve
- Automated error recovery: detecting failures from JSON events, adjusting parameters, and retrying
- Multi-agent coordination: multiple agents driving different stages of a pipeline
- End-to-end example: a complete pipeline driven by an LLM agent without human intervention
Cloud HPC with SLURM
OxyMake's SLURM executor targets any SLURM cluster — on-prem, academic (Grid'5000, Jean Zay), or cloud. This guide works one concrete cloud example end-to-end: a Google Cloud cluster provisioned with the HPC Toolkit. The same shape applies to AWS ParallelCluster, Azure CycleCloud, or any managed SLURM-on-cloud offering — only the provisioning commands change; the OxyMake profile and run loop are identical. It covers cluster provisioning, profile configuration, SSH tunneling for remote access, and running pipelines end-to-end.
Prerequisites
- A GCP project with billing enabled
- gcloud CLI installed and authenticated (gcloud auth login)
- Terraform >= 1.3
- The Cloud HPC Toolkit (ghpc CLI)
Cluster Architecture
The HPC Toolkit deploys a standard SLURM cluster on GCP:
graph TD
subgraph VPC["GCP VPC"]
Login["Login Node<br/>(SSH entry)"] --> Controller["Controller<br/>(slurmctld)"]
Controller --> C0["c2-0 node"]
Controller --> C1["c2-1 node"]
Controller --> CN["c2-N node"]
NFS["Filestore (NFS) — /mnt/shared"]
end
Key points:
- Controller node runs slurmctld and schedules jobs
- Compute nodes auto-scale: they spin up when jobs are queued and shut down when idle
- Filestore provides the shared NFS filesystem required by OxyMake's SLURM executor
- Login node is your SSH entry point for running ox run
Step 1: Provision the Cluster
Create the blueprint
Create a file oxymake-cluster.yaml:
# oxymake-cluster.yaml — HPC Toolkit blueprint
blueprint_name: oxymake-slurm
vars:
project_id: YOUR_PROJECT_ID
deployment_name: oxymake-slurm
region: us-central1
zone: us-central1-a
deployment_groups:
- group: primary
modules:
# Shared filesystem (required by OxyMake SLURM executor)
- id: homefs
source: modules/file-system/filestore
settings:
local_mount: /mnt/shared
size_gb: 1024
# Network
- id: network
source: modules/network/vpc
# SLURM partition — general-purpose compute
- id: compute_partition
source: community/modules/compute/schedmd-slurm-gcp-v6-partition
use: [network, homefs]
settings:
partition_name: batch
machine_type: c2-standard-8 # 8 vCPU, 32 GB
max_count: 10 # Auto-scales 0 → 10 nodes
enable_placement: false
# GPU partition (optional)
- id: gpu_partition
source: community/modules/compute/schedmd-slurm-gcp-v6-partition
use: [network, homefs]
settings:
partition_name: gpu
machine_type: a2-highgpu-1g # 1× A100
max_count: 4
enable_placement: false
# SLURM controller + login node
- id: slurm_controller
source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
use: [network, compute_partition, gpu_partition]
settings:
login_node_count: 1
# Login node
- id: slurm_login
source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
use: [network, slurm_controller]
settings:
machine_type: e2-standard-4
Deploy
# Generate Terraform from the blueprint
ghpc create oxymake-cluster.yaml
# Deploy
ghpc deploy oxymake-slurm
# Wait for the cluster to be ready (~5 minutes)
gcloud compute ssh oxymake-slurm-login0 --zone us-central1-a -- sinfo
You should see the batch and gpu partitions in the output.
Step 2: Configure the OxyMake Profile
Add a [profile.gcloud] section to your Oxymakefile.toml:
[profile.gcloud]
executor = "slurm"
partition = "batch"
account = "default"
jobs = 100 # SLURM handles scheduling; allow many concurrent
keep_going = true # Don't abort the full DAG on a single failure
[profile.gcloud-gpu]
executor = "slurm"
partition = "gpu"
account = "default"
jobs = 20
Run with the profile:
ox run --profile gcloud
ox run --profile gcloud-gpu # For GPU workloads
Profile fields map to SLURM flags:
| Profile field | SLURM flag | Notes |
|---|---|---|
| executor | -- | Selects the SLURM backend |
| partition | --partition | Target partition (batch, gpu) |
| account | --account | Billing/fairshare account |
| qos | --qos | Quality of service tier |
| jobs | -- | OxyMake concurrency (not SLURM's) |
CLI flags always override profile values: ox run --profile gcloud --partition gpu
overrides the partition from batch to gpu.
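The precedence rule and the field-to-flag mapping can be modeled together in a few lines of Python. The `resolve` and `sbatch_args` helpers are hypothetical and only mirror the table and the override rule described above. How OxyMake actually assembles its submission command is not shown in this guide.

```python
# Profile fields that the table maps to SLURM flags.
SLURM_FLAGS = {"partition": "--partition", "account": "--account", "qos": "--qos"}


def resolve(profile, cli_overrides):
    """Merge settings; CLI values win over profile values when both are set."""
    merged = dict(profile)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


def sbatch_args(settings):
    """Build sbatch flags from resolved settings; executor/jobs are not SLURM flags."""
    return [f"{flag}={settings[field]}"
            for field, flag in SLURM_FLAGS.items() if field in settings]


gcloud = {"executor": "slurm", "partition": "batch", "account": "default", "jobs": 100}
settings = resolve(gcloud, {"partition": "gpu", "qos": None})
print(sbatch_args(settings))  # ['--partition=gpu', '--account=default']
```

The `jobs` and `executor` fields never reach SLURM in this model, matching the `--` entries in the table.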
Step 3: Prepare the Cluster
SSH into the login node and set up OxyMake:
gcloud compute ssh oxymake-slurm-login0 --zone us-central1-a
On the login node:
# Install OxyMake (from prebuilt binary or cargo)
curl -fsSL https://oxymake.noogram.dev/install.sh | sh
# or build from source: cargo install --path crates/ox-cli (inside a clone of the OxyMake repository)
# Clone your workflow into the shared filesystem
cd /mnt/shared
git clone https://github.com/your-org/your-pipeline.git
cd your-pipeline
# Verify SLURM is accessible
sinfo # Should show partitions
ox run --executor slurm --dry-run # Should show the DAG without submitting
Important: Run ox run from the login node (or controller), not from a
compute node. The state.db must be on a local filesystem — /mnt/shared is
NFS, so OxyMake stores state.db in a local directory by default.
Step 4: Run a Pipeline
# Dry run — see what would be submitted
ox run --profile gcloud --dry-run
# Submit the pipeline
ox run --profile gcloud
# Monitor jobs
squeue -u $USER # SLURM's view
ox status # OxyMake's view
On GCP with auto-scaling, compute nodes spin up on demand. The first run may take a few extra minutes while nodes boot. Subsequent runs are faster as nodes remain warm for the configured idle timeout (default: 5 minutes).
SSH Tunnel for Remote Access
When running OxyMake from your local machine (not SSH'd into the cluster),
you can either tunnel to the SLURM CLI tools (Option A/B) or use
REST mode via slurmrestd (Option C).
Option A: SSH ProxyCommand (recommended)
Add to your ~/.ssh/config:
Host oxymake-slurm
HostName <login-node-external-ip>
User your-username
IdentityFile ~/.ssh/google_compute_engine
# Or use gcloud's IAP tunnel:
# ProxyCommand gcloud compute ssh oxymake-slurm-login0 --zone us-central1-a --tunnel-through-iap --plain -- -W %h:%p
Then SSH in and run:
ssh oxymake-slurm "cd /mnt/shared/your-pipeline && ox run --profile gcloud"
Option B: IAP Tunnel (no public IP required)
If your login node has no external IP (common for secure setups), use Identity-Aware Proxy:
# Direct SSH via IAP
gcloud compute ssh oxymake-slurm-login0 \
--zone us-central1-a \
--tunnel-through-iap
# Or set up a SOCKS proxy for port forwarding
gcloud compute ssh oxymake-slurm-login0 \
--zone us-central1-a \
--tunnel-through-iap \
-- -D 1080 -N -f
# Forward the OxyMake dashboard port (assumes ox dashboard --port 8080; the default port is 9876)
gcloud compute ssh oxymake-slurm-login0 \
--zone us-central1-a \
--tunnel-through-iap \
-- -L 8080:localhost:8080 -N -f
Option C: SSH Tunnel for slurmrestd
Forward the slurmrestd port to your workstation and use REST mode:
# Forward slurmrestd (port 6820) to localhost
gcloud compute ssh oxymake-slurm-login0 \
--zone us-central1-a \
--tunnel-through-iap \
-- -L 6820:slurmctld:6820 -N -f
# Run OxyMake in REST mode via the tunnel
ox run --executor slurm --slurm-api http://localhost:6820
Note: REST mode requires slurmrestd to be running on the cluster. Set
SLURM_JWT for JWT authentication if required by your cluster.
Cluster Lifecycle
Scale down
GCP auto-scaling shuts down idle nodes. To force-stop:
# Drain all compute nodes
scontrol update partition=batch state=DRAIN
# Or destroy the cluster entirely
ghpc destroy oxymake-slurm
Cost control
| Resource | Billing | Tip |
|---|---|---|
| Controller | Always on | Use e2-standard-4 (small) |
| Login node | Always on | Use e2-standard-4 (small) |
| Compute nodes | On-demand (auto-scale) | Set max_count conservatively |
| Filestore | Always on (per GB) | Delete when not in use |
For intermittent workloads, consider stopping the controller and login node when not running pipelines:
gcloud compute instances stop oxymake-slurm-controller --zone us-central1-a
gcloud compute instances stop oxymake-slurm-login0 --zone us-central1-a
# Restart when needed:
gcloud compute instances start oxymake-slurm-controller --zone us-central1-a
gcloud compute instances start oxymake-slurm-login0 --zone us-central1-a
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| sinfo shows no nodes | Cluster still provisioning | Wait 5 min, check gcloud compute instances list |
| Jobs stuck in PENDING | No nodes available / auto-scale starting | Wait for nodes to boot; check sinfo -N |
| sbatch: command not found | Not on login/controller node | SSH to the login node first |
| Permission denied on /mnt/shared | Filestore not mounted | Check mount \| grep shared; re-run sudo mount |
| state.db lock error | Running from NFS | Run ox run from local disk on the login node |
| Nodes not auto-scaling | Partition misconfigured | Check scontrol show partition batch |
Oxymakefile Format
OxyMake workflows are defined in Oxymakefile.toml, a declarative TOML file.
This page is the complete format reference.
Top-Level Fields
ox_version = "0.1" # Required. OxyMake format version.
Config Section
The [config] section defines workflow-level variables used for wildcard
expansion:
[config]
samples = ["A", "B", "C"]
chromosomes = ["chr1", "chr2", "chr3"]
models = ["linear", "ridge", "lasso"]
Config values are arrays of strings. They drive wildcard expansion in rules.
Rule Definitions
Each rule is a [rule.<name>] table:
[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.txt"]
shell = "python process.py {input} {output}"
Rule Fields
| Field | Type | Required | Description |
|---|---|---|---|
| input | Array of strings | No | Input file patterns with {wildcards} |
| output | Array of strings | Yes | Output file patterns with {wildcards} |
| shell | String | One of shell/run/script/call | Opaque shell command |
| run | String | One of shell/run/script/call | Inline script (with lang) |
| script | String | One of shell/run/script/call | Path to script file |
| call | String | One of shell/run/script/call | Python function reference |
| lang | String | With run/script | Language: python, r, julia |
| tags | Array of strings | No | Tags for filtering and grouping |
| resources | Table | No | Resource requirements |
| env | String | No | Environment to use |
| when | String | No | Conditional guard expression |
| materialize | String | No | always, auto, never, final |
| params | Table | No | Rule-specific parameters |
Execution Modes
Four modes form a spectrum from flexibility to optimizability:
shell -- Opaque shell command. Maximum flexibility, no optimization.
[rule.align]
shell = "bwa mem ref.fa {input} > {output}"
run -- Inline script with language specification.
[rule.stats]
lang = "python"
run = """
import pandas as pd
df = pd.read_csv("{input}")
df.describe().to_csv("{output}")
"""
script -- External script file.
[rule.analyze]
lang = "python"
script = "scripts/analyze.py"
call -- Pure function reference. Supports in-memory Arrow IPC passing.
[rule.features]
input = [{ path = "data/{sample}.parquet", format = "parquet" }]
output = [{ path = "features/{sample}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline.features:compute_features"
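The call value uses a module:function form. How OxyMake loads the function is not described here; the sketch below only shows the generic Python way to resolve such a reference with importlib. The helper name resolve_call is hypothetical, and json:dumps stands in for a real pipeline function.

```python
from importlib import import_module


def resolve_call(ref: str):
    """Resolve a 'module:function' string to the callable it names."""
    module_name, _, func_name = ref.partition(":")
    if not module_name or not func_name:
        raise ValueError(f"expected 'module:function', got {ref!r}")
    func = getattr(import_module(module_name), func_name)
    if not callable(func):
        raise TypeError(f"{ref!r} does not name a callable")
    return func


# Stand-in reference from the standard library (not an OxyMake pipeline).
dumps = resolve_call("json:dumps")
print(dumps({"ok": True}))
```

A reference without a colon raises ValueError early, which is easier to debug than an AttributeError at execution time.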
Wildcards
Wildcards in {braces} are resolved from [config] arrays or inferred
from existing files:
[config]
samples = ["A", "B"]
[rule.process]
input = ["data/{sample}.csv"] # {sample} expanded from config.samples
output = ["results/{sample}.txt"]
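To make the expansion concrete, here is a minimal Python sketch of expanding a pattern over config arrays. It assumes that a wildcard such as {sample} reads from the config key formed by appending "s" (samples), as the comment above suggests; the real lookup rule and the function name expand are assumptions, not OxyMake's implementation.

```python
import re
from itertools import product


def expand(pattern: str, config: dict[str, list[str]]) -> list[str]:
    """Expand every {wildcard} in pattern over its config array."""
    # Unique wildcard names, in order of first appearance.
    names = list(dict.fromkeys(re.findall(r"\{(\w+)\}", pattern)))
    # Assumption: wildcard {sample} is driven by config["samples"].
    values = [config[name + "s"] for name in names]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*values)
    ]


config = {"samples": ["A", "B"]}
print(expand("data/{sample}.csv", config))
```

With two wildcards in one pattern, this sketch produces the Cartesian product of both arrays.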
Resources
[rule.heavy_job]
output = ["results/big.txt"]
shell = "compute_heavy"
resources = { cpus = 4, mem_gb = 16, gpu = 1, time_min = 60 }
Conditional Guards
[rule.expensive]
output = ["results/{seed}.txt"]
shell = "compute {seed}"
when = "seed in @selected_seeds"
Guards are evaluated at DAG resolution time. Jobs whose guard is false are never created.
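One way to picture this is as a filter over candidate wildcard bindings that runs before any job exists. The Python sketch below is illustrative only: the names resolve_jobs and Binding are hypothetical, and the guard is written as a plain Python callable rather than parsed from a when string.

```python
from typing import Callable

Binding = dict[str, str]


def resolve_jobs(
    bindings: list[Binding], guard: Callable[[Binding], bool]
) -> list[Binding]:
    """Keep only the bindings whose guard holds; the rest never become jobs."""
    return [b for b in bindings if guard(b)]


config = {"seeds": ["1", "2", "3"], "selected_seeds": ["1", "3"]}
candidates = [{"seed": s} for s in config["seeds"]]

# Mirrors: when = "seed in @selected_seeds"
jobs = resolve_jobs(candidates, lambda b: b["seed"] in config["selected_seeds"])
print(jobs)
```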
Include Directives
Split large workflows across files:
include = ["rules/alignment.toml", "rules/qc.toml"]
Environment Specification
[env.analysis]
type = "uv"
requirements = "requirements.txt"
[rule.analyze]
env = "analysis"
Supported environment types: system, uv, conda, docker, nix.
Next Steps
- CLI Commands -- how to run workflows
- Expression Language -- guard and expression syntax
- Configuration -- project-level settings
CLI Commands
OxyMake provides the ox command-line tool. Every command supports --json
for structured NDJSON output.
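NDJSON is one JSON object per line, so a consumer can process the stream line by line. The following Python sketch counts records by a key; the field names event and job are hypothetical placeholders, because the event schema is not specified on this page.

```python
import json
from collections import Counter


def count_events(ndjson_text: str, key: str = "event") -> Counter:
    """Count NDJSON records by the value of `key`, skipping blank lines."""
    counts: Counter = Counter()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        counts[record.get(key, "<missing>")] += 1
    return counts


# Hypothetical sample stream; real field names may differ.
sample = (
    '{"event": "job_started", "job": "stats-alice"}\n'
    '{"event": "job_finished", "job": "stats-alice"}\n'
)
print(count_events(sample))
```

In practice the text would come from the stdout of a command run with --json.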
Core Commands
ox init
Initialize a new OxyMake project in the current directory.
ox init
Creates a starter Oxymakefile.toml and .oxymake/ directory.
ox run
Execute the workflow, ensuring requested outputs exist.
ox run # Build default targets
ox run results/report.html # Build a specific target
ox run -j 8 # Parallel execution (8 jobs)
ox run --rule stats # Only run jobs from a rule (exact or /regex/)
ox run --json # Structured NDJSON output
ox run --note "experiment v2" # Annotate the run
ox run --no-cache # Ignore the cache, re-run everything
Options:
- -j N, --jobs N -- Maximum concurrent jobs (default: 1)
- --rule RULE -- Only run jobs from this rule (exact name or /regex/)
- -k, --keep-going -- Continue independent jobs after a failure
- -n, --dry-run -- Show what would run without executing
- --json -- Emit NDJSON events on stdout
- --report-json PATH -- Write the NDJSON event stream to a file
- --note TEXT -- Attach a note to this run
- --no-cache -- Ignore cached outputs and re-execute
- --executor EXEC -- Choose executor: local (default), slurm, ray
Exit codes:
- 0 -- Success (all jobs succeeded or were cached)
- 1 -- Runtime error or one or more jobs failed
- 2 -- Command-line usage error
ox plan
Show the execution plan without running anything.
ox plan # Show what would run (optimized)
ox plan --json # Structured plan output
ox plan --no-optimize # Show the raw plan (skip optimization passes)
ox plan --level rules # Show the RuleGraph instead of the JobGraph
ox lint
Validate the Oxymakefile without executing.
ox lint # Check for errors
ox lint --json # Structured diagnostics
Checks for: syntax errors, missing inputs, cycles, ambiguous rules, undefined wildcards.
Inspection Commands
ox dag
Visualize the dependency graph.
ox dag # Graphviz DOT output (default)
ox dag --format mermaid # Mermaid graph syntax
ox dag --group-by rule # Collapse nodes by field
ox dag --json # Structured JSON
ox status
Show current execution status.
ox status # Summary of current state
ox status --json # Structured status
ox logs
View job logs.
ox logs stats-alice # Logs for a specific job
ox logs --failed # Logs for all failed jobs
ox history
List past runs.
ox history # Recent runs
ox history --json # Structured history
Management Commands
ox gate
Manage gates (human-in-the-loop checkpoints).
ox gate list # Show pending gates
ox gate approve qc_check # Approve a gate
ox gate approve qc_check --reason "ok" # Approve with reason
ox snapshot
Manage workflow snapshots for comparison.
ox snapshot save baseline-v1 # Save current state
ox snapshot diff baseline-v1 # Compare with snapshot
ox snapshot list # List snapshots
ox invalidate
Invalidate cached outputs to force re-execution.
ox invalidate stats # Invalidate a rule
ox invalidate results/alice.txt # Invalidate a specific output
ox clean
Remove outputs and cache.
ox clean # Remove all outputs
ox clean --cache # Also remove cache
ox clean --state # Delete a corrupt state.db (it is a regenerable cache)
ox cancel
Cancel running jobs.
ox cancel # Cancel all running jobs
ox cancel stats-alice # Cancel a specific job
ox top
Live TUI dashboard for monitoring execution.
ox top # Interactive dashboard
Shows real-time job status, resource utilization, and DAG progress.
Global Options
Every command accepts:
| Flag | Description |
|---|---|
| --color <MODE> | Color output mode (auto, always, never) |
| -V, --version | Print version |
| -h, --help | Print help |
Most subcommands additionally accept --json (structured NDJSON output) and
-v/-vv (increase verbosity).
Next Steps
- Oxymakefile Format -- workflow definition reference
- Configuration -- project settings
ox lock
Generate or verify a reproducibility lockfile.
The ox lock command captures a cryptographic snapshot of the entire workflow
— rule definitions, config values, input hashes — into an ox.lock file. Use
it to detect unintended changes between runs or across machines.
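The ox.lock format itself is not described here. As a rough illustration of how a single digest can cover rule definitions, config values, and input hashes, the sketch below hashes a canonical JSON encoding with SHA-256. Both choices, and the function name workflow_digest, are assumptions rather than OxyMake's documented scheme.

```python
import hashlib
import json


def workflow_digest(rules: dict, config: dict, input_hashes: dict[str, str]) -> str:
    """Return a hex digest that changes when any part of the workflow changes."""
    # sort_keys makes the encoding independent of dict insertion order.
    payload = json.dumps(
        {"rules": rules, "config": config, "inputs": input_hashes},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rules = {"process": {"shell": "python process.py {input} {output}"}}
before = workflow_digest(rules, {"samples": ["A", "B"]}, {"data/A.csv": "abc123"})
after = workflow_digest(rules, {"samples": ["A", "B", "C"]}, {"data/A.csv": "abc123"})
print(before != after)
```

Canonical encoding matters here: without a stable key order, two identical workflows could produce different digests.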
Subcommands
ox lock generate
Generate an ox.lock file from the current workflow state.
ox lock generate # Write ox.lock next to Oxymakefile.toml
ox lock generate -o locks/my.lock # Write to a custom path
ox lock generate -f path/Oxymakefile.toml
Options:
| Flag | Description |
|---|---|
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |
| -o, --output <OUTPUT> | Output lockfile path (default: ox.lock next to the Oxymakefile) |
ox lock verify
Verify the current state against an existing ox.lock.
ox lock verify # Verify against ox.lock
ox lock verify -l locks/my.lock # Verify against a custom lockfile
Options:
| Flag | Description |
|---|---|
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |
| -l, --lockfile <LOCKFILE> | Lockfile path (default: ox.lock next to the Oxymakefile) |
Exit codes:
- 0 -- Lock matches current state
- 1 -- Mismatch detected (details printed to stderr)
Examples
# Pin the workflow before a release
ox lock generate
git add ox.lock && git commit -m "lock: pin workflow v2.1"
# CI: verify nothing drifted
ox lock verify || { echo "Workflow changed since lock!"; exit 1; }
See Also
- CLI Commands — full command index
- Content-Addressable Cache — how OxyMake tracks state
ox test
Test and validate a workflow without executing it.
The ox test command resolves the DAG, checks for structural errors, and
optionally simulates execution order — all without running any shell commands.
Use it to catch misconfigurations before committing to a full run.
Usage
ox test # Validate entire workflow
ox test results/report.html # Validate a specific target
ox test --dry-run # Simulate execution order
ox test --json # Output NDJSON diagnostics
Arguments
| Argument | Description |
|---|---|
| [TARGETS]... | Target files or patterns to test (default: all) |
Options
| Flag | Description |
|---|---|
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |
| -n, --dry-run | Simulate execution order without running |
| --json | Output NDJSON |
What It Checks
- Oxymakefile parses without errors
- All wildcards resolve against [config] values
- Dependency graph is acyclic
- Every input is either a source file or produced by a rule
- Wildcard constraints are satisfied
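The acyclicity check in the list above can be illustrated with Python's standard graphlib module, which raises CycleError when no valid order exists. This is a generic sketch of the technique, not OxyMake's resolver; the rule names and the function execution_order are made up for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid order; raises CycleError if the graph has a cycle.

    `deps` maps each node to the set of nodes it depends on.
    """
    return list(TopologicalSorter(deps).static_order())


good = {"report": {"process"}, "process": {"data"}}
print(execution_order(good))

bad = {"a": {"b"}, "b": {"a"}}
try:
    execution_order(bad)
except CycleError as err:
    print("cycle detected:", err.args[1])
```

The same ordering is what a dry run would simulate: dependencies always appear before the jobs that need them.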
Examples
# Quick validation in CI
ox test || exit 1
# Check a single target's dependency chain
ox test results/{sample}_stats.tsv
# Dry-run to see execution order
ox test --dry-run
ox dashboard
Web dashboard for monitoring and DAG visualization.
The ox dashboard command starts a local HTTP server that serves an interactive
web UI. The dashboard reads from the OxyMake state database and provides
real-time job status, DAG visualization, and run history.
Usage
ox dashboard # Start on http://127.0.0.1:9876
ox dashboard --port 8080 # Custom port
ox dashboard --bind 0.0.0.0 # Listen on all interfaces
ox dashboard --db path/to/state.db # Custom state database
Options
| Flag | Description |
|---|---|
| --db <DB> | Path to state.db (default: .oxymake/state.db) |
| --port <PORT> | Port to listen on (default: 9876) |
| --bind <BIND> | Bind address (default: 127.0.0.1) |
Features
- Status cards — at-a-glance counts of running, succeeded, and failed jobs
- DAG visualization — interactive dependency graph
- Job table — sortable list of all jobs with status and timing
- Run history — browse past runs and their outcomes
Examples
# Start dashboard alongside a long-running workflow
ox run -j 8 &
ox dashboard
# Open http://127.0.0.1:9876 in a browser
# Expose to the local network (e.g. for a shared workstation)
ox dashboard --bind 0.0.0.0 --port 8080
ox translate
Translate a Snakefile into OxyMake TOML.
The ox translate command parses a Snakemake Snakefile and emits an
equivalent Oxymakefile.toml. Use it to migrate existing Snakemake workflows
to OxyMake without rewriting rules by hand.
Usage
ox translate Snakefile # Writes Snakefile.translated.toml
ox translate Snakefile -o Oxymakefile.toml # Writes a custom path
When -o is omitted, the translator writes two files next to the input:
- <INPUT>.translated.toml -- the generated Oxymakefile
- <INPUT>.translated.toml.escalations.toml -- written only when the IR contains escalations
Every run emits a one-line summary to stderr:
translated: N rules (X mechanical, Y with escalations); dropped: Z unsupported top-level constructs; includes: K files NOT followed
ox translate exits with status 2 when escalations were recorded so CI
or shell scripts can gate on a clean translation. The files are still
written; only the exit code changes.
Arguments
| Argument | Description |
|---|---|
| <SNAKEFILE> | Path to the Snakefile to translate |
Options
| Flag | Description |
|---|---|
| -o, --output <OUTPUT> | Write the translated TOML to this path instead of the default <INPUT>.translated.toml. The escalation file lands at <OUTPUT>.escalations.toml. |
Translation Notes
The translator handles the most common Snakemake patterns:
- rule blocks → [[rule]] sections
- input/output → inputs/outputs
- expand() calls → OxyMake wildcard {sample} syntax
- params → [rule.params]
- shell → command
Complex Python logic inside Snakefiles (e.g., run: blocks, conditional
inputs, lambda wildcards) may require manual adjustment after translation.
Review the generated TOML and run ox lint to verify.
Examples
# Quick migration — produces Snakefile.translated.toml
ox translate Snakefile
ox lint -f Snakefile.translated.toml # Verify the result
ox plan -f Snakefile.translated.toml # Check execution plan
# Custom output path
ox translate Snakefile -o Oxymakefile.toml
# CI gate: fail the job when escalations were emitted
ox translate Snakefile || { echo "needs manual review"; exit 1; }
See Also
- Oxymakefile Format — full TOML reference
- ox lint — validate the generated file
ox query
Query the dependency graph using Bazel-style expressions.
Usage
ox query <EXPRESSION> [OPTIONS]
Expressions
| Expression | Description |
|---|---|
| deps(X) | All transitive dependencies of target X |
| rdeps(X) | All targets that transitively depend on X |
| allpaths(X, Y) | All paths from X to Y in the DAG |
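The semantics of deps and rdeps can be sketched as graph traversals. The Python below is a conceptual illustration on a tiny hand-written graph, not the query engine itself; the graph contents and function bodies are assumptions made for the example.

```python
def deps(graph: dict[str, set[str]], target: str) -> set[str]:
    """All nodes reachable from target by following dependency edges."""
    seen: set[str] = set()
    stack = [target]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def rdeps(graph: dict[str, set[str]], target: str) -> set[str]:
    """All nodes that transitively depend on target."""
    reverse: dict[str, set[str]] = {}
    for node, parents in graph.items():
        for parent in parents:
            reverse.setdefault(parent, set()).add(node)
    # Reverse dependencies are forward dependencies in the reversed graph.
    return deps(reverse, target)


# node -> direct dependencies (hypothetical three-rule workflow)
graph = {"annotate": {"align"}, "align": {"data"}, "data": set()}
print(sorted(deps(graph, "annotate")))
print(sorted(rdeps(graph, "data")))
```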
Options
| Flag | Description |
|---|---|
| --json | Output JSON instead of human-readable text |
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |
Examples
# What does annotate depend on?
ox query 'deps(annotate)'
# What depends on the data rule? (reverse dependencies)
ox query 'rdeps(data)'
# All paths from data to annotate
ox query 'allpaths(data, annotate)'
# JSON output for programmatic use
ox query 'deps(annotate)' --json
ox export
Export an Oxymakefile to another workflow format.
Usage
ox export <FORMAT> [OPTIONS]
Formats
| Format | Description |
|---|---|
| snakemake | Export to Snakemake format (Snakefile + config.yaml) |
Options
| Flag | Description |
|---|---|
| -f, --file <FILE> | Path to the Oxymakefile (default: Oxymakefile.toml) |
| -o, --output <FILE> | Write output to a file instead of stdout |
Examples
# Export to stdout
ox export snakemake
# Export to file
ox export snakemake -o Snakefile
# Export a specific Oxymakefile
ox export snakemake -f pipelines/Oxymakefile.toml -o Snakefile
Bidirectional Translation
OxyMake supports bidirectional Snakemake translation:
- Import: ox translate Snakefile converts Snakemake to OxyMake TOML
- Export: ox export snakemake converts OxyMake TOML back to Snakemake
This lets you move a workflow in either direction. As noted under ox translate, complex Python logic in a Snakefile may still need manual adjustment.
See Also
- ox translate -- import from Snakemake
- Oxymakefile Format -- the OxyMake workflow format
Configuration
OxyMake uses a layered configuration system. Workflow-level settings live in
Oxymakefile.toml, and project-level settings live in .oxymake/config.toml.
Workflow Configuration
The [config] section in Oxymakefile.toml defines variables for wildcard
expansion:
[config]
samples = ["A", "B", "C"]
models = ["linear", "ridge"]
These values drive wildcard resolution in rules.
Project Settings
The .oxymake/config.toml file (created by ox init) stores project-level
defaults:
[defaults]
jobs = 4 # Default -j value
executor = "local" # Default executor
materialize = "always" # Default materialization policy
[cache]
dir = ".oxymake/cache" # Cache directory location
max_size_gb = 10 # Maximum cache size
[state]
dir = ".oxymake" # State directory
Environment Variables
OxyMake respects the following environment variables:
| Variable | Description | Default |
|---|---|---|
| OXYMAKE_JOBS | Default parallelism | 1 |
| OXYMAKE_EXECUTOR | Default executor | local |
| OXYMAKE_CACHE_DIR | Cache directory | .oxymake/cache |
| OXYMAKE_LOG | Log level | warn |
| OX_CACHE_VALIDATION | Cache validation strategy (mtime, mtime+hash, hash) | mtime+hash |
Configuration Precedence
Settings are resolved in order (later overrides earlier):
- Built-in defaults
- User global config (~/.config/oxymake/config.toml)
- .oxymake/config.toml
- Environment variables
- Command-line flags
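The layering above behaves like a chain of dictionaries in which the first layer that defines a key wins. A minimal Python sketch using collections.ChainMap follows; the function name resolve_settings and the sample values are illustrative, not OxyMake internals.

```python
from collections import ChainMap


def resolve_settings(
    defaults: dict, user_global: dict, project: dict, env: dict, cli: dict
) -> dict:
    """Merge layers so that later sources override earlier ones."""
    # ChainMap searches left to right, so the highest-precedence layer goes first.
    return dict(ChainMap(cli, env, project, user_global, defaults))


settings = resolve_settings(
    defaults={"jobs": 1, "executor": "local"},
    user_global={},
    project={"jobs": 4},   # e.g. from .oxymake/config.toml
    env={"jobs": 2},       # e.g. from OXYMAKE_JOBS
    cli={},                # no -j flag given
)
print(settings)
```

Here the environment layer overrides the project file for jobs, while executor falls through to the built-in default.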
State Directory
The .oxymake/ directory contains:
.oxymake/
state.db # SQLite execution state + audit log
cache/ # Content-addressable output cache
config.toml # Project settings
The state database (state.db) uses SQLite WAL mode for concurrent access.
It must reside on local disk (not NFS/Lustre/GPFS).
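WAL is a standard SQLite journal mode, enabled with a PRAGMA. SQLite's WAL implementation coordinates readers and writers through shared memory on a single host, which is why network filesystems are unsuitable. The Python sketch below shows the mechanism on a throwaway database; it is not OxyMake's code, and the helper name open_wal and the temporary path are assumptions for the example.

```python
import os
import sqlite3
import tempfile


def open_wal(path: str) -> sqlite3.Connection:
    """Open a SQLite database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # The PRAGMA returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_wal(os.path.join(tmp, "state.db"))
    conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO jobs VALUES ('stats-alice', 'running')")
    conn.commit()
    print(conn.execute("SELECT status FROM jobs").fetchone()[0])
    conn.close()
```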
Next Steps
- Oxymakefile Format -- workflow definition reference
- CLI Commands -- command reference
- Expression Language -- expression syntax
Expression Language
OxyMake includes a minimal expression language for conditional guards and dynamic values in workflow definitions. The language is deliberately limited: pure functions, no loops, no side effects.
Guard Expressions
The when field on a rule accepts a boolean expression:
[rule.expensive_model]
output = ["results/{seed}_{model}.txt"]
shell = "train --seed {seed} --model {model}"
when = "seed in @selected_seeds"
If the guard evaluates to false, the job is not created in the DAG.
Supported Operators
Membership
when = "sample in @high_priority_samples" # Check if wildcard is in a config list
when = "model in ['linear', 'ridge']" # Check against inline list
Comparison
when = "wildcards.threshold >= 0.5"
when = "wildcards.replicate != 'control'"
Logical
when = "sample in @fast_samples and model == 'linear'"
when = "not (sample in @excluded)"
Variable References
Wildcards
Access wildcard values with bare names or the wildcards. prefix:
shell = "process {sample}" # Bare wildcard in commands
when = "wildcards.sample in @selected" # Explicit prefix in guards
Config References
Reference config arrays with @:
when = "sample in @priority_samples" # @name refers to config.name
Built-in Variables
| Variable | Description |
|---|---|
| {input} | Resolved input path(s) |
| {output} | Resolved output path(s) |
| {wildcards.NAME} | Resolved wildcard value |
| {params.NAME} | Rule parameter value |
| {rule} | Rule name |
String Interpolation
In shell, run, and script fields, {braces} perform string
interpolation:
shell = "python process.py --input {input} --output {output} --sample {wildcards.sample}"
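Python's str.format follows a similar convention, so it serves as a rough model of this substitution: named placeholders, dotted access such as {wildcards.sample}, and doubled braces for literal braces. The sketch below is an analogy only; OxyMake's own interpolation engine may differ in details, and the render helper is hypothetical.

```python
from types import SimpleNamespace


def render(template: str, **context) -> str:
    """Substitute {name} and {name.attr} placeholders; {{ and }} stay literal."""
    return template.format(**context)


command = render(
    "python process.py --input {input} --output {output} --sample {wildcards.sample}",
    input="data/A.csv",
    output="results/A.txt",
    wildcards=SimpleNamespace(sample="A"),
)
print(command)

# Doubled braces survive as single literal braces.
print(render('result = {{"key": "value"}}'))
```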
Double braces {{ and }} produce literal braces (useful in Python code):
run = """
result = {{"key": "value"}}
"""
Design Philosophy
The expression language is intentionally not Turing-complete. Complex configuration logic should happen outside the Oxymakefile:
python gen_config.py > config.toml # Generate config externally
ox run --config config.toml # Use generated config
This preserves static parseability: any tool can read an Oxymakefile without executing code.
Next Steps
- Oxymakefile Format -- complete format reference
- Rules and Wildcards -- wildcard patterns
- Tags and Filtering -- tag-based job selection