OxyMake

Next-generation workflow orchestration in Rust. The uv of computational workflows.

OxyMake is a fast, declarative workflow orchestration tool that combines the proven ideas of Snakemake (file-based rules, backward-chaining DAG, wildcards) with modern engineering: content-addressable caching, polyglot execution, in-memory data passing, and first-class support for both human and AI agent users.

Key Features

  • Fast DAG resolution: a 10K-job DAG resolves in 69 ms on an M4 Max, 33.3× faster than Snakemake 7.32.4. 100K-job scaling was out of scope for this benchmark wave, and cold end-to-end runs are slower than Snakemake, a deliberate trade for content-addressable correctness.
  • Content-addressable: no phantom re-runs from git checkout or file copies
  • Polyglot: shell, Python, R, Julia — each rule chooses its language
  • Daemon-free: ox run starts, works, exits. No server to manage.
  • Agent-friendly: --json output, structured events, typed API
  • Scales: same workflow on laptop, SLURM cluster, or Ray cluster (Kubernetes designed, not yet implemented)

Quick Example

# Oxymakefile.toml
ox_version = "0.1"

[config]
samples = ["A", "B", "C"]

[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.json"]
shell = "python process.py {input} {output}"

[rule.report]
input = ["results/{sample}.json"]
output = ["reports/summary.html"]
shell = "python report.py {input} > {output}"

ox run                    # build everything
ox run -j 8               # 8 parallel jobs
ox status                 # what's running?
ox plan                   # what would run?

Installation

OxyMake is a single binary called ox, written in Rust. There are several ways to install it.

Install (from source)

git clone https://github.com/noogram/oxymake.git
cd oxymake
cargo install --path crates/ox-cli

This installs both ox and oxymake to ~/.cargo/bin/. Make sure this directory is in your $PATH.

Development setup

For working on OxyMake itself:

git clone https://github.com/noogram/oxymake.git
cd oxymake
cargo build                    # debug build → target/debug/ox
cargo test --workspace         # run all tests
cargo run --bin ox -- --help   # run without installing

With just (recommended):

just build      # debug build
just test       # all tests
just demo       # interactive feature demo
just lint       # clippy checks
just ci         # full CI check (fmt + lint + test + demo)
just --list     # all available recipes

Prerequisites

Required

  • Rust 1.85+ (for installation from source)

Optional (depending on your workflow)

  • Python 3.9+ -- for rules using lang = "python"
  • uv -- for environment = { uv = "pyproject.toml" } (install uv)
  • conda/mamba -- for environment = { conda = "..." }
  • Docker -- for environment = { docker = "..." }
  • Nix -- for environment = { nix = "..." }

Verify Installation

ox --version
# ox 0.1.0

ox init
# Initialized OxyMake project in .
#   Created: Oxymakefile.toml
#   Created: .oxymake/

What Gets Installed

OxyMake is a single binary with no runtime dependencies. All state is stored in a .oxymake/ directory within your project:

your-project/
  Oxymakefile.toml       # Your workflow definition
  .oxymake/
    state.db             # SQLite execution state
    cache/               # Content-addressable cache
    logs/                # Job execution logs

No daemon, no server, no background processes. Each ox run is a self-contained process that reads state, executes, writes state, and exits.

Next Steps

Now that OxyMake is installed, head to Your First Workflow to build something.

Quickstart

Get up and running with OxyMake in under five minutes. This guide covers only features that are tested and working in v0.1.0.

Install

Build and install from source (Rust 1.85+ required):

git clone https://github.com/noogram/oxymake.git
cd oxymake
cargo install --path crates/ox-cli

This installs both ox and oxymake to ~/.cargo/bin/.

Verify:

ox --version
# ox 0.1.0

Create a Project

mkdir my-pipeline
cd my-pipeline
ox init

This creates a starter Oxymakefile.toml and a .oxymake/ directory.

The generated template uses {input} and {output} placeholders for input/output file expansion, plus {config.key} for config substitution.

Your First Workflow

Create the Oxymakefile:

cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"

[config]
samples = ["A", "B"]

# Default target: require all results to exist.
[rule.all]
input = ["results/{sample}.txt"]

# Process each sample's CSV into a sorted text file.
[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.txt"]
shell = "sort data/{sample}.csv > results/{sample}.txt"
EOF

Key concepts:

  • [config] defines variables. Here samples = ["A", "B"] means OxyMake will create one job per sample.
  • {sample} in paths and shell commands is replaced with each value from the config list.
  • [rule.all] is the default target. It has inputs but no outputs, so it just ensures its inputs exist.
  • Use explicit paths with config variables in shell commands (e.g., data/{sample}.csv), not {input}/{output}.

Create some input data:

mkdir -p data results
# printf is used instead of `echo -e`, which is not portable across shells
printf 'charlie,3\nalpha,1\nbravo,2\n' > data/A.csv
printf 'zulu,26\nmike,13\n' > data/B.csv

Validate

Check your Oxymakefile for errors:

ox lint
# Oxymakefile is valid (2 rules)

Preview (Dry Run)

See what OxyMake would do without running anything:

ox run --dry-run

Output:

Dry run: 2 job(s) would execute for 2 target(s)
  [process-B] rule=process outputs=[results/B.txt]
  [process-A] rule=process outputs=[results/A.txt]

Run

Execute the workflow:

ox run

Output:

Completed: 2 succeeded, 0 failed, 0 skipped, 0 cancelled (0.0s)

Check the results:

cat results/A.txt
# alpha,1
# bravo,2
# charlie,3

Caching

Run the same command again:

ox run

Output:

Cache: 2 of 2 job(s) up-to-date, skipping.
Completed: 0 succeeded, 0 failed, 2 skipped, 0 cancelled (0.0s)

Nothing ran. OxyMake detected that all inputs are unchanged and all outputs exist. Modify an input and re-run to see only the affected jobs execute.

Build a Specific Target

Build only one output:

rm results/A.txt
ox run results/A.txt

Only process-A runs. results/B.txt is untouched.

Multi-Step Pipeline

OxyMake resolves dependency chains automatically. Here is a two-step pipeline that uppercases text, then counts characters:

mkdir pipeline && cd pipeline

cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"

[config]
names = ["alice", "bob"]

[rule.all]
input = ["final/{name}.txt"]

[rule.uppercase]
input = ["raw/{name}.txt"]
output = ["mid/{name}.txt"]
shell = "tr '[:lower:]' '[:upper:]' < raw/{name}.txt > mid/{name}.txt"

[rule.count]
input = ["mid/{name}.txt"]
output = ["final/{name}.txt"]
shell = "wc -c < mid/{name}.txt > final/{name}.txt"
EOF

mkdir -p raw mid final
echo "hello world" > raw/alice.txt
echo "oxymake rocks" > raw/bob.txt

ox run --dry-run
# 4 jobs: uppercase-alice, uppercase-bob, count-alice, count-bob

ox run
# Completed: 4 succeeded, 0 failed, 0 skipped, 0 cancelled (0.0s)

cat final/alice.txt
# 12

OxyMake figures out that count depends on uppercase and runs them in the correct order.
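The ordering step can be illustrated with Python's standard `graphlib` module. This is only a sketch of topological ordering in general, not OxyMake's implementation: the job names come from the dry run above, and the dependency map is written by hand here, whereas OxyMake derives it from the input and output patterns.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on.
# Hand-written for illustration; OxyMake infers these edges itself.
deps = {
    "uppercase-alice": set(),
    "uppercase-bob": set(),
    "count-alice": {"uppercase-alice"},
    "count-bob": {"uppercase-bob"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any order in which each `uppercase-*` job precedes its matching `count-*` job is valid, so the exact sequence printed may differ between runs.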

Error Handling

If a job fails, OxyMake stops and reports the failure:

cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[rule.broken]
output = ["out.txt"]
shell = "exit 1"
EOF

ox run
# error: job broken failed: exit code 1
# Completed: 0 succeeded, 1 failed, 0 skipped, 0 cancelled (0.0s)
# Exit code: 1

Use --keep-going (or -k) to continue running independent jobs even when one fails:

cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[config]
items = ["ok", "fail"]

[rule.all]
input = ["out/{item}.txt"]

[rule.process]
input = ["in/{item}.txt"]
output = ["out/{item}.txt"]
shell = "if [ '{item}' = 'fail' ]; then exit 1; fi; cp in/{item}.txt out/{item}.txt"
EOF

mkdir -p in out
echo "good" > in/ok.txt
echo "bad" > in/fail.txt

ox run -k
# Completed: 1 succeeded, 1 failed, 0 skipped, 0 cancelled (0.0s)
# Exit code: 1
# out/ok.txt was created; out/fail.txt was not

Static Rules

Rules without config variables produce a single job:

cat > Oxymakefile.toml << 'EOF'
ox_version = "0.1"
[rule.greet]
output = ["greeting.txt"]
shell = "echo 'Hello OxyMake' > greeting.txt"
EOF

ox run
cat greeting.txt
# Hello OxyMake

Alternate Oxymakefile

Use -f to point to a different file:

ox run -f path/to/other.toml

Known Limitations (v0.1.0)

  • -j N (parallel execution): All jobs run sequentially regardless of the -j value.
  • --set (config override): Does not override config values.

Next Steps

  • Read Your First Workflow for a more detailed walkthrough
  • Explore ox run --help for all available options

Your First Workflow

This tutorial walks you through creating a simple 3-rule workflow from scratch. By the end, you will understand how OxyMake resolves dependencies, runs jobs, and caches results.

Step 1: Create a Project

Create a new directory and initialize OxyMake:

mkdir my-pipeline
cd my-pipeline
ox init

This creates a starter Oxymakefile.toml. We will replace its contents.

Step 2: Create Some Input Data

Create a data/ directory with two CSV files:

mkdir data

data/alice.csv:

name,score
Alice,85
Alice,92
Alice,78

data/bob.csv:

name,score
Bob,91
Bob,88
Bob,95

Step 3: Write the Workflow

Replace the contents of Oxymakefile.toml with:

ox_version = "0.1"

[config]
students = ["alice", "bob"]

# Rule 1: Compute statistics for each student
[rule.stats]
input = ["data/{student}.csv"]
output = ["results/{student}_stats.json"]
lang = "python"
run = """
import csv, json

scores = []
with open("{input}") as f:
    for row in csv.DictReader(f):
        scores.append(int(row["score"]))

stats = {
    "student": "{wildcards.student}",
    "mean": sum(scores) / len(scores),
    "min": min(scores),
    "max": max(scores),
    "count": len(scores),
}

with open("{output}", "w") as f:
    json.dump(stats, f, indent=2)
"""

# Rule 2: Combine all student stats into a summary
[rule.summary]
input = ["results/{student}_stats.json"]
output = ["results/summary.json"]
lang = "python"
run = """
import json, glob

all_stats = []
for path in sorted(glob.glob("results/*_stats.json")):
    with open(path) as f:
        all_stats.append(json.load(f))

with open("{output}", "w") as f:
    json.dump(all_stats, f, indent=2)
"""

# Rule 3: Default target -- build the summary
[rule.all]
input = ["results/summary.json"]

This workflow has three rules:

  1. stats -- computes per-student statistics (runs once per student)
  2. summary -- combines all stats into one file
  3. all -- an aggregation target that tells OxyMake what to build

Interpolation note. Inside run/shell blocks, OxyMake substitutes the placeholders it recognizes -- {input}, {output}, {wildcards.X}, {config.X}, and so on -- and leaves everything else untouched. It does not treat {{/}} as escaped braces, so write ordinary Python dict literals with single braces (stats = { ... }). The recognized placeholders are listed in the Expression Language reference.
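The behavior described in the note can be sketched in a few lines of Python. The regular expression below covers only a hypothetical subset of placeholders, chosen for illustration; the actual set and the real substitution code are defined by OxyMake, not by this sketch.

```python
import re

# Hypothetical subset of recognized placeholders (an assumption for this sketch).
PLACEHOLDER = re.compile(r"\{(input|output|wildcards\.\w+|config\.\w+)\}")

def interpolate(template: str, values: dict) -> str:
    """Replace recognized placeholders and leave every other brace untouched."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # A recognized name without a value is left exactly as written.
        return values.get(key, match.group(0))
    return PLACEHOLDER.sub(substitute, template)

script = 'stats = {"student": "{wildcards.student}"}\nopen("{output}", "w")'
rendered = interpolate(script, {
    "wildcards.student": "alice",
    "output": "results/alice_stats.json",
})
print(rendered)
```

The outer braces of the dict literal survive unchanged because they do not form a recognized placeholder, which is why single braces are safe in inline Python.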

Step 4: Plan

Before running, see what OxyMake will do:

ox plan

You should see something like:

Plan: 3 rules, 3 jobs, 2 source files
Targets: results/summary.json
  1. [stats-bob] rule=stats -> [results/bob_stats.json]
  2. [stats-alice] rule=stats -> [results/alice_stats.json]
  3. [summary] rule=summary -> [results/summary.json]

OxyMake resolved the {student} wildcard from config.students and created two concrete jobs for the stats rule (with the ids stats-alice and stats-bob), plus one for summary.

Step 5: Run

ox run

Output (timings will vary):

  Resolving 3 jobs (3 to run, 0 cached)
  ▸ summary — upstream rebuilt
  ✓ Completed 3/3 in 0.6s (4.8 jobs/s)
    3 succeeded
Completed: 3 succeeded, 0 failed, 0 skipped, 0 cancelled (0.6s)

The last line is the canonical summary: N succeeded, N failed, N skipped, N cancelled. A run is successful when failed and cancelled are both 0.

Check the results:

cat results/alice_stats.json
{
  "student": "alice",
  "mean": 85.0,
  "min": 78,
  "max": 92,
  "count": 3
}

Step 6: See Caching in Action

Run the same command again:

ox run

Output:

Cache: 3 of 3 job(s) up-to-date, skipping.
Completed: 0 succeeded, 0 failed, 3 skipped, 0 cancelled (0.0s)

Nothing ran. OxyMake detected that all inputs are unchanged and all outputs exist with the correct content hashes, so all three jobs are reported as skipped.

Now modify one input:

echo "Alice,99" >> data/alice.csv
ox run

Output:

Cache: 1 of 3 job(s) up-to-date, skipping.
  Resolving 3 jobs (2 to run, 1 cached)
  [1/3] ✓ stats-bob [cached]
  ▸ summary — upstream rebuilt
  ✓ Completed 3/3 in 0.4s (7.5 jobs/s)
    2 succeeded, 1 skipped
Completed: 2 succeeded, 0 failed, 1 skipped, 0 cancelled (0.4s)

Only stats-alice and summary re-ran. stats-bob was cached (reported as skipped) because its input did not change.

Step 7: Add a New Student

Edit Oxymakefile.toml and add a student:

[config]
students = ["alice", "bob", "charlie"]

Create the data file:

echo "name,score
Charlie,76
Charlie,82
Charlie,90" > data/charlie.csv

Run again:

ox run
Cache: 2 of 4 job(s) up-to-date, skipping.
  Resolving 4 jobs (2 to run, 2 cached)
  [1/4] ✓ stats-alice [cached]
  [2/4] ✓ stats-bob [cached]
  ▸ summary — upstream rebuilt
  ✓ Completed 4/4 in 0.4s (10.5 jobs/s)
    2 succeeded, 2 skipped
Completed: 2 succeeded, 0 failed, 2 skipped, 0 cancelled (0.4s)

Only the new student's stats and the summary (which depends on them) were computed. Alice and Bob's stats were cached (reported as skipped).

What You Learned

  1. Rules declare intent -- input/output patterns with wildcards
  2. Config drives expansion -- students = [...] determines which jobs are created
  3. Content-addressable caching -- unchanged inputs mean cached outputs
  4. Incremental execution -- adding data or rules only computes what is new
  5. Backward chaining -- OxyMake figures out the dependency order automatically

Understanding the Output

When you run ox run, OxyMake provides structured feedback about what it is doing and why. This page explains the output formats, using the 3-rule workflow from Your First Workflow (stats for two students, plus a summary).

Terminal Output (Default)

By default, OxyMake prints human-readable progress and ends with a canonical summary line (timings will vary):

  Resolving 3 jobs (3 to run, 0 cached)
  ▸ summary — upstream rebuilt
  ✓ Completed 3/3 in 0.6s (4.8 jobs/s)
    3 succeeded
Completed: 3 succeeded, 0 failed, 0 skipped, 0 cancelled (0.6s)

The last line is the canonical summary, always in the same shape:

Completed: N succeeded, N failed, N skipped, N cancelled (<elapsed>)

  • succeeded -- jobs that ran and produced their outputs
  • failed -- jobs whose command exited non-zero
  • skipped -- jobs whose outputs were already up to date (cache hits)
  • cancelled -- jobs that did not run because an upstream job failed

A run is successful (exit code 0) when both failed and cancelled are 0.
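Because the summary line has a fixed shape, a script can parse it with one regular expression. The helper below is a sketch; `parse_summary` is a name invented for this example, and for automation the `--json` output is likely the more robust interface.

```python
import re

SUMMARY = re.compile(
    r"^Completed: (\d+) succeeded, (\d+) failed, "
    r"(\d+) skipped, (\d+) cancelled \(([^)]*)\)$"
)

def parse_summary(line: str) -> dict:
    """Parse the canonical summary line into counts plus a success flag."""
    match = SUMMARY.match(line.strip())
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    succeeded, failed, skipped, cancelled = map(int, match.group(1, 2, 3, 4))
    return {
        "succeeded": succeeded,
        "failed": failed,
        "skipped": skipped,
        "cancelled": cancelled,
        "elapsed": match.group(5),
        # Success rule as stated above: no failed and no cancelled jobs.
        "ok": failed == 0 and cancelled == 0,
    }

result = parse_summary(
    "Completed: 2 succeeded, 0 failed, 1 skipped, 0 cancelled (0.4s)"
)
print(result)
```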

Cached Jobs

When outputs are already up to date, OxyMake skips the work and reports the jobs as skipped:

Cache: 3 of 3 job(s) up-to-date, skipping.
Completed: 0 succeeded, 0 failed, 3 skipped, 0 cancelled (0.0s)

On a partial re-run (one input changed), the cached jobs are listed and the summary reflects the split:

Cache: 1 of 3 job(s) up-to-date, skipping.
  Resolving 3 jobs (2 to run, 1 cached)
  [1/3] ✓ stats-bob [cached]
  ▸ summary — upstream rebuilt
  ✓ Completed 3/3 in 0.4s (7.5 jobs/s)
    2 succeeded, 1 skipped
Completed: 2 succeeded, 0 failed, 1 skipped, 0 cancelled (0.4s)

Plan Output

Use ox plan to see what would run without executing:

ox plan
Plan: 3 rules, 3 jobs, 2 source files
Targets: results/summary.json
  1. [stats-bob] rule=stats -> [results/bob_stats.json]
  2. [stats-alice] rule=stats -> [results/alice_stats.json]
  3. [summary] rule=summary -> [results/summary.json]

The header reports the totals (N rules, N jobs, N source files), followed by the requested targets and the concrete jobs, each shown as [job-id] rule=<rule> -> [outputs].

JSON Output (Agent Mode)

Add --json to ox run for structured NDJSON output -- one self-contained JSON event per line:

ox run --json
{"event":"run_started","total_jobs":3,"to_run":3,"cached":0}
{"event":"job_started","job_id":"stats-bob","executor":"local","reason":"cache_miss"}
{"event":"job_completed","job_id":"stats-bob","duration_ms":209,"outputs":["results/bob_stats.json"]}
{"event":"job_started","job_id":"stats-alice","executor":"local","reason":"cache_miss"}
{"event":"job_completed","job_id":"stats-alice","duration_ms":200,"outputs":["results/alice_stats.json"]}
{"event":"job_started","job_id":"summary","executor":"local","reason":"upstream_rebuilt"}
{"event":"job_completed","job_id":"summary","duration_ms":194,"outputs":["results/summary.json"]}
{"event":"run_completed","total":3,"succeeded":3,"failed":0,"skipped":0,"cancelled":0,"duration_ms":607}

Each event carries an event discriminant (run_started, job_started, job_completed, run_completed). This format is designed for AI agents and scripts to parse programmatically. Use --report-json <path> to write the same stream to a file. See Agent-Driven Workflows for details.
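A consumer can process the stream one line at a time. The sketch below parses a shortened copy of the events shown above from a string; in a real script the lines would come from the standard output of `ox run --json`. Only fields that appear in the sample output are used.

```python
import json

# Shortened copy of the sample events above.
stream = """\
{"event":"run_started","total_jobs":3,"to_run":3,"cached":0}
{"event":"job_completed","job_id":"stats-bob","duration_ms":209,"outputs":["results/bob_stats.json"]}
{"event":"job_completed","job_id":"summary","duration_ms":194,"outputs":["results/summary.json"]}
{"event":"run_completed","total":3,"succeeded":3,"failed":0,"skipped":0,"cancelled":0,"duration_ms":607}
"""

durations = {}
final = None
for line in stream.splitlines():
    if not line.strip():
        continue
    event = json.loads(line)  # each line is one complete JSON object
    if event["event"] == "job_completed":
        durations[event["job_id"]] = event["duration_ms"]
    elif event["event"] == "run_completed":
        final = event

# Same success rule as the terminal summary: nothing failed, nothing cancelled.
ok = final is not None and final["failed"] == 0 and final["cancelled"] == 0
print(durations, ok)
```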

DAG Visualization

Use ox dag to render the dependency graph. The default format is Graphviz DOT:

ox dag
digraph oxymake {
  rankdir=LR;
  "results/summary.json" -> "all";
  "stats" -> "results/{student}_stats.json";
  "data/{student}.csv" -> "stats";
  "summary" -> "results/summary.json";
  "results/{student}_stats.json" -> "summary";
}

Other formats:

ox dag --format mermaid       # Mermaid graph syntax
ox dag --format dot           # Graphviz DOT (same as default)
ox dag --group-by rule        # Collapse nodes by field
ox dag --json                 # Structured JSON

To trace a single target's dependency chain instead, use ox explain:

ox explain results/summary.json
Dependency chain for: results/summary.json

► 1. [summary] rule=summary
     inputs:  [results/alice_stats.json, results/bob_stats.json]
     outputs: [results/summary.json]
  2. [stats-alice] rule=stats
     inputs:  [data/alice.csv]
     outputs: [results/alice_stats.json]
  3. [stats-bob] rule=stats
     inputs:  [data/bob.csv]
     outputs: [results/bob_stats.json]

Error Output

When a job fails, OxyMake reports the failure, cancels the dependent jobs, and ends with a non-zero exit code:

  Resolving 1 jobs (1 to run, 0 cached)
  [1/1] ✗ broken FAILED (exit 1)

  error: job broken failed: exit code 1
    stderr: --- stderr ---
    stderr: boom
  ✗ Completed 1/1 in <0.1s
    1 failed
  Failed: broken
Completed: 0 succeeded, 1 failed, 0 skipped, 0 cancelled (0.0s)
  Failed jobs (showing 1 of 1):
    broken: boom
  Run 'ox logs --failed' for full details.

ox logs --failed prints the full captured output of each failed job. In --json mode, the failure is reported as a job_completed event with a non-success status, so automated tooling can recover programmatically.

Verbosity Levels

Control output detail with -v:

ox run           # Normal output
ox run -v        # Verbose: job start/end, durations, and exit codes
ox run -vv       # Debug: also show each job's stdout/stderr

Rules and Wildcards

What is a Rule?

A rule declares a transformation: given these inputs, produce these outputs by running this command. OxyMake figures out what needs to run based on what you ask for.

[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.txt"]
shell = "python process.py {input} {output}"

This single rule handles ANY sample. When you ask for results/A.txt, OxyMake matches the output pattern, extracts sample = "A", substitutes it into the input pattern to get data/A.csv, and runs the command.

Wildcards

Wildcards are placeholders in curly braces: {sample}, {cohort}, {model}. They appear in input and output file patterns.

How wildcards resolve

OxyMake uses backward chaining: start from the output you want, find which rule can produce it, extract wildcard values from the match.

You ask for: results/patient_42.txt
                     ↓
Pattern:     results/{sample}.txt
                     ↓
Extracted:   sample = "patient_42"
                     ↓
Input becomes: data/patient_42.csv
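The resolution steps above can be sketched in Python. This is an illustration of the idea rather than OxyMake's matcher: the helper names are invented, and the sketch assumes a wildcard matches one or more characters other than `/`, which the documentation does not state.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn 'results/{sample}.txt' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)  # alternates literal text and names
    regex, seen = "", set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        elif part in seen:
            regex += f"(?P={part})"  # a repeated wildcard must match the same value
        else:
            seen.add(part)
            regex += f"(?P<{part}>[^/]+)"  # assumption: no '/' inside a value
    return re.compile(regex)

def resolve_inputs(target, output_pattern, input_patterns):
    """Match a requested file against an output pattern, then derive the inputs."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        return None  # this rule cannot produce the target
    wildcards = match.groupdict()
    return wildcards, [p.format(**wildcards) for p in input_patterns]

resolved = resolve_inputs(
    "results/patient_42.txt", "results/{sample}.txt", ["data/{sample}.csv"]
)
print(resolved)
```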

Multiple wildcards

Rules can have multiple wildcards:

[rule.analyze]
input = ["data/{cohort}/{region}.parquet"]
output = ["results/{cohort}/{region}/report.html"]
shell = "python analyze.py {input} {output}"

Wildcard expansion from config

When you have a list of known values, put them in [config]:

[config]
samples = ["A", "B", "C"]

[rule.all]
input = ["results/{sample}.txt"]

The all rule has {sample} in its inputs but no outputs — it's an aggregation target. OxyMake expands {sample} from config.samples to request results/A.txt, results/B.txt, results/C.txt.

Expansion modes

When multiple wildcards expand from config lists, the expansion can be:

[config]
samples = ["A", "B"]
conditions = ["treated", "control"]

[rule.experiment]
output = ["results/{sample}_{condition}.csv"]
expand = "product"    # default: A_treated, A_control, B_treated, B_control

| Mode | Behavior | Count |
|------|----------|-------|
| product (default) | All combinations (Cartesian product) | N × M |
| zip | Parallel pairs (lengths must match) | N |
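The difference between the two modes mirrors `itertools.product` and `zip` in Python. The sketch below reuses the config values from the example; it shows the combinatorics only and is not how OxyMake performs the expansion.

```python
from itertools import product

samples = ["A", "B"]
conditions = ["treated", "control"]
pattern = "results/{sample}_{condition}.csv"

# product: every sample paired with every condition (N × M paths).
product_paths = [
    pattern.format(sample=s, condition=c) for s, c in product(samples, conditions)
]

# zip: the i-th sample paired with the i-th condition (N paths).
if len(samples) != len(conditions):
    raise ValueError("zip expansion requires lists of equal length")
zip_paths = [
    pattern.format(sample=s, condition=c) for s, c in zip(samples, conditions)
]

print(product_paths)
print(zip_paths)
```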

Wildcard constraints

Restrict which values a wildcard can take:

[rule.process]
output = ["results/{sample}.txt"]

[rule.process.wildcard_constraints]
sample = "[A-Z][a-z0-9_]*"    # regex: starts with uppercase letter
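To see which values a constraint admits, the same regex can be tried in Python. The sketch assumes the constraint must match the whole wildcard value (`fullmatch`); the candidate names are made up for the example.

```python
import re

constraint = re.compile(r"[A-Z][a-z0-9_]*")

# Hypothetical candidate values.
candidates = ["Alpha_1", "beta", "X", "A-1"]

# Assumption: the constraint applies to the entire value, not a prefix.
allowed = [c for c in candidates if constraint.fullmatch(c)]
print(allowed)
```

Under this constraint a value such as `patient_42` would be rejected, because it does not start with an uppercase letter.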

Conditional guards

Rules can apply only to certain wildcard values:

[config]
special_samples = ["X1", "X2"]

[rule.extra_analysis]
input = ["results/{sample}.txt"]
output = ["extra/{sample}_analysis.html"]
when = "sample in @special_samples"

This rule exists only for samples X1 and X2. Other samples don't get the extra analysis — no phantom nodes in the graph, no skipped jobs.

Guards support: in @list, not in @list, == 'value', != 'value', =~ 'regex'.

The Four Execution Modes

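| Mode | Keyword | Who manages I/O | In-memory possible |
|------|---------|-----------------|--------------------|
| Shell | shell = "..." | You | No |
| Inline script | run = "..." | You | No |
| External script | script = "path" | You | No |
| Pure function | call = "mod:fn" | OxyMake | Yes |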

Start with shell or run for quick prototyping. Migrate to call when your function stabilizes and you want OxyMake to optimize I/O.

See Execution Modes for details.

The Three Graphs

OxyMake uses three distinct graph representations, each at a different level of abstraction. Understanding them is key to understanding how OxyMake works — and how to debug when things go wrong.

Overview

graph TD
    A[Oxymakefile.toml] --> B["RuleGraph<br/><i>What you declared (abstract, compact)</i>"]
    B -->|"Wildcard resolution<br/>+ guard evaluation"| C["JobGraph<br/><i>What will execute (concrete, optimized)</i>"]
    C -->|"Runtime state annotation"| D["ExecGraph<br/><i>What is happening (live status)</i>"]

RuleGraph — The Logical View

The RuleGraph is what you wrote in the Oxymakefile. Each rule is a node, and edges connect rules whose output patterns match other rules' input patterns. Wildcards are NOT resolved — this is the abstract view.

In the example below, the single call node stands for all variant-calling instances of that rule, not one specific job.

$ ox plan --level=rules

  data ──→ features ──→ call ──→ annotate

What you can learn from the RuleGraph:

  • Is my pipeline structure correct?
  • Are there circular dependencies?
  • Which rules depend on which?

Inspect it: ox plan --level=rules
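How rule-level edges arise can be sketched with the three rules from Your First Workflow. As a deliberate simplification, this sketch links two rules only when an output pattern string is identical to an input pattern string; real pattern matching is more general, and the dictionary layout is invented for the example.

```python
rules = {
    "stats": {
        "input": ["data/{student}.csv"],
        "output": ["results/{student}_stats.json"],
    },
    "summary": {
        "input": ["results/{student}_stats.json"],
        "output": ["results/summary.json"],
    },
    "all": {"input": ["results/summary.json"], "output": []},
}

def rule_edges(rules: dict) -> list:
    """Return (producer, consumer) pairs that share an unresolved pattern."""
    edges = set()
    for producer, p in rules.items():
        for consumer, c in rules.items():
            # Simplification: exact string equality between patterns.
            if producer != consumer and set(p["output"]) & set(c["input"]):
                edges.add((producer, consumer))
    return sorted(edges)

edges = rule_edges(rules)
print(edges)
```

Wildcards stay unresolved throughout, which is what keeps this view compact regardless of how many students the config lists.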

JobGraph — The Physical Plan

The JobGraph is the RuleGraph after wildcard resolution. Every concrete job instance is a separate node. With 3 cohorts and 4 windows, a single features rule becomes 12 concrete jobs.

The JobGraph goes through optimization passes before execution:

| Pass | What it does |
|------|--------------|
| Cache pruning | Marks up-to-date jobs as "skip" |
| Task fusion | Merges sequential call-mode jobs |
| Materialization elimination | Removes unnecessary file I/O |
| Critical path analysis | Prioritizes bottleneck jobs |

These passes run internally; ox plan reports the resolved jobs after optimization. For the 3-rule workflow from Your First Workflow:

$ ox plan
Plan: 3 rules, 3 jobs, 2 source files
Targets: results/summary.json
  1. [stats-bob] rule=stats -> [results/bob_stats.json]
  2. [stats-alice] rule=stats -> [results/alice_stats.json]
  3. [summary] rule=summary -> [results/summary.json]

The header line summarizes the graph (N rules, N jobs, N source files), followed by the requested targets and the concrete jobs that would run.

What you can learn from the JobGraph:

  • How many concrete jobs will execute?
  • Which rule produced each job, and what outputs it writes?
  • Which jobs are already cached? (re-run after a build to see fewer jobs)

Inspect it: ox plan (optimized, the default), ox plan --no-optimize (skip the optimization passes), or ox plan --level rules to view the RuleGraph instead of the JobGraph.

ExecGraph — The Live Execution

The ExecGraph is the JobGraph annotated with runtime state. Each node carries its status (Pending → Running → Completed/Failed), timing, and resource usage.

$ ox status --group-by stage

  data          3/3 completed
  features      145/3412 running (12%)
  call          waiting (blocked)
  annotate      waiting

What you can learn from the ExecGraph:

  • What's running right now?
  • What failed and why?
  • How long has each job been running?
  • Which sessions are active?

Inspect it: ox status

The Relationship

Each graph is a refinement of the previous one:

| Property | RuleGraph | JobGraph | ExecGraph |
|----------|-----------|----------|-----------|
| Nodes | Rules (abstract) | Concrete jobs | Jobs + status |
| Wildcards | Unresolved | Resolved | Resolved |
| Size | Small (tens) | Large (thousands) | Same as JobGraph |
| Lifetime | Static (parse time) | Static (plan time) | Dynamic (runtime) |
| Changes during run | Never | Grows (checkpoints) | Continuously |

Vocabulary

To avoid confusion, OxyMake uses these terms consistently:

  • Rule = a declaration in the Oxymakefile (unresolved wildcards)
  • Job = a concrete, executable instance of a rule (wildcards resolved)
  • Pass = an optimization transformation on the JobGraph
  • Phase = a stage of the pipeline (parse → resolve → optimize → execute)

Content-Addressable Cache

One of the most frustrating things about traditional build tools is the phantom re-run: you check out a branch, and everything rebuilds even though nothing actually changed. OxyMake eliminates this by using file content as the source of truth, not timestamps.

How It Works

Every time OxyMake runs a job, it computes a cache key from everything that could affect the output:

cache_key = blake3(
    format_version ||
    rule_source_hash ||
    sorted((input_path, input_content_hash) pairs) ||
    params_hash ||
    env_content_hash ||
    shell_executable ||
    platform
)

Every field is length-framed with a domain-separation tag, so field boundaries are unambiguous and two different job specifications cannot serialize to the same hash input. If the key matches a previously computed result, the job is skipped. The key includes:

  • Rule source hash -- if you change the shell command, inline code, or function reference, the cache is invalidated
  • Input content hashes -- blake3 of every input file's contents, bound to its path; parameter files and (in script mode) the script file itself count as inputs, so editing script.py invalidates the cache
  • Params hash -- any parameters passed via --set or [config]
  • Environment content hash -- the content of the referenced spec file (requirements.txt, conda YAML, nix expression), or the container image reference for Docker/Apptainer
  • Shell executable -- the same command under /bin/bash and /bin/zsh can behave differently
  • Platform -- OS and architecture (a Linux build is not reusable on macOS)

Two exclusions to know about: call-mode function bodies are tracked only if you declare the module as an input, and mutable container tags are hashed as written (pin images by digest -- python@sha256:... -- if you need re-pushed tags to invalidate the cache).
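The key construction can be sketched in Python. Several things here are assumptions made for the sketch: `hashlib.blake2b` stands in for BLAKE3 (which is not in the standard library), and the tag names, length widths, and function signature are invented, so the keys produced will not match OxyMake's.

```python
import hashlib

def frame(tag: bytes, value: bytes) -> bytes:
    """Length-frame a field and prefix it with a domain-separation tag."""
    return (len(tag).to_bytes(4, "big") + tag
            + len(value).to_bytes(8, "big") + value)

def cache_key(rule_source: str, inputs: dict, params: str, env_spec: bytes,
              shell: str, platform: str, format_version: int = 1) -> str:
    h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
    h.update(frame(b"version", str(format_version).encode()))
    h.update(frame(b"rule", hashlib.blake2b(rule_source.encode()).digest()))
    for path in sorted(inputs):  # sorted so dict order cannot change the key
        h.update(frame(b"input-path", path.encode()))
        h.update(frame(b"input-hash", hashlib.blake2b(inputs[path]).digest()))
    h.update(frame(b"params", params.encode()))
    h.update(frame(b"env", hashlib.blake2b(env_spec).digest()))
    h.update(frame(b"shell", shell.encode()))
    h.update(frame(b"platform", platform.encode()))
    return h.hexdigest()

base = dict(rule_source="sort {input} > {output}",
            inputs={"data/A.csv": b"alpha,1\n"},
            params="", env_spec=b"", shell="/bin/bash", platform="linux-x86_64")
key = cache_key(**base)
changed = cache_key(**{**base, "shell": "/bin/zsh"})
print(key != changed)
```

Framing matters because plain concatenation would let different field splits, for example `"ab" + "c"` and `"a" + "bc"`, produce the same bytes.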

Why Not Timestamps?

Timestamps lie. Here are common situations where they cause phantom re-runs in tools like Make or Snakemake:

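| Scenario | What happens to mtime | Content changed? |
|----------|-----------------------|------------------|
| git checkout | Reset to now | No |
| cp without -p | Reset to now | No |
| NFS clock skew | Arbitrary | No |
| CI fresh clone | All files are "new" | No |
| touch command | Updated | No |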

Validation Strategies (ADR-006)

OxyMake's cache validation is pluggable — you choose the right speed/correctness tradeoff for your workflow:

| Strategy | Flag | Behavior |
|----------|------|----------|
| mtime+hash (default) | --cache-validation=mtime+hash | If mtime/size differ, compute BLAKE3 hash. Fast on steady-state, correct on change. |
| mtime (opt-in) | --cache-validation=mtime | Pure filesystem metadata (stat calls only). Fastest, but never verifies content; unsuitable for shared/multi-user caches. |
| hash | --cache-validation=hash | Always compute BLAKE3 hash. Bit-exact. Required for shared/remote caches. |

ox run                                  # default: mtime+hash (fast + content-verifying)
ox run --cache-validation=mtime         # Make-parity opt-in (no content check)
ox run --cache-validation=hash          # strict mode (CI)
OX_CACHE_VALIDATION=hash ox run         # via environment variable

Configure per project in Oxymakefile.toml:

[config]
cache_validation = "mtime+hash"

Remote caches automatically promote to hash regardless of the configured strategy, because mtime is not meaningful across machines.
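The three strategies differ only in when file content is hashed. The sketch below is one plausible reading of the table, not OxyMake's code: the record layout and function names are invented, and `hashlib.blake2b` again stands in for BLAKE3. It simulates a `touch` by bumping the mtime without changing content.

```python
import hashlib
import os
import tempfile

def file_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.blake2b(f.read()).hexdigest()  # stand-in for BLAKE3

def record(path: str) -> dict:
    """Snapshot the metadata and content hash seen at the last successful run."""
    st = os.stat(path)
    return {"mtime_ns": st.st_mtime_ns, "size": st.st_size, "hash": file_hash(path)}

def is_unchanged(path: str, recorded: dict, strategy: str = "mtime+hash") -> bool:
    st = os.stat(path)
    same_meta = (st.st_mtime_ns, st.st_size) == (recorded["mtime_ns"], recorded["size"])
    if strategy == "mtime":
        return same_meta          # metadata only, content never read
    if strategy == "mtime+hash" and same_meta:
        return True               # fast path: skip hashing
    return file_hash(path) == recorded["hash"]  # "hash", or metadata differed

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.csv")
    with open(path, "w") as f:
        f.write("alpha,1\n")
    recorded = record(path)
    # Simulate `touch`: mtime moves 10 s forward, content stays the same.
    later = recorded["mtime_ns"] + 10 * 10**9
    os.utime(path, ns=(later, later))
    results = {s: is_unchanged(path, recorded, s)
               for s in ("mtime", "mtime+hash", "hash")}

print(results)
```

With pure `mtime` the touched file looks changed and would trigger a re-run, while the two content-verifying strategies recognize it as unchanged.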

The Cache on Disk

Cached outputs live in .oxymake/cache/, organized by hash prefix:

.oxymake/cache/
  a3/
    a3f7b2c1...   # cached output file
  b1/
    b1e9d4a8...   # another cached output

This directory is independent of the SQLite state database. You can share it across machines, back it up, or delete it without losing execution state (jobs will simply re-run and repopulate the cache).
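The layout shown above, a directory named after the first two hex characters with the full digest as the file name, can be expressed as a small path helper. `cache_path` is a hypothetical name, and the digest used is a shortened placeholder rather than a real hash.

```python
from pathlib import PurePosixPath

def cache_path(digest_hex: str, root: str = ".oxymake/cache") -> PurePosixPath:
    """Place an entry under a directory named after its two-character prefix."""
    if len(digest_hex) < 3:
        raise ValueError("digest too short to shard by prefix")
    return PurePosixPath(root) / digest_hex[:2] / digest_hex

# Shortened placeholder digest, for illustration only.
path = cache_path("a3f7b2c1")
print(path)
```

Sharding by prefix keeps any single directory from accumulating every cache entry, a common layout for content-addressed stores.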

Sharing Across Machines

Because the cache key is deterministic -- same inputs, same rule, same environment, same platform produce the same key -- you can share cached outputs via S3, GCS, or any shared filesystem:

# Production: everything cached locally
ox run

# CI: pull from shared remote cache
ox run --cache-remote s3://my-bucket/oxymake-cache

For remote caches, OxyMake adds a trust_scope to prevent cache poisoning: cached outputs from untrusted branches cannot be used by production builds.

Cache and Materialization

The cache interacts with the materialization policy:

| Policy | Written to disk? | Cached? |
|--------|------------------|---------|
| always (default) | Yes | Yes |
| auto | Only if needed | Yes, when materialized |
| never | No (memory only) | No |
| final | Only if DAG leaf | Yes, when materialized |

Outputs with materialize = "never" are kept in memory and never enter the cache. This is a deliberate trade-off: you get speed at the cost of reproducibility. The next ox run will recompute them.

Managing the Cache

# See cache size
ox gc --dry-run

# Limit cache to 10 GB (removes oldest entries)
ox gc --max-cache-size 10G

# Remove all cached outputs
ox clean --cache

Why This Matters

The content-addressable cache means you can:

  1. Switch branches freely without phantom re-runs
  2. Add new rules without invalidating existing cached results
  3. Share computation across machines and CI
  4. Resume interrupted runs -- completed work is preserved
  5. Trust the result -- if OxyMake says "cached," the output is bit-for-bit identical to what a fresh run would produce

Materialization Policy

When a call-mode rule produces an output, does it need to be written to disk? Not always. OxyMake lets you control this with the materialization policy, enabling significant speedups for workflows where intermediate outputs are only consumed by other call-mode rules.

The Four Policies

| Policy | Behavior |
|--------|----------|
| always | (default) Write to disk after every job. Reproducible, cacheable. |
| auto | Write to disk only if a downstream job needs a file (not a call peer) |
| never | Keep in memory only. Lost if the process dies. Not cached. |
| final | Write to disk only if this output is a leaf of the DAG (a final result) |

Declaring Materialization

Set the policy on individual outputs:

[rule.compute_features]
# Inline tables must stay on one line in TOML 1.0
output = [{ path = "features/{sample}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline.features:compute_features"
lang = "python"

[rule.train_model]
# Inline tables must stay on one line in TOML 1.0
output = [{ path = "models/{sample}.pkl", format = "pickle", materialize = "always" }]
call = "pipeline.model:train"
lang = "python"

In this example, the features DataFrame is only written to disk if a non-call downstream rule needs it as a file. The model is always saved.

Setting the Policy per Output

Materialization is declared per output in the Oxymakefile, on the structured output form:

[rule.compute_features]
# ...
output = [
    { path = "data/features.parquet", materialize = "auto" },
]

Valid values are auto (the default — write to disk only when a downstream file consumer needs it), never (keep in memory; no disk, no caching), final (write only leaf outputs), and always (write and cache every output). There is no global ox run flag that overrides the policy today; control it in the Oxymakefile per output.

Guidance for development workflows:

  • During prototyping, set materialize = "never" on intermediate outputs to iterate fast
  • For production, use the default auto (or always) for full caching and reproducibility
  • For presentations or reports, set leaf outputs to final

How It Works with call Mode

When two consecutive rules both use call mode on the local executor, OxyMake can pass data directly in memory:

compute_features  ──[DataFrame in memory]──>  train_model
     (call)                                      (call)

No file is written between them. The format field tells OxyMake how to serialize the data if materialization is needed later (e.g., for caching or for a shell-mode downstream rule).

The Flow

  1. compute_features runs and returns a DataFrame
  2. If materialize = "auto" and the next consumer is also call mode: pass the DataFrame directly in memory
  3. If materialize = "auto" and the next consumer is shell mode: write the DataFrame to disk using the parquet codec
  4. If materialize = "always": always write to disk (and cache)
  5. If materialize = "never": never write to disk (no cache)
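
The flow above boils down to a small decision function. The following Python sketch is one reading of those rules, not OxyMake's implementation: the name `should_write_to_disk` and its parameters are hypothetical. The text does not say what `auto` does for an output with no consumers, so the sketch simply returns `False` in that case.

```python
def should_write_to_disk(policy, consumer_modes, is_leaf):
    """Decide whether a call-mode output is written to disk.

    policy         -- "always", "auto", "never", or "final"
    consumer_modes -- execution modes of the downstream rules reading the output
    is_leaf        -- True if no downstream rule consumes the output
    """
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "final":
        return is_leaf
    if policy == "auto":
        # Write only when some consumer needs a real file (anything but call mode).
        return any(mode != "call" for mode in consumer_modes)
    raise ValueError(f"unknown materialization policy: {policy!r}")


print(should_write_to_disk("auto", ["call"], is_leaf=False))   # False
print(should_write_to_disk("auto", ["shell"], is_leaf=False))  # True
print(should_write_to_disk("final", [], is_leaf=True))         # True
```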

Constraints

Not everything supports non-always materialization:

  • shell, run, and script modes always materialize. They manage their own I/O and need real files.
  • Distributed executors (SLURM, K8s) force materialization because jobs run on separate machines.
  • Non-materialized outputs are not cached. If the process dies or you restart, they will be recomputed. This is an explicit trade-off: speed vs. reproducibility.

The --materialize Flag

The CLI flag sets the floor for materialization:

| Flag value | Effect |
| --- | --- |
| always | All outputs written to disk (default behavior) |
| auto | Per-output policy respected |
| never | No outputs written (memory only, for testing) |
| final | Only DAG-leaf outputs written |

Practical Example

Consider a three-stage pipeline:

[rule.load_data]
output = [{ path = "data/{s}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline:load_data"
lang = "python"

[rule.compute_features]
output = [{ path = "features/{s}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline:compute_features"
lang = "python"

[rule.generate_report]
output = [{ path = "reports/{s}.html", materialize = "always" }]
call = "pipeline:generate_report"
lang = "python"

With ox run --materialize=final:

  • load_data output: kept in memory (not a leaf)
  • compute_features output: kept in memory (not a leaf)
  • generate_report output: written to disk (it is a leaf)

Only the final HTML report touches the filesystem. The intermediate parquet files exist only in memory during execution.

Tags and Filtering

Tags let you organize rules into logical groups and selectively run subsets of your workflow.

Assigning Tags

Add tags to any rule in your Oxymakefile.toml:

[rule.align]
input = ["data/{sample}.fastq"]
output = ["aligned/{sample}.bam"]
shell = "bwa mem ref.fa {input} | samtools sort > {output}"
tags = ["alignment", "compute-heavy"]

[rule.qc]
input = ["aligned/{sample}.bam"]
output = ["qc/{sample}_report.html"]
shell = "fastqc {input} -o qc/"
tags = ["qc", "fast"]

Filtering by Tag

Run only jobs matching a tag:

ox run --tag alignment        # Only alignment jobs
ox run --tag qc               # Only QC jobs
ox run --tag compute-heavy    # Only compute-heavy jobs

Exclude jobs by tag:

ox run --exclude-tag slow     # Skip slow jobs

Tag-Based DAG Views

Tags integrate with the DAG visualization:

ox dag --group-by tag         # Group nodes by tag in the DAG view
ox plan --tag alignment       # Show plan for alignment jobs only

Hierarchical Organization

Use dotted tag names for hierarchy:

tags = ["pipeline.alignment", "resource.gpu"]

This enables filtering at different levels:

ox run --tag "pipeline.*"         # All pipeline stages
ox run --tag "resource.gpu"       # Only GPU jobs
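
To make the filtering semantics concrete, here is a small Python sketch of tag selection. It assumes that `--tag` patterns behave like shell-style globs, which is a guess based on the `"pipeline.*"` example. The `select_rules` helper and the sample rule names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_rules(rule_tags, include=None, exclude=None):
    """Return rule names whose tags match `include` and do not match `exclude`."""
    selected = []
    for name, tags in rule_tags.items():
        if include is not None and not any(fnmatchcase(t, include) for t in tags):
            continue
        if exclude is not None and any(fnmatchcase(t, exclude) for t in tags):
            continue
        selected.append(name)
    return selected


rule_tags = {
    "align": ["pipeline.alignment", "resource.gpu"],
    "qc": ["pipeline.qc"],
    "download": ["io"],
}
print(select_rules(rule_tags, include="pipeline.*"))    # ['align', 'qc']
print(select_rules(rule_tags, include="resource.gpu"))  # ['align']
print(select_rules(rule_tags, exclude="resource.*"))    # ['qc', 'download']
```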

Use Cases

  • Selective re-runs: Re-run only QC after parameter changes
  • Resource-based scheduling: Tag GPU vs CPU jobs for different executors
  • Stage grouping: Organize large workflows into logical phases
  • Development iteration: Run only the stage you are working on

Execution Modes

OxyMake supports four ways to execute a rule, forming a spectrum from maximum flexibility to maximum optimization. All four modes coexist in the same workflow -- you pick the right one for each rule.

The Spectrum

shell       Opaque, files only, maximum flexibility
run         Inline script, files only, author manages I/O
script      External script, files only, author manages I/O
call        Pure function, files OR memory, OxyMake manages I/O

As you move from shell to call, OxyMake gains more optimization power (in-memory data passing, task fusion, automatic serialization) -- but you give up direct control over I/O.

Mode 1: shell -- Command Line

The most flexible mode. You write a shell command, and OxyMake interpolates file paths into it.

[rule.align]
input = ["data/{sample}.fastq", "refs/genome.fa"]
output = ["results/{sample}.bam"]
shell = "bwa mem -t {resources.cpu} {input[1]} {input[0]} > {output}"
resources = { cpu = 8 }

Use shell when you are wrapping an existing command-line tool. OxyMake treats the command as a black box -- it just passes file paths and checks that outputs were created.
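
The placeholder syntax in the command above (`{input[1]}`, `{resources.cpu}`, `{output}`) happens to match Python's `str.format` field syntax, which makes the interpolation step easy to illustrate. This is only a sketch of the idea: OxyMake is written in Rust and its actual template engine may differ. `PathList` and `render_command` are hypothetical names.

```python
from types import SimpleNamespace


class PathList(list):
    """A list of paths that renders as a space-separated string."""

    def __str__(self):
        return " ".join(self)


def render_command(template, inputs, outputs, resources):
    """Fill a shell template with concrete paths and resource values."""
    return template.format(
        input=PathList(inputs),                  # supports {input} and {input[0]}
        output=PathList(outputs),                # supports {output}
        resources=SimpleNamespace(**resources),  # supports {resources.cpu}
    )


cmd = render_command(
    "bwa mem -t {resources.cpu} {input[1]} {input[0]} > {output}",
    inputs=["data/A.fastq", "refs/genome.fa"],
    outputs=["results/A.bam"],
    resources={"cpu": 8},
)
print(cmd)  # bwa mem -t 8 refs/genome.fa data/A.fastq > results/A.bam
```

The `{sample}` wildcard is assumed to be resolved to `A` before this step.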

Mode 2: run -- Inline Script

Write a short script directly in the Oxymakefile. OxyMake interpolates {input} and {output} as file paths.

[rule.analyze]
input = ["data/{sample}.csv"]
output = ["results/{sample}.json"]
lang = "python"
run = """
import pandas as pd
import json
df = pd.read_csv("{input}")
stats = df.describe().to_dict()
with open("{output}", "w") as f:
    json.dump(stats, f)
"""

Use run for rapid prototyping -- when the logic is short enough to live in the workflow file. You manage all file I/O yourself.

Mode 3: script -- External Script

Like run, but the code lives in a separate file. Keeps the Oxymakefile clean when scripts are long.

[rule.transform]
input = ["data/{sample}.parquet"]
output = ["results/{sample}.parquet"]
script = "scripts/transform.py"
environment = { uv = "pyproject.toml" }

The script receives file paths via command-line arguments or environment variables.

Mode 4: call -- Pure Function

The key innovation. Your function receives objects, not file paths, and returns objects. OxyMake handles all I/O outside the function.

[rule.compute_features]
input = [{ path = "data/{sample}.parquet", format = "parquet" }]
output = [{ path = "features/{sample}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline.features:compute_features"
lang = "python"

The Python function is pure:

import polars as pl

def compute_features(df: pl.DataFrame) -> pl.DataFrame:
    return df.with_columns(
        mean_depth=pl.col("depth").rolling_mean(20),
        depth_std=pl.col("depth").rolling_std(60),
    )

The function never reads or writes files. OxyMake:

  1. Reads the input file using the parquet codec, producing a DataFrame
  2. Calls compute_features(df) and receives the result
  3. Writes the result to disk using the parquet codec (if materialization policy requires it)

In memory mode (when both upstream and downstream are call rules on the local executor), step 1 receives the DataFrame directly from the upstream job and step 3 passes it directly to the downstream job -- zero disk I/O.
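
The three steps can be sketched as a wrapper around the user's function. The Python below uses a JSON codec from the standard library in place of parquet so that it runs without extra dependencies. The `CODECS` registry, `run_call_job`, and `double_depths` are hypothetical names, not OxyMake APIs.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical codec registry: format name -> (reader, writer).
CODECS = {
    "json": (
        lambda path: json.loads(Path(path).read_text()),
        lambda obj, path: Path(path).write_text(json.dumps(obj)),
    ),
}


def run_call_job(func, input_path, output_path, fmt, write_output=True):
    """Decode the input, call the pure function, optionally encode the result."""
    read, write = CODECS[fmt]
    data = read(input_path)   # step 1: file -> object
    result = func(data)       # step 2: pure function call
    if write_output:          # step 3: object -> file, only if the policy requires it
        write(result, output_path)
    return result


def double_depths(record):
    """A pure function: no file access, object in, object out."""
    return {"depth": [2 * d for d in record["depth"]]}


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.json"
    dst = Path(tmp) / "out.json"
    src.write_text(json.dumps({"depth": [1, 2, 3]}))
    result = run_call_job(double_depths, src, dst, "json")
    written = json.loads(dst.read_text())
print(result, written)
```

In memory mode the same function would be called with `write_output=False` and the returned object handed straight to the next job.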

Named Arguments

For functions with multiple inputs, use named inputs:

[rule.train_model]
input = { features = "features/{sample}.parquet", config = "configs/model.yaml" }
output = { model = "models/{sample}.pkl" }
call = "pipeline.model:train"
lang = "python"

def train(features: pl.DataFrame, config: dict) -> Model:
    ...

The input keys (features, config) map to function parameter names.
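
One way to picture this mapping is a keyword-argument call on a function resolved from the `module:function` string. The Python sketch below assumes that reading. `resolve_callable` and `bind_inputs` are hypothetical helpers, and the `train` body is a stand-in.

```python
import importlib
import inspect


def resolve_callable(spec):
    """Turn a "module.path:function" string into the function object."""
    module_name, func_name = spec.split(":", 1)
    return getattr(importlib.import_module(module_name), func_name)


def bind_inputs(func, named_inputs):
    """Check that every input key is a parameter of func, then call by keyword."""
    params = inspect.signature(func).parameters
    unknown = set(named_inputs) - set(params)
    if unknown:
        raise TypeError(f"inputs {sorted(unknown)} do not match any parameter")
    return func(**named_inputs)


def train(features, config):
    # Stand-in for the real training function.
    return {"n_rows": len(features), "lr": config["lr"]}


print(bind_inputs(train, {"features": [1, 2, 3], "config": {"lr": 0.1}}))
print(resolve_callable("os.path:basename")("models/A.pkl"))  # A.pkl
```

Validating the keys before the call gives a clear error when an input name in the Oxymakefile does not match the function signature.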

When to Use Each Mode

| Situation | Recommended mode |
| --- | --- |
| Wrapping an existing CLI tool | shell |
| Quick one-off analysis | run |
| Reusable script, too long for inline | script |
| Pure data transformation, wants optimization | call |
| Prototyping (will refactor later) | run then migrate to call |

The Migration Path

The natural evolution of a rule:

  1. Start with run: Write inline code during exploration
  2. Extract to script: When the code gets long, move it to a file
  3. Refactor to call: When the function stabilizes, make it pure and let OxyMake manage I/O

Each step is backward-compatible -- the outputs are the same files. The cache key changes (because the rule source changes), so the first run after migration will recompute, but subsequent runs benefit from the optimization.

Interaction with Executors

| Mode | Local executor | SLURM/K8s executor |
| --- | --- | --- |
| shell | Subprocess | Remote submission |
| run | Subprocess | Remote submission |
| script | Subprocess | Remote submission |
| call (memory) | In-process via Arrow IPC | Forced to materialize |
| call (file) | Subprocess + codec | Remote submission + codec |

Distributed executors (SLURM, K8s) cannot pass objects in memory between machines, so they automatically force call mode to materialize. Your workflow does not need to change -- OxyMake handles this transparently.

Environments

Real-world workflows need specific software packages, library versions, and runtime configurations. OxyMake supports multiple environment backends that isolate each rule's execution in a reproducible environment.

Declaring an Environment

Add an environment field to any rule:

[rule.analyze]
input = ["data/{sample}.csv"]
output = ["results/{sample}.json"]
lang = "python"
environment = { uv = "pyproject.toml" }
run = """
import pandas as pd
df = pd.read_csv("{input}")
df.describe().to_json("{output}")
"""

The environment is resolved at execution time. OxyMake ensures the environment is set up before the rule runs.

Supported Backends

uv (Python)

The recommended backend for Python workflows. Uses uv to create and manage virtual environments from a pyproject.toml or requirements.txt.

environment = { uv = "pyproject.toml" }

OxyMake calls uv sync to ensure the environment matches the lockfile. The environment hash (from uv.lock) is included in the cache key, so changing a dependency invalidates affected outputs.

conda

For workflows that need non-Python packages (C libraries, R, etc.):

environment = { conda = "environment.yaml" }

OxyMake creates or updates a conda environment from the YAML specification.

Docker / OCI Containers

For maximum isolation and reproducibility:

environment = { docker = "python:3.11-slim" }

The job runs inside a container. OxyMake mounts the workspace and handles input/output file staging. The image digest is included in the cache key.

Nix

For fully reproducible builds with Nix:

environment = { nix = "flake.nix#devShell" }

Apptainer (Singularity)

For HPC environments where Docker is unavailable:

environment = { apptainer = "image.sif" }

System (default)

No isolation. Uses whatever Python/R/tools are on $PATH:

environment = { system = true }

This is the default when no environment is specified. Suitable for shell-mode rules that call system utilities.

How Isolation Works

Each environment backend follows the same lifecycle:

  1. Resolve: Determine the exact environment specification (lockfile hash, image digest, flake hash)
  2. Prepare: Create or update the environment if needed (uv sync, docker pull, conda env create)
  3. Execute: Run the job inside the environment
  4. Hash: Include the environment specification hash in the cache key

The key insight is step 4: the environment specification is part of the cache key. If you update a dependency in pyproject.toml and the lockfile changes, all rules using that environment will be recomputed.
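
A minimal sketch of why a lockfile change invalidates cached outputs: if the environment specification is one of the ingredients hashed into the cache key, any change to it yields a different key. The exact ingredients and their encoding are assumptions. `cache_key` is a hypothetical helper, and SHA-256 stands in for the hash function so the example needs only the standard library.

```python
import hashlib


def cache_key(rule_source, input_hashes, env_spec):
    """Combine rule source, input content hashes, and environment spec into one key."""
    h = hashlib.sha256()  # stand-in hash for illustration
    for part in [rule_source, *sorted(input_hashes), env_spec]:
        data = part.encode()
        h.update(len(data).to_bytes(8, "big"))  # length prefix keeps parts unambiguous
        h.update(data)
    return h.hexdigest()


rule = 'run = "df.describe()"'
inputs = ["hash-of-data-A"]
key_before = cache_key(rule, inputs, env_spec="uv.lock: pandas==2.1.0")
key_after = cache_key(rule, inputs, env_spec="uv.lock: pandas==2.2.0")
print(key_before != key_after)  # True
```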

Mixing Environments

Different rules can use different environments in the same workflow:

[rule.download]
environment = { system = true }
shell = "wget {url} -O {output}"

[rule.analyze]
environment = { uv = "pyproject.toml" }
call = "analysis:run"
lang = "python"

[rule.visualize]
environment = { conda = "envs/plotting.yaml" }
script = "scripts/plot.R"

OxyMake manages each environment independently. There is no requirement that all rules share the same environment.

Environment and Executors

| Executor | Environment handling |
| --- | --- |
| Local | Environment resolved on the local machine |
| SLURM | Environment must be available on compute nodes |
| K8s | Docker image used as the pod container |
| Ray | Environment resolved on Ray worker nodes |

For SLURM, ensure that conda environments or uv projects are accessible from the compute nodes (e.g., on a shared filesystem).

Environment Caching

Environment setup can be slow (minutes for large conda environments). OxyMake caches the prepared environment and only re-creates it when the specification changes:

  • uv: Rebuilds when uv.lock changes
  • conda: Rebuilds when environment.yaml changes
  • Docker: Re-pulls when the image tag resolves to a new digest
  • Nix: Rebuilds when the flake lock changes

This means the first run may be slow (environment setup), but subsequent runs reuse the prepared environment instantly.

Executors

OxyMake separates what to run (rules, DAG) from where to run it (executors). The same workflow runs on a laptop or a thousand-node cluster with zero changes -- just switch the --executor flag.

Available Executors

| Executor | Flag | Backend | GPU | Memory Passing |
| --- | --- | --- | --- | --- |
| Local | --executor local (default) | Tokio thread pool | OS-level | Same-process |
| SLURM | --executor slurm | sbatch / sacct | GRES | Shared filesystem |
| Ray | --executor ray | Ray Jobs API | First-class | Object store (zero-copy) |
| Kubernetes | --executor k8s | kube-rs (planned) | Device plugin | -- |

Local Executor

The default. Runs jobs as subprocesses on the local machine.

ox run                # single job at a time
ox run -j 8           # 8 parallel jobs

Best for development, small pipelines, and single-node execution.

SLURM Executor

Submits jobs to an HPC cluster via sbatch and polls status with sacct.

ox run --executor slurm

Features:

  • Job arrays for wildcard expansions
  • GPU scheduling via GRES
  • Resource mapping: cpu, mem, gpu map to SLURM --cpus-per-task, --mem, --gres=gpu:N
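
The resource mapping in the last bullet can be sketched as a small translation function. The `sbatch` flags are the ones named above. The `sbatch_flags` helper and the choice to pass `mem` as a string with a unit suffix are assumptions for illustration.

```python
def sbatch_flags(resources):
    """Translate an OxyMake-style resources table into sbatch flags."""
    flags = []
    if "cpu" in resources:
        flags.append(f"--cpus-per-task={resources['cpu']}")
    if "mem" in resources:
        flags.append(f"--mem={resources['mem']}")  # e.g. "16G"; unit handling assumed
    if resources.get("gpu"):
        flags.append(f"--gres=gpu:{resources['gpu']}")
    return flags


print(sbatch_flags({"cpu": 8, "mem": "16G", "gpu": 2}))
# ['--cpus-per-task=8', '--mem=16G', '--gres=gpu:2']
```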

Ray Executor

Submits jobs to a Ray cluster via the Ray Jobs API. Ray provides elastic distributed execution with a shared object store for fast intermediate data passing.

Setup

Start a Ray head node (or connect to an existing cluster):

ray start --head
# Dashboard: http://127.0.0.1:8265

Run the workflow:

ox run --executor ray

Configuration

Configure the Ray executor in .oxymake/config.toml or Oxymakefile.toml:

[executor.ray]
dashboard_address = "http://127.0.0.1:8265"
working_dir = "/shared/oxymake"
poll_interval_min = "2s"
poll_interval_max = "30s"
max_submit = 10

| Setting | Default | Description |
| --- | --- | --- |
| dashboard_address | http://127.0.0.1:8265 | Ray dashboard URL |
| working_dir | . | Staging directory on shared filesystem |
| poll_interval_min | 2s | Minimum status polling interval |
| poll_interval_max | 30s | Maximum status polling interval |
| max_submit | unlimited | Max concurrent job submissions |
| autoscaler_aware | false | Query cluster capacity before submitting |

Resource Mapping

| OxyMake | Ray | Notes |
| --- | --- | --- |
| cpu | num_cpus | Direct mapping |
| mem | memory | Bytes |
| gpu | num_gpus | Fractional GPUs supported (gpu = 0.5) |
| custom:* | Custom resources | Arbitrary Ray custom resources |

Memory Passing

When two consecutive call-mode rules run on the Ray executor, data passes through Ray's object store without disk writes. OxyMake's materialization policies map to Ray behavior:

| Policy | Ray Behavior |
| --- | --- |
| always | Write to shared FS + object store |
| auto | Object store only (materialized if downstream needs file) |
| never | Object store only, evicted after consumers finish |
| final | Object store, written to shared FS only for DAG leaves |

Execution Modes

The Ray executor supports all four execution modes:

  • shell -- commands run as Ray job entrypoints
  • run -- inline scripts submitted as Ray jobs
  • script -- external scripts submitted as Ray jobs
  • call -- Python functions with object store integration

Choosing an Executor

| Use Case | Recommended Executor |
| --- | --- |
| Development / CI | Local |
| HPC cluster (static allocation) | SLURM |
| Cloud / elastic GPU clusters | Ray |
| ML pipelines with in-memory passing | Ray |
| Kubernetes-native environments | K8s (planned) |

Mixed-Executor DAGs

OxyMake owns the DAG; executors are job-dispatch backends. A future enhancement will allow per-rule executor assignment, enabling mixed-executor DAGs where some rules run locally and others dispatch to Ray or SLURM.

OxyMake × Ray Deep Dive

OxyMake and Ray solve different halves of the distributed compute problem. OxyMake owns the what: which jobs to run, in what order, and what can be skipped. Ray owns the where: which machine, which GPU, how many cores. This page explains how the two systems fit together.

The Three Graphs Meet Ray

Before any executor sees a job, OxyMake transforms the user's declarations through three graph representations. Understanding this pipeline is essential for understanding what Ray actually receives.

Graph Transformation Pipeline

flowchart TD
    A["Oxymakefile.toml<br/><i>Declarative TOML</i>"] --> B["RuleGraph<br/><i>Abstract: wildcards intact</i>"]
    B -->|"Wildcard resolution<br/>+ guard evaluation"| C["JobGraph<br/><i>Concrete: every job instance</i>"]
    C -->|"Optimization passes"| D["Optimized JobGraph"]
    D -->|"Cache pruning removes<br/>up-to-date jobs"| E["Uncached Subgraph"]
    E -->|"generate_driver()"| F["Python Driver Script<br/><i>@ray.remote tasks +<br/>ObjectRef chaining</i>"]
    F -->|"Ray Jobs API<br/>POST /api/jobs/"| G["Ray Cluster<br/><i>Distributed execution</i>"]

    style A fill:#f9f,stroke:#333
    style F fill:#ff9,stroke:#333
    style G fill:#9ff,stroke:#333

RuleGraph — What You Wrote

The RuleGraph is the abstract view: each rule is a node, wildcards are unresolved. A single features rule represents ALL feature instances.

data ──→ features ──→ call ──→ annotate

JobGraph — What Will Execute

After wildcard resolution, each concrete job is a separate node. With 3 cohorts and 4 windows, a single features rule becomes 12 concrete jobs. The JobGraph is bipartite — job nodes and output nodes alternate:

graph LR
    subgraph "Bipartite JobGraph"
        J1["job: align-A"] -->|produces| O1["output: results/A.bam"]
        O1 -->|consumed by| J2["job: sort-A"]
        J2 -->|produces| O2["output: results/A.sorted.bam"]

        J3["job: align-B"] -->|produces| O3["output: results/B.bam"]
        O3 -->|consumed by| J4["job: sort-B"]
        J4 -->|produces| O4["output: results/B.sorted.bam"]
    end
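
The step from one rule to many jobs is a Cartesian product over the wildcard values. The Python sketch below reproduces the 3 × 4 = 12 arithmetic. The `expand` helper, the path pattern, and the cohort and window values are made up for illustration.

```python
from itertools import product


def expand(pattern, **wildcards):
    """Instantiate a pattern once per combination of wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*wildcards.values())
    ]


jobs = expand(
    "features/{cohort}_{window}.parquet",
    cohort=["c1", "c2", "c3"],
    window=[5, 10, 20, 60],
)
print(len(jobs))  # 12
print(jobs[0])    # features/c1_5.parquet
```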

Optimization Passes

Before any executor sees the graph, OxyMake runs optimization passes:

| Pass | Effect |
| --- | --- |
| Cache pruning | Marks up-to-date jobs as "skip" |
| Task fusion | Merges sequential call-mode jobs into one |
| Materialization elimination | Removes unnecessary disk I/O |
| Critical path analysis | Annotates the longest chain for priority |

These passes run internally. ox plan then reports the jobs that remain after pruning, in the standard plan format. For a large, mostly cached pipeline, the summary line looks like this:

Plan: 12 rules, 847 jobs, 1203 source files

Only the uncached subgraph is sent to Ray.
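
Of the passes listed above, critical path analysis is the most algorithmic: it finds the longest chain of dependent jobs. The Python sketch below shows the standard longest-path computation on a DAG in topological order. The job names are borrowed from the examples on this page, the durations are invented, and `critical_path` is a hypothetical helper rather than OxyMake's code.

```python
from graphlib import TopologicalSorter


def critical_path(deps, cost):
    """deps maps job -> set of upstream jobs; cost maps job -> estimated duration."""
    best = {}  # job -> (cost of the longest chain ending at job, predecessor on it)
    for job in TopologicalSorter(deps).static_order():
        upstream = [(best[u][0], u) for u in deps.get(job, ())]
        prev_cost, prev = max(upstream, default=(0, None))
        best[job] = (prev_cost + cost[job], prev)
    # Walk back from the job with the most expensive chain.
    node = max(best, key=lambda j: best[j][0])
    path = []
    while node is not None:
        path.append(node)
        node = best[node][1]
    return path[::-1]


deps = {"sort-A": {"align-A"}, "train": {"sort-A", "align-B"}}
cost = {"align-A": 30, "sort-A": 10, "align-B": 25, "train": 60}
print(critical_path(deps, cost))  # ['align-A', 'sort-A', 'train']
```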

Ray Job Packaging

Why One Ray Job, Not N

OxyMake could submit each task as a separate Ray job. Instead, it generates a single Python driver script that encodes the entire uncached DAG as @ray.remote tasks with ObjectRef dependency chaining.

flowchart LR
    subgraph "OxyMake (Rust)"
        A["Optimized JobGraph<br/>847 uncached jobs"] -->|"driver_script.rs<br/>generate_driver()"| B["driver.py<br/>~500 lines"]
    end

    subgraph "Ray Cluster"
        B -->|"Jobs API<br/>1 submission"| C["Ray Driver Process"]
        C --> D["@ray.remote task 1"]
        C --> E["@ray.remote task 2"]
        C --> F["@ray.remote task 3"]
        C --> G["..."]
        C --> H["@ray.remote task N"]
        D -.->|ObjectRef| E
        D -.->|ObjectRef| F
        E -.->|ObjectRef| H
        F -.->|ObjectRef| H
    end

    style B fill:#ff9,stroke:#333
    style C fill:#9ff,stroke:#333

Benefits of single-job packaging:

| Benefit | Why |
| --- | --- |
| Fire-and-forget | Submit once, Ray handles all scheduling |
| ObjectRef chaining | Upstream outputs become implicit dependencies |
| Ray parallelism | Ray's internal scheduler optimizes task placement |
| Cascading cancel | ray job stop cascades to all tasks |
| Dashboard visibility | One job with N tasks and a colored progress bar |
| Reduced API load | One HTTP submission instead of hundreds |

Generated Driver Structure

The Rust code in ox-exec-ray/src/driver_script.rs generates Python that looks like this:

import ray
import subprocess
import time
import json

ray.init()

@ray.remote
def run_shell(job_id, command, work_dir, *deps):
    """Run a shell command. *deps are ObjectRefs — Ray waits for them."""
    result = subprocess.run(command, shell=True, cwd=work_dir, ...)
    if result.returncode != 0:
        raise RuntimeError(f"Job {job_id} failed")
    return result.returncode

@ray.remote
def run_call(job_id, module, func_name, *deps):
    """Run a call-mode function with object store integration."""
    # ray.get() inputs from object store
    # invoke function
    # ray.put() outputs back to object store
    ...

# --- DAG encoded as ObjectRef chain ---
# Topological order, upstream refs passed as implicit dependencies

ref_0 = run_shell.options(num_cpus=8).remote(
    "align-A", "bwa mem ...", "/project"
)
ref_1 = run_shell.options(num_cpus=2).remote(
    "sort-A", "samtools sort ...", "/project",
    ref_0  # ← dependency: Ray won't start until ref_0 completes
)
ref_2 = run_shell.options(num_cpus=8).remote(
    "align-B", "bwa mem ...", "/project"
)
ref_3 = run_call.options(num_cpus=4, num_gpus=1).remote(
    "train", "pipeline.model", "train",
    ref_1, ref_2  # ← depends on both sort-A and align-B
)

# --- Collect results ---
results = {}
for ref, job_id in [(ref_0, "align-A"), (ref_1, "sort-A"), (ref_2, "align-B"), (ref_3, "train")]:
    try:
        ray.get(ref)
        results[job_id] = {"status": "completed"}
    except Exception as e:
        results[job_id] = {"status": "failed", "error": str(e)}

# Write manifest for ox status
with open("results.json", "w") as f:
    json.dump(results, f)

The Ray dashboard shows this as 1 job with a task-level progress bar:

Ray Dashboard → Jobs → raysubmit_abc123
  Tasks: ████████░░░░░░  127/847 (15%)
  Running: 16  |  Pending: 704  |  Completed: 127

Call Mode and the Ray Object Store

This is where OxyMake and Ray truly complement each other. In call mode, OxyMake manages I/O outside the function — and on the Ray executor, that I/O goes through Ray's distributed object store instead of disk.

Data Flow: Shell vs Call vs Ray-Call

flowchart TB
    subgraph "Shell Mode (any executor)"
        S1["Job A"] -->|"write file<br/>results/A.csv"| SD[("Disk")]
        SD -->|"read file<br/>results/A.csv"| S2["Job B"]
    end

    subgraph "Call Mode (local executor)"
        C1["Job A<br/><i>compute_features(df)</i>"] -->|"Arrow IPC<br/>in-process"| C2["Job B<br/><i>train_model(features)</i>"]
    end

    subgraph "Call Mode (Ray executor)"
        R1["Job A<br/><i>@ray.remote</i>"] -->|"ray.put()<br/>→ object store"| RO[("Ray Object<br/>Store")]
        RO -->|"ray.get()<br/>zero-copy"| R2["Job B<br/><i>@ray.remote</i>"]
    end

    style SD fill:#fcc,stroke:#333
    style RO fill:#cfc,stroke:#333

| Mode | Data between stages | Disk I/O | Best for |
| --- | --- | --- | --- |
| Shell (any executor) | Files on disk | Always | CLI tools, legacy scripts |
| Call (local executor) | Arrow IPC, in-process | Optional (materialization policy) | Single-node data pipelines |
| Call (Ray executor) | ray.put()/ray.get(), object store | Optional (materialization policy) | Distributed data pipelines |

How Ray Call Mode Works

When a call-mode job runs on the Ray executor, OxyMake generates a wrapper script (via call_mode.rs) that integrates with the object store:

sequenceDiagram
    participant D as Driver Script
    participant OS as Ray Object Store
    participant W as Worker (call-mode task)
    participant FS as Shared Filesystem

    D->>OS: ray.put(input_data)
    Note over D: ObjectRef stored

    D->>W: run_call.remote(job_id, module, func, input_ref)
    W->>OS: ray.get(input_ref)
    Note over W: Zero-copy if same node

    W->>W: result = func(input_data)

    W->>OS: ray.put(result)
    Note over W: ObjectRef returned

    alt materialize = "always" or "final" (leaf)
        W->>FS: write result to disk
    end

    D->>D: Pass ObjectRef to downstream tasks

Materialization Policies on Ray

OxyMake's materialization policies map directly to Ray behavior:

| Policy | Object Store | Disk Write | Use Case |
| --- | --- | --- | --- |
| always | Yes | Yes | Debugging, external tools need files |
| auto | Yes | Only if downstream needs a file | Default: let OxyMake decide |
| never | Yes (evicted after consumers finish) | No | Pure intermediates, save disk |
| final | Yes | Only for DAG leaves | Pipeline outputs to disk, intermediates in memory |

Example rule with materialization:

[rule.compute_features]
input = [{ path = "data/{sample}.parquet", format = "parquet" }]
output = [{ path = "features/{sample}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline.features:compute_features"
lang = "python"
resources = { cpu = 4, mem_gb = 8 }

With materialize = "auto" on the Ray executor, the features DataFrame lives in the Ray object store. If the next rule is also call-mode on Ray, data passes through the object store with zero disk I/O. If a downstream rule is shell-mode and needs a file path, OxyMake automatically materializes to disk.

The Bridge (ADR-008)

The ExecutorBridge trait formalizes the separation between OxyMake's scheduler and remote executors. It defines three communication directions:

flowchart LR
    subgraph "OxyMake (Rust)"
        S["Scheduler<br/><i>DAG owner, cache, gates</i>"]
        ST["ox status"]
    end

    subgraph "ExecutorBridge"
        direction TB
        SUB["SUBMIT<br/><i>submit_dag()</i><br/><i>map_resources()</i>"]
        MON["MONITOR<br/><i>poll_dag_status()</i><br/><i>fetch_logs()</i><br/><i>sync_results()</i><br/><i>reconnect()</i>"]
        CTL["CONTROL<br/><i>cancel_job()</i><br/><i>cancel_all()</i>"]
    end

    subgraph "Ray Cluster"
        R["Ray Jobs API<br/><i>Driver + tasks</i>"]
    end

    S -->|"uncached subgraph"| SUB
    SUB -->|"driver.py"| R
    R -->|"status, logs"| MON
    MON -->|"DagStatus"| ST
    S -->|"cancel"| CTL
    CTL -->|"ray job stop"| R

Separation of Concerns

| Concern | OxyMake (Scheduler) | Ray (Executor) |
| --- | --- | --- |
| DAG construction | Parses Oxymakefile, resolves wildcards | -- |
| Cache checking | Content-addressable (blake3) | -- |
| Optimization | Cache pruning, task fusion, critical path | -- |
| Scheduling order | Topological sort, priority, gates | -- |
| Task placement | -- | Which node, which GPU |
| Resource allocation | -- | CPU, memory, GPU scheduling |
| Autoscaling | -- | Scale workers up/down |
| Object store | -- | Zero-copy data passing |
| Fault tolerance | Retry strategy (OxyMake-managed) | Worker failure detection |

State Synchronization

After submission, OxyMake stays connected via the bridge:

  1. Submit: submit_dag() generates the driver script, submits to Ray Jobs API, writes meta.json to .oxymake/runs/{run_id}/
  2. Poll: poll_dag_status() queries Ray for per-task status, returns DagStatus with job-level completion info
  3. Sync: sync_results() writes job results (exit codes, durations, peak memory) back to OxyMake's state database
  4. Reconnect: After an OxyMake crash, reconnect() reads meta.json and reconstructs a handle to the still-running Ray job

The meta.json contract:

{
  "executor": "ray",
  "version": 1,
  "submitted_at": "2025-04-01T12:00:00Z",
  "connection": {
    "ray_address": "http://127.0.0.1:8265",
    "ray_job_id": "raysubmit_abc123"
  },
  "run_id": "run-20250401-120000",
  "total_jobs": 847,
  "active_jobs": 847,
  "skipped_jobs": 102582
}
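
Given this contract, the reconnect step amounts to reading the file back and extracting the connection details. The Python sketch below shows that read path using the field names from the example above. `load_ray_handle` is a hypothetical helper, and the validation rules are assumptions.

```python
import json


def load_ray_handle(meta_text):
    """Parse meta.json content and return (ray_address, ray_job_id)."""
    meta = json.loads(meta_text)
    if meta.get("executor") != "ray":
        raise ValueError(f"not a Ray run: executor={meta.get('executor')!r}")
    if meta.get("version") != 1:
        raise ValueError(f"unsupported meta.json version: {meta.get('version')!r}")
    conn = meta["connection"]
    return conn["ray_address"], conn["ray_job_id"]


meta_text = """
{
  "executor": "ray",
  "version": 1,
  "connection": {
    "ray_address": "http://127.0.0.1:8265",
    "ray_job_id": "raysubmit_abc123"
  },
  "run_id": "run-20250401-120000"
}
"""
print(load_ray_handle(meta_text))
# ('http://127.0.0.1:8265', 'raysubmit_abc123')
```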

Resource Mapping

OxyMake resources map to Ray resources via map_resources():

| OxyMake | Ray | Notes |
| --- | --- | --- |
| cpu | num_cpus | Direct mapping |
| mem | memory | Bytes |
| gpu | num_gpus | Fractional GPUs supported (gpu = 0.5) |
| custom:tpu | Custom resource TPU | Arbitrary Ray custom resources |

Ray's advantage: fractional GPUs (num_gpus=0.5) enable model serving workloads where multiple inference tasks share a single GPU.
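
As an illustration of the table above, the mapping can be written as a function that produces keyword arguments for Ray's `.options(...)` call (`num_cpus`, `num_gpus`, `memory`, `resources`). This is a Python sketch of the idea only. The real `map_resources()` lives in OxyMake's Rust code, and both the `ray_options` name and the way the custom resource name is derived are assumptions.

```python
def ray_options(resources):
    """Translate OxyMake-style resources into keyword arguments for .options()."""
    opts = {}
    custom = {}
    for name, value in resources.items():
        if name == "cpu":
            opts["num_cpus"] = value
        elif name == "mem":
            opts["memory"] = value  # bytes
        elif name == "gpu":
            opts["num_gpus"] = value  # may be fractional, e.g. 0.5
        elif name.startswith("custom:"):
            custom[name[len("custom:"):]] = value  # resource naming is assumed
        else:
            raise ValueError(f"unknown resource: {name!r}")
    if custom:
        opts["resources"] = custom
    return opts


print(ray_options({"cpu": 4, "mem": 8 * 1024**3, "gpu": 0.5, "custom:tpu": 1}))
# {'num_cpus': 4, 'memory': 8589934592, 'num_gpus': 0.5, 'resources': {'tpu': 1}}
```

The result could then be splatted into a task submission, for example `run_call.options(**opts).remote(...)`.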

Philosophy: Complementary, Not Overlapping

OxyMake and Ray solve orthogonal problems:

| Dimension | OxyMake | Ray |
| --- | --- | --- |
| Core question | What to run? | Where to run it? |
| Key innovation | Content-addressable cache | Distributed object store |
| Configuration | Declarative TOML | Python API / YAML |
| DAG model | Three-level (Rule → Job → Exec) | Flat task graph |
| Cache | blake3 content hashing | None (execution-only) |
| Scheduling | Topological + priorities + gates | Resource-based bin packing |
| State | Persistent (state.db, cache) | Ephemeral (cluster lifetime) |

Why Not Snakemake + Ray or Airflow + Ray?

Snakemake + Ray: Snakemake's file-based cache uses timestamps, not content hashes. It has no materialization policies, no call mode, and its Python DSL prevents static analysis. Adding Ray to Snakemake gives you distributed execution but not the optimization pipeline (task fusion, materialization elimination) that makes the combination powerful.

Airflow + Ray: Airflow is an orchestrator that owns the DAG schedule. Adding Ray as an executor gives you distributed compute, but Airflow's DAG model is runtime-defined Python, not declarative TOML. You cannot inspect or optimize an Airflow DAG before execution.

OxyMake + Ray: OxyMake's declarative format enables static analysis and optimization passes before execution. Ray provides elastic compute and zero-copy data passing during execution. Neither system steps on the other's responsibilities.

flowchart LR
    subgraph "OxyMake Responsibilities"
        direction TB
        A1["Parse Oxymakefile.toml"]
        A2["Resolve wildcards"]
        A3["Check content-addressable cache"]
        A4["Optimize: fuse, prune, eliminate"]
        A5["Generate driver script"]
        A1 --> A2 --> A3 --> A4 --> A5
    end

    subgraph "Ray Responsibilities"
        direction TB
        B1["Receive driver script"]
        B2["Schedule tasks on workers"]
        B3["Manage object store"]
        B4["Autoscale cluster"]
        B5["Report task status"]
        B1 --> B2 --> B3 --> B4 --> B5
    end

    A5 -->|"Ray Jobs API"| B1
    B5 -->|"poll_dag_status()"| A5

    style A5 fill:#ff9,stroke:#333
    style B1 fill:#9ff,stroke:#333

Quick Start

1. Start a Ray cluster

ray start --head
# Dashboard: http://127.0.0.1:8265

2. Configure OxyMake

# Oxymakefile.toml
[executor.ray]
dashboard_address = "http://127.0.0.1:8265"

3. Run your workflow on Ray

ox run --executor ray

OxyMake handles caching, DAG optimization, and driver generation. Ray handles task placement, GPU scheduling, and data passing. Your workflow file does not change.

4. Monitor execution

ox status                 # OxyMake's view (aggregated)
# or visit Ray Dashboard for task-level detail

OxyMake × SLURM Deep Dive

OxyMake and SLURM solve different halves of the HPC workflow problem. OxyMake owns the what: which jobs to run, in what order, and what can be skipped. SLURM owns the where: which node, how many cores, how much memory. This page explains how the two systems fit together — from job packaging through monitoring to real-cluster deployment.

The Three Graphs Meet SLURM

Before any executor sees a job, OxyMake transforms the user's declarations through three graph representations. Understanding this pipeline is essential for understanding what SLURM actually receives.

Graph Transformation Pipeline

flowchart TD
    A["Oxymakefile.toml<br/><i>Declarative TOML</i>"] --> B["RuleGraph<br/><i>Abstract: wildcards intact</i>"]
    B -->|"Wildcard resolution<br/>+ guard evaluation"| C["JobGraph<br/><i>Concrete: every job instance</i>"]
    C -->|"Optimization passes"| D["Optimized JobGraph"]
    D -->|"Cache pruning removes<br/>up-to-date jobs"| E["Uncached Subgraph"]
    E -->|"submit_dag()"| F["sbatch scripts<br/><i>Per-job or job arrays<br/>with --dependency chains</i>"]
    F -->|"sbatch --parsable<br/>+ --dependency=afterok"| G["SLURM Scheduler<br/><i>slurmctld</i>"]

    style A fill:#f9f,stroke:#333
    style F fill:#ff9,stroke:#333
    style G fill:#9ff,stroke:#333

Optimization Before Submission

Before any executor sees the graph, OxyMake runs optimization passes:

| Pass | Effect |
|------|--------|
| Cache pruning | Marks up-to-date jobs as "skip" |
| Task fusion | Merges sequential call-mode jobs into one |
| Materialization elimination | Removes unnecessary disk I/O |
| Critical path analysis | Annotates the longest chain for priority |

Only the uncached subgraph is submitted to SLURM. After pruning, ox plan reports the jobs that remain, in the standard plan format -- for a large, mostly-cached pipeline:

Plan: 12 rules, 847 jobs, 1203 source files

SLURM Job Packaging

Two Submission Modes

OxyMake supports two SLURM submission strategies, chosen automatically:

flowchart TB
    subgraph "OxyMake (Rust)"
        A["Optimized JobGraph<br/>847 uncached jobs"]
    end

    A --> DECIDE{"Same rule,<br/>many wildcards?"}

    DECIDE -->|"Yes"| ARRAY["Job Array<br/><i>1 sbatch + N tasks</i>"]
    DECIDE -->|"No"| INDIVIDUAL["Individual Jobs<br/><i>N sbatch calls with<br/>--dependency=afterok chains</i>"]

    subgraph "SLURM Cluster"
        ARRAY --> SC["slurmctld"]
        INDIVIDUAL --> SC
        SC --> C1["c1"]
        SC --> C2["c2"]
        SC --> CN["..."]
    end

    style A fill:#ff9,stroke:#333
    style SC fill:#9ff,stroke:#333

Mode 1: Individual jobs with --dependency=afterok chains. Each job gets its own sbatch script. Upstream dependencies are encoded as --dependency=afterok:JOBID1:JOBID2. Jobs are submitted in topological order so that upstream SLURM IDs are known before downstream jobs reference them. Cached upstream jobs are omitted — their outputs already exist on the shared filesystem, so no SLURM dependency is needed.

Mode 2: Job arrays for wildcard-expanded rules. When a single rule (e.g., process) expands to many concrete jobs via wildcards, OxyMake packages them as a single SLURM job array. One sbatch call submits all tasks. Each task reads its parameters from a JSON-lines file indexed by SLURM_ARRAY_TASK_ID.

Why --dependency=afterok Chains?

Unlike the Ray executor (which generates a single driver script), the SLURM executor submits one sbatch per job (or job array) and lets SLURM's own scheduler enforce ordering:

| Benefit | Why |
|---------|-----|
| Native SLURM scheduling | slurmctld handles priority, backfill, preemption |
| Cluster-native visibility | Every job appears in squeue and sacct |
| Granular accounting | Per-job CPU time, memory, node assignment |
| Standard cancellation | scancel works on individual jobs |
| Fair-share integration | Jobs participate in the cluster's fair-share scheduler |

Generated Job Script Structure

The Rust code in ox-exec-slurm/src/job_script.rs generates bash scripts that look like this:

#!/bin/bash
#SBATCH --job-name=ox_process_j-042
#SBATCH --output=/scratch/staging/run-001/j-042/slurm-%j.out
#SBATCH --error=/scratch/staging/run-001/j-042/slurm-%j.err
#SBATCH --partition=gpu
#SBATCH --account=my-lab
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --gpus=1
#SBATCH --time=01:06:00

# --- Environment setup ---
set -euo pipefail
module load conda 2>/dev/null || true
eval "$(conda shell.bash hook)"
conda activate ml-env

# --- Working directory ---
cd "/data/projects/my-pipeline"

# --- Execute command ---
python train.py --sample=s01 --output=results/s01.parquet

Key design decisions:

  • cd to project directory (not staging dir) so that relative output paths resolve to the same locations as the local executor — essential for cache correctness.
  • Job name truncated to 255 characters (SLURM's limit).
  • set -euo pipefail so failures propagate immediately.
  • --time derived from job timeout with a 10% buffer if not explicitly set via the time resource.
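
The last two decisions are easy to check by hand. The following is a minimal sketch, not the code in job_script.rs: the function names are hypothetical, and rounding the buffer up to a whole second is an assumption. It uses integer arithmetic so that a 1-hour timeout yields exactly the 01:06:00 shown in the script above.

```python
SLURM_JOB_NAME_LIMIT = 255  # limit quoted in the text above


def derive_time_limit(timeout_seconds: int, buffer_percent: int = 10) -> str:
    """Format timeout + buffer as HH:MM:SS for #SBATCH --time (hypothetical helper)."""
    # Integer ceiling division avoids float surprises such as 3600 * 1.1 != 3960.
    buffer_seconds = (timeout_seconds * buffer_percent + 99) // 100
    total = timeout_seconds + buffer_seconds
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def truncate_job_name(name: str) -> str:
    """Cut a job name down to the length limit (hypothetical helper)."""
    return name[:SLURM_JOB_NAME_LIMIT]


if __name__ == "__main__":
    print(derive_time_limit(3600))  # 1-hour timeout with a 10% buffer
    print(len(truncate_job_name("ox_" + "x" * 400)))
```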

Job Array Script Structure

For wildcard-expanded rules, OxyMake generates an array script with a parameter file:

#!/bin/bash
#SBATCH --job-name=ox_array_align
#SBATCH --array=0-4%2
#SBATCH --output=/scratch/staging/slurm-%A_%a.out
#SBATCH --error=/scratch/staging/slurm-%A_%a.err
#SBATCH --partition=gpu
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G

# --- Environment setup ---
set -euo pipefail

# --- Working directory ---
cd "/data/projects/pipeline"

# --- Array task dispatch ---
PARAMS_FILE="$(dirname "$0")/array_params.jsonl"
TASK_LINE=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" "$PARAMS_FILE")

# Export wildcard values as environment variables
export OX_JOB_ID=$(echo "$TASK_LINE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["job_id"])')
export OX_WC_sample=$(echo "$TASK_LINE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["wildcards"]["sample"])')

# --- Execute command ---
TASK_CMD=$(echo "$TASK_LINE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["command"])')
eval "$TASK_CMD"

The companion array_params.jsonl holds one JSON object per array task; task N reads line N + 1. The first three entries:

{"index":0,"job_id":"j-1","wildcards":{"sample":"A"},"command":"bwa mem -t 8 ref.fa data/A.fq > results/A.bam"}
{"index":1,"job_id":"j-2","wildcards":{"sample":"B"},"command":"bwa mem -t 8 ref.fa data/B.fq > results/B.bam"}
{"index":2,"job_id":"j-3","wildcards":{"sample":"C"},"command":"bwa mem -t 8 ref.fa data/C.fq > results/C.bam"}

The %2 suffix in --array=0-4%2 throttles to 2 concurrent tasks (configurable via job_array.max_concurrent).
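
How the parameter file and the --array value relate can be sketched in a few lines. This is an illustration under assumptions, not OxyMake's implementation: build_array_submission is a hypothetical name, and the input shape (a list of dicts with job_id, wildcards, and command) is chosen to mirror the JSON-lines records shown above.

```python
import json
from typing import Optional


def build_array_submission(jobs: list, max_concurrent: Optional[int] = None):
    """Return (array_spec, params_text) for one wildcard-expanded rule."""
    if not jobs:
        raise ValueError("a job array needs at least one task")
    lines = []
    for index, job in enumerate(jobs):
        record = {
            "index": index,  # equals SLURM_ARRAY_TASK_ID for this task
            "job_id": job["job_id"],
            "wildcards": job["wildcards"],
            "command": job["command"],
        }
        # Compact separators keep one record per line, as in the example file.
        lines.append(json.dumps(record, separators=(",", ":")))
    spec = f"0-{len(jobs) - 1}"
    if max_concurrent:
        spec += f"%{max_concurrent}"  # throttle: at most N tasks run at once
    return spec, "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [
        {"job_id": f"j-{i + 1}", "wildcards": {"sample": s},
         "command": f"bwa mem -t 8 ref.fa data/{s}.fq > results/{s}.bam"}
        for i, s in enumerate("ABCDE")
    ]
    spec, params = build_array_submission(demo, max_concurrent=2)
    print(f"#SBATCH --array={spec}")
    print(params, end="")
```

Because task IDs start at 0 and sed counts lines from 1, the dispatch script reads line SLURM_ARRAY_TASK_ID + 1.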

The Bridge (ADR-008)

The Executor trait formalizes the separation between OxyMake's scheduler and remote executors. The SLURM executor implements these communication directions:

flowchart LR
    subgraph "OxyMake (Rust)"
        S["Scheduler<br/><i>DAG owner, cache, gates</i>"]
        ST["ox status"]
    end

    subgraph "Executor Trait"
        direction TB
        INIT["INIT<br/><i>init(), health_check()</i>"]
        SUB["SUBMIT<br/><i>submit_dag(), execute()</i>"]
        MON["MONITOR<br/><i>poll_status()</i>"]
        CTL["CONTROL<br/><i>cancel(), cleanup()</i>"]
    end

    subgraph "SLURM Cluster"
        R["slurmctld<br/><i>sbatch, sacct, squeue</i>"]
    end

    S -->|"uncached subgraph"| SUB
    SUB -->|"sbatch --parsable"| R
    R -->|"sacct/squeue"| MON
    MON -->|"JobStatus"| ST
    S -->|"cancel"| CTL
    CTL -->|"scancel"| R
    INIT -->|"sinfo --version"| R

Separation of Concerns

| Concern | OxyMake (Scheduler) | SLURM (Executor) |
|---------|---------------------|------------------|
| DAG construction | Parses Oxymakefile, resolves wildcards | -- |
| Cache checking | Content-addressable (blake3) | -- |
| Optimization | Cache pruning, task fusion, critical path | -- |
| Job packaging | Generates sbatch scripts, dependency chains | -- |
| Task placement | -- | Which node, backfill scheduling |
| Resource allocation | -- | CPU, memory, GPU, GRES scheduling |
| Fair-share | -- | Multi-user priority, QOS enforcement |
| Node management | Failed node exclusion list | Node health, drain/resume |

State Synchronization

After submission, OxyMake stays connected via adaptive polling:

sequenceDiagram
    participant OX as OxyMake Scheduler
    participant FS as Shared Filesystem
    participant SC as slurmctld
    participant C as Compute Nodes

    OX->>FS: Write sbatch scripts to staging_dir
    OX->>SC: sbatch --parsable job.sh
    SC-->>OX: 12345 (SLURM job ID)
    OX->>FS: Write meta.json

    loop Adaptive Polling (5s-60s)
        OX->>SC: sacct -j 12345 --parsable2
        SC-->>OX: 12345|RUNNING|0:0|512M|00:05:30|c1
        Note over OX: State change → reset backoff
    end

    alt sacct unavailable
        OX->>SC: squeue -j 12345 -h -o %T
        SC-->>OX: RUNNING
    end

    alt Job terminal (COMPLETED/FAILED)
        OX->>OX: Map SLURM state → JobResult
        OX->>FS: Collect slurm-*.out/err logs
        OX->>FS: Clean staging directory
    end

Adaptive backoff prevents overloading slurmctld:

  • Start at 5 seconds (configurable via poll_interval_min)
  • Multiply by 1.5× each poll with no state change
  • Cap at 60 seconds (configurable via poll_interval_max)
  • Reset to minimum on any state change
  • Batch queries: sacct -j id1,id2,...,idN — one call for all jobs
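
The backoff rules above fit in a small class. This is a sketch under the stated defaults (5 s minimum, 60 s maximum, factor 1.5); the class and method names are hypothetical, not OxyMake's API.

```python
class AdaptiveBackoff:
    """Poll interval that grows while nothing changes and resets on change."""

    def __init__(self, minimum: float = 5.0, maximum: float = 60.0, factor: float = 1.5):
        self.minimum = minimum
        self.maximum = maximum
        self.factor = factor
        self.interval = minimum  # the first poll waits the minimum

    def next_interval(self, state_changed: bool) -> float:
        """Return how long to wait before the next poll."""
        if state_changed:
            self.interval = self.minimum  # any state change resets the backoff
        else:
            self.interval = min(self.interval * self.factor, self.maximum)
        return self.interval


if __name__ == "__main__":
    backoff = AdaptiveBackoff()
    # Eight quiet polls in a row: the interval climbs until it hits the cap.
    print([backoff.next_interval(False) for _ in range(8)])
    print(backoff.next_interval(True))  # a state change drops back to the minimum
```

With these defaults the cap is reached after seven consecutive quiet polls.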

The meta.json contract:

{
  "executor": "slurm",
  "version": 1,
  "run_id": "run-20250401-120000",
  "total_jobs": 847,
  "active_jobs": 847,
  "skipped_jobs": 102582,
  "job_mapping": {
    "align-A": "12345",
    "align-B": "12346",
    "sort-A": "12347"
  }
}

Resource Mapping

OxyMake resources map to SLURM #SBATCH directives via resource_mapper.rs:

| OxyMake | SLURM | Notes |
|---------|-------|-------|
| cpu | --cpus-per-task | Per-task CPU cores |
| mem | --mem | Total memory per node (e.g., "8G") |
| mem_mb | --mem | Memory in MB (auto-appends M suffix) |
| mem_per_cpu | --mem-per-cpu | Memory per CPU core |
| gpu | --gpus | GPU count |
| gres | --gres | Generic resources (e.g., "gpu:2") |
| nodes | --nodes | Node count (multi-node jobs) |
| tasks | --ntasks | MPI task count |
| ntasks_per_node | --ntasks-per-node | Tasks per node |
| partition | --partition | SLURM partition |
| time | --time | Wall time limit (HH:MM:SS) |
| qos | --qos | Quality of Service |

Mutual exclusion: --mem and --mem-per-cpu cannot both be specified. OxyMake validates this at submission time and returns a clear error.

Timeout derivation: If no explicit time resource is set but the job has a timeout, OxyMake derives --time with a 10% buffer. A 1-hour timeout becomes --time=01:06:00.

[rule.train]
output = ["model/weights.pt"]
resources = { cpu = 8, mem = "32G", gpu = 2, time = "4:00:00" }
environment = { conda = "torch-env" }
shell = "python train.py --epochs=100"
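
To make the mapping concrete, here is a sketch of how a resources table like the one above could be turned into #SBATCH lines. It is an illustration only: to_sbatch_directives is a hypothetical name, the real logic lives in resource_mapper.rs, and treating mem_mb as conflicting with mem_per_cpu is an assumption based on both mapping to --mem.

```python
# Resource keys that translate one-to-one into an sbatch flag (from the table above).
DIRECT_FLAGS = {
    "cpu": "--cpus-per-task",
    "mem": "--mem",
    "mem_per_cpu": "--mem-per-cpu",
    "gpu": "--gpus",
    "gres": "--gres",
    "nodes": "--nodes",
    "tasks": "--ntasks",
    "ntasks_per_node": "--ntasks-per-node",
    "partition": "--partition",
    "time": "--time",
    "qos": "--qos",
}


def to_sbatch_directives(resources: dict) -> list:
    """Translate an OxyMake-style resources dict into #SBATCH lines."""
    uses_mem = "mem" in resources or "mem_mb" in resources
    if uses_mem and "mem_per_cpu" in resources:
        raise ValueError("--mem and --mem-per-cpu cannot both be specified")
    directives = []
    for key, value in resources.items():
        if key == "mem_mb":
            directives.append(f"#SBATCH --mem={value}M")  # append the M suffix
        elif key in DIRECT_FLAGS:
            directives.append(f"#SBATCH {DIRECT_FLAGS[key]}={value}")
        else:
            raise ValueError(f"unknown resource: {key}")
    return directives


if __name__ == "__main__":
    for line in to_sbatch_directives({"cpu": 8, "mem": "32G", "gpu": 2, "time": "4:00:00"}):
        print(line)
```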

SLURM Job States

SLURM reports over a dozen job states. OxyMake maps them to five: Queued, Running, Completed, Failed, and Cancelled.

stateDiagram-v2
    [*] --> Queued: sbatch accepted
    Queued --> Running: Resources allocated

    Running --> Completed: Exit code 0
    Running --> Failed: Non-zero exit
    Running --> Failed: TIMEOUT
    Running --> Failed: OUT_OF_MEMORY
    Running --> Failed: NODE_FAIL
    Running --> Cancelled: scancel / PREEMPTED

    state Queued {
        PENDING
        REQUEUED
        SUSPENDED
        CONFIGURING
    }

    state Running {
        RUNNING_STATE: RUNNING
        COMPLETING
        RESIZING
    }

    state Failed {
        FAILED_STATE: FAILED
        TIMEOUT_STATE: TIMEOUT
        OOM: OUT_OF_MEMORY
        NODE_FAIL_STATE: NODE_FAIL
        BOOT_FAIL
        DEADLINE
    }

    state Cancelled {
        CANCELLED_STATE: CANCELLED
        PREEMPTED_STATE: PREEMPTED
        REVOKED
    }
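
The grouping in the diagram can also be written as a lookup table. The sketch below copies the state names from the diagram; the function name and the handling of unknown states are assumptions. It also strips the suffix that sacct can attach to a cancelled job's state (for example "CANCELLED by 1000").

```python
# SLURM state -> OxyMake state, following the diagram above.
SLURM_TO_OX = {
    "PENDING": "Queued", "REQUEUED": "Queued",
    "SUSPENDED": "Queued", "CONFIGURING": "Queued",
    "RUNNING": "Running", "COMPLETING": "Running", "RESIZING": "Running",
    "COMPLETED": "Completed",
    "FAILED": "Failed", "TIMEOUT": "Failed", "OUT_OF_MEMORY": "Failed",
    "NODE_FAIL": "Failed", "BOOT_FAIL": "Failed", "DEADLINE": "Failed",
    "CANCELLED": "Cancelled", "PREEMPTED": "Cancelled", "REVOKED": "Cancelled",
}


def map_state(slurm_state: str) -> str:
    """Map a raw SLURM state string to an OxyMake state (hypothetical helper)."""
    tokens = slurm_state.split()
    if not tokens:
        raise ValueError("empty SLURM state")
    token = tokens[0]  # "CANCELLED by 1000" -> "CANCELLED"
    if token not in SLURM_TO_OX:
        raise ValueError(f"unmapped SLURM state: {token}")
    return SLURM_TO_OX[token]


if __name__ == "__main__":
    for raw in ("PENDING", "COMPLETING", "OUT_OF_MEMORY", "CANCELLED by 1000"):
        print(raw, "->", map_state(raw))
```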

Failed Node Exclusion

When a job reports NODE_FAIL or BOOT_FAIL, OxyMake:

  1. Queries sacct for the failing node's hostname
  2. Adds it to an in-memory exclusion set
  3. Passes --exclude=node1,node2 on all future sbatch submissions
  4. Reports excluded nodes when the workflow completes

This prevents cascading failures from bad hardware without requiring manual intervention.
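
The four steps amount to a set that only grows during a run. A minimal sketch, with a hypothetical class name and under the assumption that the node hostname has already been read from sacct:

```python
class NodeExclusions:
    """In-memory set of nodes to keep future submissions away from."""

    NODE_FAILURE_STATES = {"NODE_FAIL", "BOOT_FAIL"}

    def __init__(self):
        self._nodes = set()

    def record(self, slurm_state: str, node: str) -> None:
        """Remember the node if the job ended because of a node-level failure."""
        if slurm_state in self.NODE_FAILURE_STATES and node:
            self._nodes.add(node)

    def sbatch_args(self) -> list:
        """Extra sbatch arguments for all later submissions."""
        if not self._nodes:
            return []
        return ["--exclude=" + ",".join(sorted(self._nodes))]

    def report(self) -> list:
        """Excluded nodes, for the end-of-workflow summary."""
        return sorted(self._nodes)


if __name__ == "__main__":
    exclusions = NodeExclusions()
    exclusions.record("NODE_FAIL", "c2")
    exclusions.record("FAILED", "c1")  # ordinary failure: node is not excluded
    print(exclusions.sbatch_args())
```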

Monitoring: sacct Primary, squeue Fallback

Status polling uses a two-tier strategy:

flowchart TD
    START["Poll job status"] --> SACCT["sacct -j ID --parsable2"]
    SACCT --> SACCT_OK{"Records<br/>found?"}
    SACCT_OK -->|"Yes"| PARSE["Parse state,<br/>exit code, memory,<br/>elapsed, node"]
    SACCT_OK -->|"No (empty or failed)"| SQUEUE["squeue -j ID -h -o %T"]
    SQUEUE --> SQ_OK{"Job in<br/>queue?"}
    SQ_OK -->|"Yes"| RUNNING["Report as<br/>Running/Queued"]
    SQ_OK -->|"No"| RETRY["Wait 2s,<br/>retry sacct"]
    RETRY --> RETRY_OK{"Found<br/>now?"}
    RETRY_OK -->|"Yes"| PARSE
    RETRY_OK -->|"No"| LOST["Report as<br/>JobNotFound"]
    PARSE --> TERMINAL{"Terminal<br/>state?"}
    TERMINAL -->|"Yes"| RESULT["Return JobResult<br/>(exit code, duration,<br/>peak memory, node)"]
    TERMINAL -->|"No"| BACKOFF["Adaptive backoff<br/>(5s → 60s)"]
    BACKOFF --> START

Why the fallback? Some HPC clusters don't have slurmdbd (the SLURM accounting daemon) configured, making sacct unavailable. squeue always works but provides less information (no exit codes, no memory stats, no elapsed time for completed jobs).

The 2-second retry handles a race condition: a job can vanish from squeue (it finished) before sacct has ingested the accounting record.
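
Parsing the sacct side of this flow is mostly string splitting. The sketch below assumes the six-column layout from the sample line in the sequence diagram earlier (JobID, State, ExitCode, MaxRSS, Elapsed, NodeList); the exact --format list OxyMake passes is not shown in this page, so treat the column order and all names as hypothetical.

```python
from typing import NamedTuple


class SacctRecord(NamedTuple):
    job_id: str
    state: str
    exit_code: int
    max_rss: str
    elapsed: str
    node: str


def parse_sacct(output: str) -> dict:
    """Parse `sacct --parsable2 --noheader` output into {job_id: SacctRecord}."""
    records = {}
    for line in output.splitlines():
        fields = line.split("|")
        if len(fields) != 6:
            continue  # blank or unexpected line
        job_id, state, exit_field, max_rss, elapsed, node = fields
        if "." in job_id:
            continue  # job step such as 12345.batch or 12345.0
        exit_code = int(exit_field.split(":")[0])  # "exit:signal" -> exit
        records[job_id] = SacctRecord(job_id, state, exit_code, max_rss, elapsed, node)
    return records


if __name__ == "__main__":
    sample = (
        "12345|RUNNING|0:0|512M|00:05:30|c1\n"
        "12345.batch|RUNNING|0:0|512M|00:05:30|c1\n"
        "12346|FAILED|137:9||00:00:10|c2\n"
    )
    for record in parse_sacct(sample).values():
        print(record)
```

An empty result for a job ID is the signal to fall back to squeue, as in the flowchart.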

Docker Setup: Containerized SLURM Cluster

OxyMake ships a Docker Compose setup for local testing and CI:

graph TB
    subgraph "docker-compose.yml"
        MYSQL["mysql<br/><i>MariaDB 10.11</i><br/>Port 3306"]
        DBD["slurmdbd<br/><i>Accounting daemon</i><br/>Port 6819"]
        CTL["slurmctld<br/><i>Controller</i><br/>Port 6817"]
        REST["slurmrestd<br/><i>REST API gateway</i><br/>Port 6820"]
        C1["c1<br/><i>Compute node</i>"]
        C2["c2<br/><i>Compute node</i>"]
    end

    SHARED[("/work<br/><i>Shared volume</i>")]
    DATA[("/data/lab<br/><i>Host bind mount</i>")]
    JWT[("shared-slurm<br/><i>JWT key volume</i>")]

    MYSQL --> DBD
    DBD --> CTL
    CTL --> REST
    CTL --> C1
    CTL --> C2

    JWT --- CTL
    JWT --- REST
    JWT --- DBD
    SHARED --- CTL
    SHARED --- C1
    SHARED --- C2
    DATA --- CTL
    DATA --- C1
    DATA --- C2

    style SHARED fill:#cfc,stroke:#333
    style DATA fill:#cfc,stroke:#333
    style JWT fill:#ff9,stroke:#333
    style REST fill:#9ff,stroke:#333

Start the Cluster

cd tests/slurm-docker
docker compose up -d

# Wait ~20 seconds for all services to initialize
docker compose exec slurmctld sinfo -N -h
# Output:
#   c1  normal  idle
#   c2  normal  idle

Cluster Configuration

The slurm.conf defines a minimal 2-node cluster:

ClusterName=oxymake-demo
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/slurmdbd
NodeName=c[1-2] CPUs=2 RealMemory=2048 State=UNKNOWN
PartitionName=normal Nodes=c[1-2] Default=YES MaxTime=INFINITE State=UP

Key settings:

  • select/cons_tres with CR_Core: Consumable resources at the core level — each job gets exactly the cores it requests.
  • sched/backfill: Allows smaller jobs to start while larger jobs wait for resources, improving utilization.
  • slurmdbd with MariaDB: Full accounting so sacct works.

JWT Authentication Setup

The Docker cluster configures JWT authentication automatically:

  1. slurmctld generates a random 256-bit key at startup (/etc/slurm/jwt_hs256.key)
  2. The key is shared via the shared-slurm Docker volume
  3. slurmdbd and slurmrestd pick up the key and add AuthAltTypes=auth/jwt to their configuration
  4. Clients authenticate with X-SLURM-USER-TOKEN (JWT) and X-SLURM-USER-NAME headers

Generate a token for local testing:

# Generate a JWT token for user "root" (valid 1 hour)
docker compose exec slurmctld scontrol token lifespan=3600
# Output: SLURM_JWT=eyJhbGciOi...

export SLURM_JWT=eyJhbGciOi...

Port Mapping

| Port | Service | Purpose |
|------|---------|---------|
| 6817 | slurmctld | SLURM controller API |
| 6819 | slurmdbd | Accounting database daemon |
| 6820 | slurmrestd | REST API gateway (HTTP/JSON) |
| 3306 | mysql | MariaDB (slurmdbd backend) |

Submit a Test Job

docker compose exec slurmctld bash -c '
  echo "#!/bin/bash
hostname
date
sleep 5
echo done" > /work/test.sh && sbatch /work/test.sh'
# Output: Submitted batch job 1

# Check status:
docker compose exec slurmctld sacct --parsable2 --noheader -o JobID,State,ExitCode
# Output: 1|COMPLETED|0:0

Teardown

docker compose down -v   # Remove containers and volumes

Two Modes: CLI vs REST API

Mode 1: CLI (sbatch / sacct)

The default and most common mode. OxyMake shells out to SLURM CLI commands. This works on any cluster where the user has SLURM in their $PATH:

# OxyMake internally runs:
sbatch --parsable job.sh                    # Submit → returns job ID
sacct -j 12345 --parsable2 -o JobID,State   # Poll status
scancel 12345                                # Cancel if needed

Pros: Works everywhere, no extra setup, respects Munge auth.

Cons: One process spawn per command, rate limiting required at scale.

Mode 2: REST API (slurmrestd)

For programmatic access, SLURM provides slurmrestd — an HTTP/JSON gateway to the same operations:

# Start slurmrestd (typically done by the cluster admin)
slurmrestd -a rest_auth/jwt 0.0.0.0:6820

# Submit a job via HTTP
curl -X POST http://slurmctld:6820/slurm/v0.0.44/job/submit \
  -H "Content-Type: application/json" \
  -H "X-SLURM-USER-NAME: $USER" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
  -d '{
    "script": "#!/bin/bash\nhostname\ndate",
    "job": {
      "name": "ox_test",
      "partition": "normal",
      "cpus_per_task": 4,
      "memory_per_node": { "number": 8, "set": true, "infinite": false },
      "tasks": 1
    }
  }'

# Poll status
curl http://slurmctld:6820/slurm/v0.0.44/job/12345 \
  -H "X-SLURM-USER-NAME: $USER" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT"

# Cancel
curl -X DELETE http://slurmctld:6820/slurm/v0.0.44/job/12345 \
  -H "X-SLURM-USER-NAME: $USER" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT"

Pros: No process spawning, structured JSON responses, lower latency at scale.

Cons: Requires slurmrestd to be running, JWT authentication setup, not universally available.

Both modes are supported. CLI mode is the default. To use REST mode, pass --slurm-api http://host:6820 (or set SlurmConfig::api_url). Authentication uses X-SLURM-USER-NAME (from $USER) and X-SLURM-USER-TOKEN (from $SLURM_JWT, optional).
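
For readers who want to script the same call outside OxyMake, the request can be assembled as below. This is a sketch: build_submit_request is a hypothetical helper, the endpoint path and header names are copied from the curl examples above, and nothing is sent over the network.

```python
import json
from typing import Optional


def build_submit_request(api_url: str, script: str, job: dict, user: str,
                         token: Optional[str] = None,
                         api_version: str = "v0.0.44"):
    """Return (url, headers, body) for a slurmrestd job submission."""
    url = f"{api_url.rstrip('/')}/slurm/{api_version}/job/submit"
    headers = {
        "Content-Type": "application/json",
        "X-SLURM-USER-NAME": user,
    }
    if token:  # the token header is optional, as noted above
        headers["X-SLURM-USER-TOKEN"] = token
    body = json.dumps({"script": script, "job": job}).encode("utf-8")
    return url, headers, body


if __name__ == "__main__":
    url, headers, body = build_submit_request(
        "http://slurmctld:6820",
        "#!/bin/bash\nhostname\ndate",
        {"name": "ox_test", "partition": "normal", "tasks": 1},
        user="alice",
        token="eyJ...",  # placeholder, not a real token
    )
    print(url)
    print(sorted(headers))
```

The returned triple can be passed to any HTTP client, for example urllib.request.Request(url, data=body, headers=headers, method="POST").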

REST API Flow

The full lifecycle of a job submitted via REST mode:

sequenceDiagram
    participant OX as OxyMake<br/>(Rust)
    participant REST as slurmrestd<br/>:6820
    participant CTL as slurmctld
    participant C as Compute Nodes<br/>(c1, c2)

    Note over OX: Generate job script<br/>+ write to staging_dir

    OX->>REST: POST /slurm/v0.0.44/job/submit<br/>Headers: X-SLURM-USER-NAME, X-SLURM-USER-TOKEN<br/>Body: {script, job: {name, partition, cpus, mem}}
    REST->>CTL: Internal SLURM protocol
    CTL-->>REST: job_id: 12345
    REST-->>OX: {"job_id": 12345}

    loop Adaptive Polling (5s–60s)
        OX->>REST: GET /slurm/v0.0.44/job/12345
        REST->>CTL: Query job state
        CTL-->>REST: Job state + metadata
        REST-->>OX: {"job_state": "RUNNING", ...}
    end

    CTL->>C: Dispatch job to node
    C->>C: Execute sbatch script
    C-->>CTL: Exit code 0

    OX->>REST: GET /slurm/v0.0.44/job/12345
    REST-->>OX: {"job_state": "COMPLETED", "exit_code": 0}

    alt Cancel needed
        OX->>REST: DELETE /slurm/v0.0.44/job/12345
        REST->>CTL: scancel 12345
    end

Environment requirement: Unlike CLI sbatch (which inherits the submitter's shell environment), the REST API starts with an empty environment. OxyMake injects default PATH and HOME variables to ensure scripts can find basic utilities.

The Bridge: OxyMake DAG → sbatch Dependency Chain

The core translation from OxyMake's DAG to SLURM's execution model:

flowchart LR
    subgraph "OxyMake DAG"
        direction TB
        A["generate-s01"]
        B["generate-s02"]
        C["process-s01"]
        D["process-s02"]
        E["merge"]
        F["report"]
        A --> C
        B --> D
        C --> E
        D --> E
        E --> F
    end

    subgraph "SLURM Submission"
        direction TB
        SA["sbatch generate-s01.sh<br/>→ SLURM 100"]
        SB["sbatch generate-s02.sh<br/>→ SLURM 101"]
        SC["sbatch --dependency=afterok:100<br/>process-s01.sh → SLURM 102"]
        SD["sbatch --dependency=afterok:101<br/>process-s02.sh → SLURM 103"]
        SE["sbatch --dependency=afterok:102:103<br/>merge.sh → SLURM 104"]
        SF["sbatch --dependency=afterok:104<br/>report.sh → SLURM 105"]
    end

    A -.-> SA
    B -.-> SB
    C -.-> SC
    D -.-> SD
    E -.-> SE
    F -.-> SF

Topological submission: Jobs are submitted in topological order. When OxyMake submits process-s01, it already knows that generate-s01 was assigned SLURM ID 100, so it can add --dependency=afterok:100.

Cached jobs are transparent: If generate-s01 is cached (outputs exist and are up-to-date), it is never submitted to SLURM. When process-s01 is submitted, its dependency list omits the cached job entirely — the outputs are already on the shared filesystem.
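
Both rules can be seen in a short dry-run planner. This is a sketch, not OxyMake's submit_dag(): plan_submissions is a hypothetical name, SLURM IDs are simulated with a counter instead of being read from sbatch --parsable, and script names are assumed to be the job name plus ".sh".

```python
from graphlib import TopologicalSorter


def plan_submissions(deps: dict, cached: set, first_id: int = 100):
    """Return (slurm_ids, commands) for the uncached part of a DAG.

    deps maps each job to the set of jobs it depends on.
    """
    slurm_ids = {}
    commands = []
    next_id = first_id
    # static_order() yields every job after all of its upstream jobs.
    for job in TopologicalSorter(deps).static_order():
        if job in cached:
            continue  # never submitted; outputs already exist on shared storage
        upstream = [slurm_ids[u] for u in sorted(deps.get(job, ())) if u in slurm_ids]
        command = ["sbatch", "--parsable"]
        if upstream:
            command.append("--dependency=afterok:" + ":".join(map(str, upstream)))
        command.append(f"{job}.sh")
        commands.append(command)
        slurm_ids[job] = next_id  # stand-in for the ID printed by sbatch
        next_id += 1
    return slurm_ids, commands


if __name__ == "__main__":
    dag = {
        "process-s01": {"generate-s01"},
        "process-s02": {"generate-s02"},
        "merge": {"process-s01", "process-s02"},
        "report": {"merge"},
    }
    _, commands = plan_submissions(dag, cached={"generate-s01"})
    for command in commands:
        print(" ".join(command))
```

In the demo run generate-s01 is cached, so process-s01 is submitted with no --dependency flag at all.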

Environment Support on HPC

SLURM clusters have unique environment constraints:

Conda / Module System

HPC clusters use module load for software management. OxyMake generates the appropriate setup:

[rule.train.environment]
conda = "torch-env"

Generates:

module load conda 2>/dev/null || true
eval "$(conda shell.bash hook)"
conda activate torch-env

Apptainer (Not Docker)

Most HPC clusters prohibit Docker because its daemon requires root privileges. When a Docker environment is specified with the SLURM executor, OxyMake warns and substitutes Apptainer:

[rule.inference.environment]
docker = "nvcr.io/nvidia/pytorch:24.01-py3"

Generates:

# WARNING: Docker not supported on most HPC clusters.
# Consider using Apptainer (environment = { type = "apptainer", ... }).
apptainer exec nvcr.io/nvidia/pytorch:24.01-py3

For explicit Apptainer support:

[rule.inference.environment]
apptainer = "/shared/images/pytorch-24.01.sif"

Shared Filesystem Constraint

All data — job scripts, inputs, outputs — must live on a filesystem visible to both the scheduling node and compute nodes:

flowchart LR
    subgraph "Login / Submit Node"
        OX["ox run<br/>--executor slurm"]
        DB[("state.db<br/><i>Local disk only<br/>(SQLite WAL)</i>")]
    end

    subgraph "Shared Filesystem<br/>(NFS / Lustre / GPFS)"
        STAGE["staging_dir/<br/><i>sbatch scripts</i>"]
        DATA["project/<br/><i>inputs + outputs</i>"]
    end

    subgraph "Compute Nodes"
        C1["c1: slurmd"]
        C2["c2: slurmd"]
    end

    OX --> DB
    OX -->|"write scripts"| STAGE
    STAGE -->|"read scripts"| C1
    STAGE -->|"read scripts"| C2
    C1 -->|"read/write"| DATA
    C2 -->|"read/write"| DATA

    style DB fill:#fcc,stroke:#333
    style STAGE fill:#cfc,stroke:#333
    style DATA fill:#cfc,stroke:#333

Critical constraint: state.db uses SQLite WAL mode, which does not work on network filesystems (NFS, Lustre, GPFS). The ox run process must execute on a node with local disk. Compute nodes never access state.db — they only read sbatch scripts and read/write data files on the shared filesystem.

Configuration

Configure the SLURM executor in .oxymake/config.toml or Oxymakefile.toml:

[executor.slurm]
partition = "gpu"
account = "my-lab"
qos = "high"
staging_dir = "/scratch/oxymake"
max_submit = 100
poll_interval_min = "5s"
poll_interval_max = "60s"
extra_flags = ["--mail-type=FAIL", "--mail-user=user@lab.edu"]

[executor.slurm.job_array]
enabled = true
max_array_size = 1000
max_concurrent = 50

| Setting | Default | Description |
|---------|---------|-------------|
| partition | cluster default | SLURM partition |
| account | none | Account for resource accounting |
| qos | none | Quality of Service |
| staging_dir | /tmp/oxymake-slurm | Directory for scripts + logs (must be shared) |
| max_submit | unlimited | Max concurrent submitted jobs (rate limiting) |
| poll_interval_min | 5s | Minimum adaptive poll interval |
| poll_interval_max | 60s | Maximum adaptive poll interval |
| extra_flags | [] | Additional #SBATCH flags (passed through verbatim) |
| job_array.enabled | true | Use job arrays for wildcard expansions |
| job_array.max_array_size | unlimited | Maximum tasks per array |
| job_array.max_concurrent | unlimited | Max concurrent array tasks (%N throttle) |

Switching to a Real Cluster

Moving from the Docker test cluster to a production HPC environment:

Grid'5000

[executor.slurm]
partition = "default"
staging_dir = "/home/$USER/oxymake-staging"
extra_flags = ["--reservation=my-reservation"]

# On a Grid'5000 frontend:
oarsub -I -t deploy -l nodes=4,walltime=2:00:00
# Then inside the reservation:
ox run --executor slurm -j 16

IDRIS (Jean Zay)

[executor.slurm]
partition = "gpu_p13"
account = "abc@v100"
qos = "qos_gpu-t3"
staging_dir = "$WORK/oxymake-staging"
extra_flags = ["--hint=nomultithread"]

# On Jean Zay:
module load python/3.11 cuda/12.1
ox run --executor slurm

GCP + Slurm-GCP

[profile.gcloud]
executor = "slurm"
partition = "batch"
account = "default"
jobs = 100

[profile.gcloud-gpu]
executor = "slurm"
partition = "gpu"
account = "default"
jobs = 20

ox run --profile gcloud

Google Cloud's HPC Toolkit deploys a SLURM cluster with autoscaling — nodes spin up on demand when jobs enter the queue and spin down when idle. The Filestore NFS mount provides the shared filesystem required by OxyMake's SLURM executor.

For a full setup guide including cluster provisioning, SSH tunneling, and cost control, see the Cloud HPC cookbook, which works through a Google Cloud cluster as one concrete example.

Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Polling too fast | Use adaptive backoff (5s minimum). Aggressive 1s polls can get you rate-limited or banned from HPC clusters. |
| state.db on NFS | Run ox run on a node with local disk. SQLite WAL mode fails on network filesystems. |
| Forgetting --parsable | OxyMake always uses sbatch --parsable; raw output format varies by SLURM version and locale. |
| Job name too long | Truncated automatically to 255 characters. |
| Docker on HPC | OxyMake warns and substitutes apptainer exec. Use Apptainer explicitly. |
| sacct field truncation | OxyMake uses --parsable2 which avoids field-width truncation. |
| sacct job step noise | OxyMake filters to main job entries only (skips 12345.batch, 12345.0). |
| Exit code format | sacct returns exit:signal (e.g., 137:9). OxyMake parses only the first number. |
| mem + mem_per_cpu conflict | OxyMake validates mutual exclusion at submission time with a clear error. |

Philosophy: Complementary, Not Overlapping

OxyMake and SLURM solve orthogonal problems:

| Dimension | OxyMake | SLURM |
|-----------|---------|-------|
| Core question | What to run? | Where to run it? |
| Key innovation | Content-addressable cache | Fair-share batch scheduler |
| Configuration | Declarative TOML | slurm.conf + sbatch flags |
| DAG model | Three-level (Rule → Job → Exec) | Flat job queue + dependencies |
| Cache | blake3 content hashing | None (execution-only) |
| Scheduling | Topological + priorities + gates | Backfill + fair-share + QOS |
| State | Persistent (state.db, cache) | Transient (job lifetime) |
| Data model | Shared filesystem + optional object store | Shared filesystem only |

SLURM vs Ray: When to Use Which

| Dimension | SLURM | Ray |
|-----------|-------|-----|
| Target | HPC clusters (static allocation) | Cloud/elastic clusters |
| Submission | sbatch (CLI) or slurmrestd (REST) | Ray Jobs API (HTTP) |
| Scheduling | Fair-share + backfill + QOS | First-come + autoscaler |
| GPU support | GRES (--gres=gpu:2) | First-class (num_gpus=0.5) |
| Data passing | Shared filesystem only | Object store (zero-copy) |
| Job arrays | Native (--array=0-N) | N/A (individual tasks) |
| Latency | 1–5s per submission | ~100ms per submission |
| Scaling model | Fixed cluster, admin-managed | Elastic, autoscaler |
| Multi-user | Fair-share, preemption, QOS | Single-tenant by default |
| Best for | Batch HPC, multi-user clusters, GPU scheduling | ML pipelines, interactive, cloud-native |

Rule of thumb: Use SLURM when you have a shared HPC cluster with existing SLURM infrastructure. Use Ray when you need elastic scaling, fast job turnaround, or in-memory data passing between tasks.

Why Not Snakemake + SLURM?

Snakemake also integrates with SLURM, but with important differences:

Snakemake: Manages the DAG from a long-running process. Submits jobs one at a time as dependencies complete. Uses file timestamps for caching. Cannot do task fusion or materialization elimination.

OxyMake: Submits the entire dependency chain up front via --dependency=afterok. SLURM sees the full picture and can backfill more aggressively. Uses content hashes (not timestamps) for caching. Optimization passes (fusion, materialization elimination) reduce the number of jobs before submission.
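
The difference between timestamp and content-hash caching shows up as soon as a file is touched without being changed. The sketch below is illustrative only: it uses hashlib.blake2b from the Python standard library as a stand-in, since blake3 (which OxyMake uses) is not in the standard library, and the function names are hypothetical.

```python
import hashlib
import os
import tempfile


def content_key(path: str) -> str:
    """Hash a file's bytes (blake2b here as a stand-in for blake3)."""
    digest = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def touch_demo():
    """Return (mtime_changed, hash_changed) after bumping a file's mtime."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv")
        with open(path, "wb") as handle:
            handle.write(b"a,b\n1,2\n")
        before = os.stat(path)
        key_before = content_key(path)
        # Simulate a touch or a fresh checkout: new mtime, identical bytes.
        os.utime(path, (before.st_atime, before.st_mtime + 3600))
        mtime_changed = os.stat(path).st_mtime != before.st_mtime
        hash_changed = content_key(path) != key_before
        return mtime_changed, hash_changed


if __name__ == "__main__":
    print(touch_demo())
```

A timestamp-based tool sees the first value and schedules a re-run; a content-addressed cache looks at the second and skips the job.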

Quick Start

1. Configure the executor

# Oxymakefile.toml
[executor.slurm]
partition = "normal"
staging_dir = "/scratch/$USER/oxymake"

2. Run your workflow on SLURM

ox run --executor slurm

OxyMake handles caching, DAG optimization, and sbatch generation. SLURM handles task placement, resource allocation, and scheduling. Your workflow file does not change.

3. Monitor execution

ox status                   # OxyMake's view (aggregated)
squeue -u $USER             # SLURM's view (per-job)
sacct -j <id> --format=...  # Detailed job accounting

4. Run the demo (Docker)

# Build OxyMake
cargo build --bin ox

# Start the test cluster and run the full demo
just demo-slurm
# Or manually:
bash tests/slurm-docker/run-demo.sh

Idempotent Execution

If you have used Terraform, you already understand OxyMake's execution model. ox run does not mean "launch these jobs." It means "ensure these outputs exist."

This is a fundamental design choice that affects everything from how you think about running workflows to how multiple people can work on the same pipeline simultaneously.

The Convergent Model

When you run ox run, OxyMake looks at each job in the requested subgraph and makes a decision:

| Current state | What OxyMake does |
|---------------|-------------------|
| Output exists and inputs haven't changed | Skip -- nothing to do |
| Job is already running (another session) | Attach -- wait for it, don't re-launch |
| Job is pending and unclaimed | Claim and execute |
| Job failed in a previous run | Re-execute |

The result: running the same command twice does nothing extra. Running it while another instance is already working cooperates instead of conflicting.

ox run --rule '/human/'         # Launches the human-cohort jobs
ox run --rule '/human/'         # all skipped (cached), nothing re-runs
ox run --rule '/human/'         # (while first is running) attaches to running jobs
ox run                          # Launches yeast+mouse, attaches to human
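
The decision table reduces to a small pure function. This is a sketch of the table's logic only; the status strings, the Action enum, and the decide function are hypothetical names, not OxyMake's internal types.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"
    ATTACH = "attach"
    EXECUTE = "execute"


def decide(status: str, outputs_up_to_date: bool) -> Action:
    """Pick what one session does with one job, per the table above."""
    if outputs_up_to_date:
        return Action.SKIP  # output exists and inputs have not changed
    if status == "running":
        return Action.ATTACH  # another session owns it: wait, do not re-launch
    if status in ("pending", "failed"):
        return Action.EXECUTE  # claim it (pending) or re-execute it (failed)
    raise ValueError(f"unexpected job status: {status}")


if __name__ == "__main__":
    print(decide("pending", outputs_up_to_date=True))
    print(decide("running", outputs_up_to_date=False))
    print(decide("failed", outputs_up_to_date=False))
```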

The Terraform Analogy

If you are familiar with infrastructure-as-code tools, the mapping is direct:

| Terraform | OxyMake | Meaning |
|-----------|---------|---------|
| terraform plan | ox plan | Show what would happen |
| terraform apply | ox run | Make it so |
| terraform destroy | ox invalidate | Undo it |

Just as terraform apply creates only the resources that don't already exist, ox run executes only the jobs whose outputs are missing or stale.

Cooperative Sessions

The most powerful consequence of idempotent execution is that multiple ox run processes can work on the same project simultaneously, without conflicts.

How It Works

OxyMake uses SQLite (WAL mode) as a coordination layer. When a session wants to execute a job, it claims it atomically:

UPDATE jobs SET status = 'running', session_id = ?, locked_by = ?
WHERE id = ? AND status = 'pending';

If another session already claimed the job (0 rows affected), the current session either waits for it (if it needs the output) or moves on to other work.
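
The claim can be tried end to end with Python's built-in sqlite3 module. This is a sketch: the UPDATE statement is the one shown above, but the table schema, the helper names, and the use of a "host:pid" string for locked_by are assumptions made for the demo.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """In-memory database with a minimal, assumed jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id TEXT PRIMARY KEY, status TEXT NOT NULL,"
        " session_id INTEGER, locked_by TEXT)"
    )
    conn.execute("INSERT INTO jobs (id, status) VALUES ('align-A', 'pending')")
    conn.commit()
    return conn


def claim_job(conn: sqlite3.Connection, job_id: str, session_id: int, locked_by: str) -> bool:
    """Try to claim a pending job; True means this session now owns it."""
    cursor = conn.execute(
        "UPDATE jobs SET status = 'running', session_id = ?, locked_by = ? "
        "WHERE id = ? AND status = 'pending'",
        (session_id, locked_by, job_id),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 rows affected: someone else got there first


if __name__ == "__main__":
    db = open_demo_db()
    print(claim_job(db, "align-A", 1, "login1:4242"))  # first claimant
    print(claim_job(db, "align-A", 2, "login2:5151"))  # second claimant
```

The status = 'pending' condition in the WHERE clause is what makes the claim safe: whichever session's UPDATE runs second matches no rows.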

Example: Two Terminals

# Terminal 1: start the human pipeline
ox run --rule '/human/'
#  Session 1: 2,100 jobs to run

# Terminal 2 (while T1 is running): start the mouse pipeline
ox run --rule '/mouse/'
#  Session 2: 3,423 jobs to run. 0 conflicts with session 1.

# Terminal 3: run everything
ox run
#  Session 3: 10,247 total jobs
#    2,100 running (human, session 1) — attaching
#    3,423 running (mouse, session 2) — attaching
#    1,312 cached (completed by sessions 1+2) — skipping
#    3,412 to run (yeast + remaining) — executing

Session 3 does not duplicate work. It attaches to what sessions 1 and 2 are already doing, skips what they have finished, and picks up the rest.

Stale Session Recovery

If a session crashes (power failure, OOM kill), its jobs are not stuck forever. Each session sends a heartbeat every few seconds. If the heartbeat is older than 2 minutes, the session is considered dead, and its running jobs are reset to pending for other sessions to claim.

No manual cleanup required.
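
The recovery rule can be sketched without a database. The 2-minute threshold comes from the text above; the data shapes (a dict of last heartbeat times and a dict of job records) and the function name are assumptions for illustration.

```python
STALE_AFTER_SECONDS = 120  # "older than 2 minutes"


def recover_stale_jobs(heartbeats: dict, jobs: dict, now: float) -> list:
    """Reset running jobs owned by dead sessions; return the reset job IDs.

    heartbeats maps session_id -> time of last heartbeat (seconds).
    jobs maps job_id -> {"status": ..., "session_id": ...} and is updated in place.
    """
    dead = {
        session_id
        for session_id, last_beat in heartbeats.items()
        if now - last_beat > STALE_AFTER_SECONDS
    }
    reset = []
    for job_id, job in jobs.items():
        if job["status"] == "running" and job["session_id"] in dead:
            job["status"] = "pending"  # claimable by any live session again
            job["session_id"] = None
            reset.append(job_id)
    return reset


if __name__ == "__main__":
    beats = {1: 1000.0, 2: 1290.0}
    table = {
        "align-A": {"status": "running", "session_id": 1},
        "align-B": {"status": "running", "session_id": 2},
    }
    print(recover_stale_jobs(beats, table, now=1300.0))
```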

The Lifecycle Commands

The convergent model needs symmetric operations. OxyMake provides five commands that form a complete algebra of workflow control:

| Command | Meaning | Analogy |
|---------|---------|---------|
| ox run | Ensure outputs exist | terraform apply |
| ox cancel | Stop pursuing outputs | Ctrl+C with precision |
| ox invalidate | Forget outputs exist | make clean with precision |
| ox plan | Show what would happen | terraform plan |
| ox status | Show what is happening | kubectl get pods |

Cancel

ox cancel --where cohort=human    # Stop human jobs
ox cancel --rule call             # Stop all variant calls
ox cancel --session 2             # Stop everything session 2 is doing
ox cancel                         # Stop everything

Canceled jobs have their partial outputs deleted and their status reset to pending. The next ox run will re-execute them.

Invalidate

ox invalidate --rule call                  # Delete variant-call outputs + cache entries
ox invalidate --rule call --cascade        # + all downstream outputs
ox invalidate --since "2026-03-22"         # Everything computed after this date
ox invalidate --run 3                      # Everything from run #3

The --cascade flag is important: invalidating a feature rule without cascade leaves stale calls that depend on the old feature values. With --cascade, OxyMake traverses the DAG forward and invalidates everything downstream.
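
The forward traversal behind --cascade can be pictured as a breadth-first walk over "who consumes my output" edges. The sketch below is an illustration under assumptions: the job names and the `downstream` mapping are made up, and OxyMake's real traversal operates on its own graph types.

```python
from collections import deque

def cascade(downstream, roots):
    """Return the roots plus every job reachable by following edges forward."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        job = queue.popleft()
        for child in downstream.get(job, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical graph: producer -> jobs that consume its output.
downstream = {
    "features": ["call"],
    "call": ["merge"],
    "merge": ["report"],
    "qc": ["report"],
}

without_cascade = {"features"}
with_cascade = cascade(downstream, ["features"])
print(sorted(with_cascade))
```

Without the cascade only `features` is invalidated, which leaves `call`, `merge`, and `report` built from stale values. The unrelated `qc` job is untouched either way.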

Why This Matters

The idempotent execution model means:

  1. No accidental double-execution. Two people running the same command cooperate instead of conflicting.
  2. Fearless re-running. You can always run ox run again. If everything is up to date, it finishes instantly.
  3. Incremental by nature. Add new rules, change parameters, re-run. Only the affected subgraph recomputes.
  4. Crash-resilient. Completed work survives process death. Just re-run.
  5. Observable. ox status shows exactly what is happening across all sessions.

Crate Graph — How OxyMake Fits Together

A first-time contributor clones two dozen ox-* crates and needs a mental model before reading any code. This page is that model: which crate does what, which depends on which, and the one rule that keeps the whole thing legible.

If you only remember one sentence, remember this:

ox-core takes no ox-* dependency. Every other crate points inward, toward ox-core. Nothing points back out.

That is the textbook hexagonal (ports-and-adapters) shape. ox-core is the domain. The crates around it are either supporting domain libraries, driven adapters (things the engine calls — executors, storage, reports), driving adapters (things that call the engine — the CLI, the MCP server), or the composition layer that wires them together (ox-api).

This is not the same picture as the three-graph data pipeline (RuleGraph → JobGraph → ExecGraph) in The Three Graphs. That describes how a workflow is resolved at runtime. This page describes how the code is layered. Newcomers routinely conflate the two — they are orthogonal.

The shape

graph TB
    subgraph driving["Driving adapters — entrypoints (call the engine)"]
        cli["ox-cli<br/>the ox binary"]
        mcp["ox-mcp<br/>MCP server for agents"]
    end

    subgraph app["Composition layer (wires the engine together)"]
        api["ox-api<br/>embeddable Rust facade"]
    end

    subgraph support["Supporting domain libraries (depend only on ox-core)"]
        format["ox-format"]
        state["ox-state"]
        cache["ox-cache"]
        plan["ox-plan"]
        codec["ox-codec-core"]
        lock["ox-lock"]
    end

    core(["ox-core<br/>domain core — ZERO ox-* deps"])

    subgraph driven["Driven adapters (the engine calls them)"]
        execlocal["ox-exec-local"]
        execray["ox-exec-ray"]
        execslurm["ox-exec-slurm"]
        envsys["ox-env-system"]
        envuv["ox-env-uv"]
        storage["ox-storage-local"]
        repjson["ox-report-json"]
        repterm["ox-report-term"]
        render["ox-render"]
        translate["ox-translate"]
        dashboard["ox-dashboard"]
        tui["ox-monitor-tui"]
    end

    cli --> api
    cli -->|"+ every driven adapter (see table)"| driven
    mcp --> core
    mcp --> format
    mcp --> state
    mcp --> cache
    mcp --> plan
    api --> core
    api --> format
    api --> state
    api --> cache
    api --> plan

    format --> core
    state --> core
    cache --> core
    plan --> core
    codec --> core
    lock --> core

    execlocal --> codec
    execlocal --> core
    execray --> codec
    execray --> core
    execslurm --> core
    envsys --> core
    envuv --> core
    storage --> core
    repjson --> core
    repterm --> render
    repterm --> core
    translate --> format
    translate --> core
    dashboard --> state
    tui --> state

    classDef hub fill:#1f6feb,color:#fff,stroke:#0b3a8c,stroke-width:2px;
    class core hub;

Every arrow A --> B means "crate A depends on crate B". They all flow inward. ox-core has no outgoing ox-* arrow, and that is the load-bearing invariant. (ox-render also has no ox-* dependency: it is a pure terminal-styling leaf that ox-report-term builds on, not a second hub.)

Roles, one line each

ox-core is the hub; the rest are grouped by their architectural role.

The hub

| Crate | Role |
|---|---|
| ox-core | Core types, the DAG, the scheduler, and the traits (Storage, Executor, FormatCodec, …) every adapter implements. Zero ox-* dependencies. |

Supporting domain libraries (depend only on ox-core)

| Crate | Role |
|---|---|
| ox-format | Parse and serialize the Oxymakefile.toml surface. |
| ox-state | Run-state persistence: the SQLite state.db. |
| ox-cache | Content-addressable output cache. |
| ox-plan | Optimization passes on the JobGraph: pruning, merging, scheduling hints. |
| ox-codec-core | The FormatCodec trait and built-in codecs (JSON, CSV, Parquet) for in-memory data passing between jobs. |
| ox-lock | The reproducibility lockfile (ox.lock), which captures exact workflow state for drift detection. |

Composition layer

| Crate | Role |
|---|---|
| ox-api | The public, embeddable Rust facade. Composes ox-core + ox-format + ox-state + ox-cache + ox-plan into the engine. The single entry point for embedding OxyMake. |

Driving adapters (entrypoints — they call the engine)

| Crate | Role |
|---|---|
| ox-cli | The ox binary. Depends on 21 of the 24 ox-* crates (ox-api plus every supporting library and driven adapter): everything except itself and the two not-yet-wired crates below. It is the shell that assembles the whole engine. |
| ox-mcp | Model Context Protocol server for AI agents. Composes the same inner crates as ox-api (it does not go through ox-api). |

Driven adapters (the engine calls them — each implements an ox-core trait)

| Crate | Role |
|---|---|
| ox-exec-local | Local-process executor. |
| ox-exec-ray | Ray-cluster executor (uses ox-codec-core for data passing). |
| ox-exec-slurm | SLURM executor. |
| ox-env-system | System/host environment provider. |
| ox-env-uv | uv-managed per-rule Python virtualenvs. |
| ox-storage-local | Local-filesystem Storage implementation. |
| ox-report-json | JSON run reports. |
| ox-report-term | Terminal run reports (builds on ox-render). |
| ox-render | Semantic color roles and terminal styling. No ox-* deps. |
| ox-translate | Translate foreign formats (Snakemake, WDL) ↔ Oxymakefile.toml (uses ox-format). |
| ox-dashboard | Web dashboard backend (reads ox-state). |
| ox-monitor-tui | TUI live monitor (reads ox-state). |

Not yet wired into the ox binary

These crates compile and depend only inward, but no entrypoint consumes them yet. They are staged for a future release, not dead code.

| Crate | Role |
|---|---|
| ox-metrics | Prometheus metrics export over ox-state. |
| ox-cache-remote | Remote cache backends (S3, GCS, local directory) for sharing artifacts across machines. |

Outside the engine graph

| Crate | Role |
|---|---|
| oxymake | Name-reservation crate on crates.io. It is the one publishable crate; the real engine ships as the ox binary via GitHub Releases. Not part of the dependency graph. |

The exact edges (verified against cargo tree)

The table below is the authoritative ox-* → ox-* edge list. It is generated from each crate's [dependencies] and matches cargo tree -e no-dev --workspace. The diagram above shows the shape; this table is the ground truth. If you change an inter-crate dependency, update this table (and re-confirm the inward-pointing rule).

| Crate | Depends on (ox-* only) |
|---|---|
| ox-core | (none; the hub) |
| ox-render | (none) |
| ox-format | ox-core |
| ox-state | ox-core |
| ox-cache | ox-core |
| ox-cache-remote | ox-core |
| ox-plan | ox-core |
| ox-codec-core | ox-core |
| ox-lock | ox-core |
| ox-env-system | ox-core |
| ox-env-uv | ox-core |
| ox-exec-slurm | ox-core |
| ox-storage-local | ox-core |
| ox-report-json | ox-core |
| ox-exec-local | ox-codec-core, ox-core |
| ox-exec-ray | ox-codec-core, ox-core |
| ox-report-term | ox-core, ox-render |
| ox-translate | ox-core, ox-format |
| ox-dashboard | ox-core, ox-state |
| ox-monitor-tui | ox-core, ox-state |
| ox-metrics | ox-core, ox-state |
| ox-api | ox-core, ox-format, ox-state, ox-cache, ox-plan |
| ox-mcp | ox-core, ox-format, ox-state, ox-cache, ox-plan |
| ox-cli | ox-api + 20 others = 21 of the 24 ox-* crates (all except itself, ox-cache-remote, ox-metrics) |
| oxymake | (name reservation; no ox-* deps) |

To regenerate this view locally:

cargo tree -e no-dev --workspace      # full dependency tree
cargo tree -e no-dev -i ox-core       # invert: who depends on ox-core (≈ everyone)
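
The inward-pointing rule is also easy to check mechanically. The sketch below transcribes the edge table above into a Python dict and verifies three things: the hub has no ox-* dependencies, no edge points at an unknown crate, and the graph has no cycles. The `violations` helper is a hypothetical name, and the check is an illustration rather than a tool shipped with OxyMake.

```python
from graphlib import CycleError, TopologicalSorter

# Transcribed from the edge table above (ox-* -> ox-* edges only).
DEPS = {
    "ox-core": set(),
    "ox-render": set(),
    "ox-format": {"ox-core"},
    "ox-state": {"ox-core"},
    "ox-cache": {"ox-core"},
    "ox-cache-remote": {"ox-core"},
    "ox-plan": {"ox-core"},
    "ox-codec-core": {"ox-core"},
    "ox-lock": {"ox-core"},
    "ox-env-system": {"ox-core"},
    "ox-env-uv": {"ox-core"},
    "ox-exec-slurm": {"ox-core"},
    "ox-storage-local": {"ox-core"},
    "ox-report-json": {"ox-core"},
    "ox-exec-local": {"ox-codec-core", "ox-core"},
    "ox-exec-ray": {"ox-codec-core", "ox-core"},
    "ox-report-term": {"ox-core", "ox-render"},
    "ox-translate": {"ox-core", "ox-format"},
    "ox-dashboard": {"ox-core", "ox-state"},
    "ox-monitor-tui": {"ox-core", "ox-state"},
    "ox-metrics": {"ox-core", "ox-state"},
    "ox-api": {"ox-core", "ox-format", "ox-state", "ox-cache", "ox-plan"},
    "ox-mcp": {"ox-core", "ox-format", "ox-state", "ox-cache", "ox-plan"},
}
# ox-cli: every crate listed so far except the two not-yet-wired ones.
DEPS["ox-cli"] = set(DEPS) - {"ox-cache-remote", "ox-metrics"}

def violations(deps, hub="ox-core"):
    """Return a list of human-readable problems; empty means the rule holds."""
    problems = []
    if deps[hub]:
        problems.append(f"{hub} depends on {sorted(deps[hub])}")
    for crate, targets in deps.items():
        for target in targets:
            if target not in deps:
                problems.append(f"{crate} depends on unknown crate {target}")
    try:
        # static_order() raises CycleError if any dependency cycle exists.
        tuple(TopologicalSorter(deps).static_order())
    except CycleError as err:
        problems.append(f"dependency cycle: {err.args[1]}")
    return problems

print(len(DEPS), len(DEPS["ox-cli"]), violations(DEPS))
```

A check like this could run in CI next to the cargo tree command, so a new edge that points outward from ox-core fails the build instead of silently eroding the layering.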

Why this matters

The inward-pointing rule is what lets you add a new executor, a new storage backend, or a new report format without touching ox-core — you implement the relevant ox-core trait in a new ox-exec-* / ox-storage-* / ox-report-* crate and register it in ox-cli (or ox-api). The domain never learns about its adapters. That is the whole point of the hexagon, and it is the project's single best legibility asset.

For the formal boundary between what OxyMake proves and what it assumes of the substrate, see the Boundary — Substrate Axioms note in the repository.

Bioinformatics Pipeline

This cookbook walks through a multi-sample FASTQ-to-BAM-to-VCF variant calling pipeline in OxyMake. The workflow uses sort, grep, and wc as stand-ins for real bioinformatics tools (BWA, samtools, GATK), so you can run it on any machine without installing anything.

The concepts transfer directly to a production pipeline: just swap the shell commands for real tool invocations.

What You Will Learn

  • Wildcard-driven sample processing across multiple samples
  • Named inputs for rules with multiple input files
  • Tags for organizing pipeline stages
  • Target-based filtering to run a subset of samples
  • --rule filtering to run a subset of stages

The Complete Oxymakefile

Create a directory and save this as Oxymakefile.toml:

ox_version = "0.1"

[config]
samples = ["NA12878", "NA12891", "NA12892"]
chromosomes = ["chr1", "chr2", "chr3"]

# ── Default target ──────────────────────────────────────────────
[rule.all]
input = ["results/cohort_report.txt"]

# ── Stage 1: Generate mock FASTQ reads ─────────────────────────
[rule.simulate_reads]
output = ["fastq/{sample}_R1.fastq", "fastq/{sample}_R2.fastq"]
tags = ["stage.simulate", "fast"]
shell = """
mkdir -p fastq
for i in $(seq 1 50); do
  echo "@{sample}_read$i/1 chr$((i % 3 + 1)):$((i * 100))" >> {output[0]}
  echo "ACGTACGTACGTACGT" >> {output[0]}
  echo "+" >> {output[0]}
  echo "IIIIIIIIIIIIIIII" >> {output[0]}
  echo "@{sample}_read$i/2 chr$((i % 3 + 1)):$((i * 100))" >> {output[1]}
  echo "TGCATGCATGCATGCA" >> {output[1]}
  echo "+" >> {output[1]}
  echo "IIIIIIIIIIIIIIII" >> {output[1]}
done
"""

# ── Stage 2: Align reads → sorted BAM ──────────────────────────
# Stand-in: sort the FASTQ by read name to simulate alignment + sorting.
[rule.align]
input = { r1 = "fastq/{sample}_R1.fastq", r2 = "fastq/{sample}_R2.fastq" }
output = ["aligned/{sample}.bam"]
tags = ["stage.align", "compute-heavy"]
resources = { cpu = 4, mem = "8G" }
shell = """
mkdir -p aligned
echo "## BAM for {sample}" > {output}
echo "## Aligned from {input.r1} and {input.r2}" >> {output}
cat {input.r1} {input.r2} | grep "^@" | sort >> {output}
echo "## EOF" >> {output}
"""

# ── Stage 3: Call variants per chromosome ───────────────────────
# Stand-in: grep reads matching the chromosome, count them as "variants."
[rule.call_variants]
input = { bam = "aligned/{sample}.bam" }
output = ["vcf/{sample}_{chrom}.vcf"]
tags = ["stage.call", "compute-heavy"]
resources = { cpu = 2, mem = "4G" }
shell = """
mkdir -p vcf
echo "##fileformat=VCFv4.2" > {output}
echo "##source=oxymake-cookbook" >> {output}
echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> {output}
grep "{chrom}" {input.bam} | awk '{{
  split($2, a, ":");
  printf "%s\t%s\t.\tA\tG\t30\tPASS\tDP=20\\n", a[1], a[2]
}}' >> {output}
"""

# ── Stage 4: Merge per-chromosome VCFs into one per sample ─────
[rule.merge_vcf]
input = ["vcf/{sample}_{chrom}.vcf"]
output = ["vcf/{sample}_merged.vcf"]
tags = ["stage.merge"]
shell = """
echo "##fileformat=VCFv4.2" > {output}
echo "##source=oxymake-merge" >> {output}
echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> {output}
for f in {input}; do
  grep -v "^#" "$f" >> {output}
done
sort -k1,1 -k2,2n -o {output} {output}
"""

# ── Stage 5: Per-sample QC report ──────────────────────────────
[rule.qc]
input = { bam = "aligned/{sample}.bam", vcf = "vcf/{sample}_merged.vcf" }
output = ["qc/{sample}_report.txt"]
tags = ["stage.qc", "fast"]
shell = """
mkdir -p qc
echo "=== QC Report: {sample} ===" > {output}
echo "Total reads: $(grep -c "^@" {input.bam})" >> {output}
echo "Variants called: $(grep -vc "^#" {input.vcf})" >> {output}
echo "Chromosomes: $(grep -v "^#" {input.vcf} | cut -f1 | sort -u | tr '\n' ' ')" >> {output}
"""

# ── Stage 6: Cohort report ─────────────────────────────────────
[rule.cohort_report]
input = ["qc/{sample}_report.txt"]
output = ["results/cohort_report.txt"]
tags = ["stage.report"]
shell = """
mkdir -p results
echo "=============================" > {output}
echo " Variant Calling Cohort Report" >> {output}
echo "=============================" >> {output}
echo "" >> {output}
for f in {input}; do
  cat "$f" >> {output}
  echo "" >> {output}
done
echo "--- Summary ---" >> {output}
echo "Samples processed: $(echo {input} | wc -w | tr -d ' ')" >> {output}
"""

Create the Project

mkdir bioinfo-pipeline && cd bioinfo-pipeline
# Save the Oxymakefile.toml above
ox init   # if you want the .oxymake directory pre-created

No input data files are needed -- the simulate_reads rule generates everything from scratch.

Run the Full Pipeline

ox plan
Plan: 5 rules, 15 jobs, 3 source files
Targets: results/cohort_report.txt
  1. [simulate_reads-NA12878] rule=simulate_reads -> [fastq/NA12878_R1.fastq, fastq/NA12878_R2.fastq]
  2. [simulate_reads-NA12891] rule=simulate_reads -> [fastq/NA12891_R1.fastq, fastq/NA12891_R2.fastq]
  3. [align-NA12878] rule=align -> [aligned/NA12878.bam]
  4. [call_variants-NA12878-chr1] rule=call_variants -> [vcf/NA12878_chr1.vcf]
  ...
  15. [cohort_report] rule=cohort_report -> [results/cohort_report.txt]
ox run -j 4

OxyMake runs up to 4 jobs in parallel. The simulate_reads jobs run first (no dependencies), then align, then call_variants fans out across samples and chromosomes, and finally everything converges into the cohort report.

Filter by Sample

Run only one sample during development by requesting its leaf target (wildcards in the target select the matching jobs):

ox run "qc/NA12878_report.txt"

This builds the pipeline for NA12878 only, skipping NA12891 and NA12892. Combined with caching, this lets you iterate on pipeline logic without waiting for all samples.
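
Conceptually, a concrete target selects jobs by being matched against each rule's output pattern, which binds the wildcards; the bound values then fill in that rule's input patterns. The sketch below illustrates the idea with regular expressions. It is an assumption-laden illustration, not OxyMake's resolver: `pattern_to_regex` and `bind` are hypothetical helpers, wildcards are assumed not to span `/`, and a wildcard repeated within one pattern is not handled.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'qc/{sample}_report.txt' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)  # odd indexes are wildcard names
    pieces = []
    for index, part in enumerate(parts):
        pieces.append(f"(?P<{part}>[^/]+)" if index % 2 else re.escape(part))
    return re.compile("".join(pieces))

def bind(pattern, target):
    """Return the wildcard values that make pattern equal target, or None."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None

binding = bind("qc/{sample}_report.txt", "qc/NA12878_report.txt")
no_match = bind("aligned/{sample}.bam", "qc/NA12878_report.txt")

# The bound wildcards determine which upstream files the qc job needs.
qc_inputs = [
    p.format(**binding)
    for p in ("aligned/{sample}.bam", "vcf/{sample}_merged.vcf")
]
print(binding, no_match, qc_inputs)
```

Repeating this step on each required input is what walks the graph backward from the requested file, so only the NA12878 branch is ever instantiated.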

Later, run the full cohort:

ox run

NA12878 is cached. Only NA12891 and NA12892 are computed.

Filter by Rule

Run only the QC stage with --rule (exact name or /regex/; assumes upstream outputs exist):

ox run --rule qc

View the DAG grouped by stage:

ox dag --group-by tag

Named Inputs

Several rules use named inputs for clarity. Compare:

# Positional (works but cryptic with multiple inputs)
input = ["aligned/{sample}.bam", "vcf/{sample}_merged.vcf"]
shell = "check {input[0]} {input[1]}"

# Named (self-documenting)
input = { bam = "aligned/{sample}.bam", vcf = "vcf/{sample}_merged.vcf" }
shell = "check {input.bam} {input.vcf}"

Named inputs make your workflow readable as it grows.
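
The placeholder syntax in these shell templates resembles Python's str.format mini-language, which makes for a convenient way to experiment with it. The snippet below is an analogy only: it assumes OxyMake treats `{input.name}`, `{input[0]}`, and doubled braces the way str.format does, which the examples on this page suggest but which should be confirmed against the OxyMake reference.

```python
from types import SimpleNamespace

named = SimpleNamespace(bam="aligned/NA12878.bam", vcf="vcf/NA12878_merged.vcf")
positional = ["aligned/NA12878.bam", "vcf/NA12878_merged.vcf"]

# Named and positional access render to the same command line.
by_name = "check {input.bam} {input.vcf}".format(input=named)
by_index = "check {input[0]} {input[1]}".format(input=positional)

# Doubled braces produce literal braces, which is why the awk programs
# in the Oxymakefile are written with {{ and }}.
awk_cmd = "awk '{{ print $1 }}' {input.bam}".format(input=named)

print(by_name)
print(awk_cmd)
```

Both forms produce the same command; the named form simply keeps its meaning when an input is added or reordered.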

Adding a New Sample

Edit Oxymakefile.toml:

[config]
samples = ["NA12878", "NA12891", "NA12892", "NA12893"]

Run again:

ox run -j 4

Only the NA12893 jobs run. Everything else is cached.

Adapting to Real Tools

Replace the stand-in commands with real bioinformatics tools:

[rule.align]
input = { r1 = "fastq/{sample}_R1.fastq", r2 = "fastq/{sample}_R2.fastq" }
output = ["aligned/{sample}.bam"]
tags = ["stage.align", "compute-heavy"]
resources = { cpu = 8, mem = "32G" }
shell = """
bwa mem -t {resources.cpu} reference.fa {input.r1} {input.r2} \
  | samtools sort -@ 4 -o {output}
samtools index {output}
"""

The workflow structure stays the same. Only the shell commands change.

Next Steps

Climate Time-Series Pipeline

This cookbook builds a multi-station climate analysis pipeline in OxyMake. It covers feature engineering, index generation, and regional aggregation across a network of weather stations -- all driven by wildcards, snapshots, and execution history. Mock data (random readings) keeps the example self-contained.

What You Will Learn

  • Config-driven station network and parameter sweeps
  • Wildcard expansion across stations, features, and rolling windows
  • Named inputs for multi-file rules
  • Snapshots to compare analysis milestones
  • Execution history as a lightweight lab notebook
  • Tag-based filtering for fast iteration

The Complete Oxymakefile

Create a directory and save this as Oxymakefile.toml:

ox_version = "0.1"

[config]
stations = ["BOS", "DEN", "SEA", "AUS", "PDX"]
windows  = [5, 10, 20, 60]
metric   = ["trend", "anomaly"]

# ── Default target ──────────────────────────────────────────────
[rule.all]
input = ["reports/network_summary.txt"]

# ── Stage 1: Generate mock temperature readings ────────────────
[rule.mock_readings]
output = ["data/readings/{station}.csv"]
tags   = { stage = "data", speed = "fast" }
shell  = """
mkdir -p data/readings
echo "date,temp" > {output}
temp=15
for day in $(seq 1 252); do
  # Random daily temperature delta between -3 and +3 degrees
  seed=$((day * 17 + $(echo {station} | cksum | cut -d' ' -f1)))
  d=$(awk -v seed="$seed" 'BEGIN {{ srand(seed); printf "%.4f", (rand() - 0.5) * 6 }}')
  temp=$(awk -v t="$temp" -v d="$d" 'BEGIN {{ printf "%.2f", t + d }}')
  printf "2025-%03d,%s\\n" "$day" "$temp" >> {output}
done
"""

# ── Stage 2: Compute features ─────────────────────────────────
[rule.features]
input  = { readings = "data/readings/{station}.csv" }
output = ["data/features/{station}_{window}d.csv"]
tags   = { stage = "features", speed = "fast" }
shell  = """
mkdir -p data/features
echo "date,{station}_trend_{window}d,{station}_var_{window}d" > {output}
tail -n +2 {input.readings} | awk -F, -v lb={window} '
  BEGIN {{ OFS="," }}
  {{
    temps[NR] = $2
    if (NR > lb) {{
      trend = (temps[NR] - temps[NR - lb + 1]) / lb
      sum = 0; sq = 0
      for (i = NR - lb + 1; i <= NR; i++) {{
        r = temps[i] - temps[i-1]
        sum += r; sq += r * r
      }}
      var = (sq - sum*sum/lb) / (lb - 1)
      printf "%s,%.6f,%.6f\\n", $1, trend, var
    }}
  }}
' >> {output}
"""

# ── Stage 3: Generate indices ─────────────────────────────────
[rule.indices]
input  = ["data/features/{station}_{window}d.csv"]
output = ["data/indices/{station}_{metric}.csv"]
tags   = { stage = "indices" }
shell  = """
mkdir -p data/indices
echo "date,{station}_{metric}" > {output}

if [ "{metric}" = "trend" ]; then
  # Average trend across windows → warming vs cooling stations.
  # Each features file has 3 columns (date, trend, var), so after paste
  # the trend columns sit at positions 2, 5, 8, ...
  paste -d, data/features/{station}_*d.csv \
    | tail -n +2 \
    | awk -F, '{{ sum=0; n=0; for(i=2;i<=NF;i+=3){{ sum+=$i; n++ }}; if(n>0) printf "%s,%.6f\\n",$1,sum/n }}' \
    >> {output}
else
  # Anomaly: deviation from the mean trend (same column layout as above)
  paste -d, data/features/{station}_*d.csv \
    | tail -n +2 \
    | awk -F, '{{ sum=0; n=0; for(i=2;i<=NF;i+=3){{ sum+=$i; n++ }}; if(n>0) printf "%s,%.6f\\n",$1,-sum/n }}' \
    >> {output}
fi
"""

# ── Stage 4: Cross-station composite index ────────────────────
[rule.composite]
input  = ["data/indices/{station}_{metric}.csv"]
output = ["data/composite/{metric}_index.csv"]
tags   = { stage = "composite" }
shell  = """
mkdir -p data/composite
echo "date,station,weight" > {output}
# Rank-based regional index: center station values cross-sectionally to zero
paste -d, data/indices/*_{metric}.csv \
  | tail -n +2 \
  | awk -F, '
    BEGIN {{ split("{station}", stations, " ") }}
    {{
      n = 0; sum = 0
      for (i = 2; i <= NF; i += 2) {{ vals[++n] = $i; sum += $i }}
      mean = sum / n
      wsum = 0
      for (i = 1; i <= n; i++) {{ w[i] = vals[i] - mean; wsum += (w[i]>0?w[i]:-w[i]) }}
      if (wsum > 0) for (i = 1; i <= n; i++) w[i] /= wsum
      for (i = 1; i <= n; i++) printf "%s,%s,%.6f\\n", $1, stations[i], w[i]
    }}
  ' >> {output}
"""

# ── Stage 5: Cumulative index score ──────────────────────────
[rule.score]
input  = { weights = "data/composite/{metric}_index.csv", readings = "data/readings/{station}.csv" }
output = ["data/score/{metric}_score.csv"]
tags   = { stage = "score" }
shell  = """
mkdir -p data/score
echo "date,daily_index,cumulative_index" > {output}
# Simple: weight * daily reading, summed across stations
awk -F, '
  NR == FNR && FNR > 1 {{ weights[$1,$2] = $3; next }}
  FNR > 1 {{ readings[$1] = $2 }}
' {input.weights} data/readings/*.csv

# Simplified: accumulate a weighted daily index.
# asorti() is a gawk extension, so this step calls gawk explicitly.
tail -n +2 {input.weights} | gawk -F, '
  {{ idx[$1] += $3 * (rand() - 0.48) * 0.02 }}
  END {{
    cum = 0
    n = asorti(idx, dates)
    for (i = 1; i <= n; i++) {{
      cum += idx[dates[i]]
      printf "%s,%.6f,%.6f\\n", dates[i], idx[dates[i]], cum
    }}
  }}
' >> {output}
"""

# ── Stage 6: Summary report ──────────────────────────────────
[rule.report]
input  = ["data/score/{metric}_score.csv"]
output = ["reports/network_summary.txt"]
tags   = { stage = "report", speed = "fast" }
shell  = """
mkdir -p reports
echo "======================================" > {output}
echo "  Climate Network Pipeline — Summary"    >> {output}
echo "======================================" >> {output}
echo ""                                        >> {output}
echo "Network: {station}"                      >> {output}
echo "Windows: {window}"                        >> {output}
echo "Metrics: {metric}"                        >> {output}
echo ""                                         >> {output}
for f in {input}; do
  index=$(basename "$f" _score.csv)
  lines=$(tail -n +2 "$f" | wc -l | tr -d ' ')
  final=$(tail -1 "$f" | cut -d, -f3)
  echo "Index: $index"                          >> {output}
  echo "  Observation days: $lines"             >> {output}
  echo "  Final cumulative index: $final"       >> {output}
  echo ""                                       >> {output}
done
echo "--- Pipeline complete ---"               >> {output}
"""

Create the Project

mkdir climate-pipeline && cd climate-pipeline
# Save the Oxymakefile.toml above

No input data files are needed -- mock_readings generates synthetic data.

Explore the DAG

ox plan
Plan: 6 rules, 42 jobs, 5 source files
Targets: reports/network_summary.txt
  1. [mock_readings-BOS] rule=mock_readings -> [data/readings/BOS.csv]
  2. [features-BOS-5d] rule=features -> [data/features/BOS_5d.csv]
  3. [features-DEN-5d] rule=features -> [data/features/DEN_5d.csv]
  ...
  40. [composite-trend] rule=composite -> [data/composite/trend_index.csv]
  41. [score-trend] rule=score -> [data/score/trend_score.csv]
  42. [report] rule=report -> [reports/network_summary.txt]

The DAG fans out across stations and windows, then converges through indices and regional aggregation into a single report.

Run the Full Pipeline

ox run -j 4

OxyMake runs up to 4 jobs in parallel. The mock_readings jobs run first (no dependencies), then features fans out across stations x windows, and everything converges into the network report.
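
The "stations x windows" fan-out is a Cartesian product over the two config lists. The sketch below reproduces it for the features rule only, using the values from the [config] section above. It is a plain illustration of the expansion, not OxyMake's expander.

```python
from itertools import product

stations = ["BOS", "DEN", "SEA", "AUS", "PDX"]
windows = [5, 10, 20, 60]
pattern = "data/features/{station}_{window}d.csv"

# One features job per (station, window) pair.
outputs = [
    pattern.format(station=station, window=window)
    for station, window in product(stations, windows)
]

print(len(outputs))
print(outputs[0], outputs[-1])
```

Five stations and four windows give 20 independent features jobs, which is why raising -j pays off most at this stage of the pipeline.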

Iterate on a Single Station

During development, focus on one station by requesting its leaf target (wildcards in the target select the matching jobs):

ox run "data/indices/BOS_*.csv"

This builds the pipeline for BOS only. Later, run the full network:

ox run

BOS is cached. Only the remaining stations are computed.

Filter by Rule

Run only the feature computation stage with --rule (exact name or /regex/):

ox run --rule features

Snapshots: Compare Analysis Milestones

After a successful run, save a snapshot:

ox snapshot create baseline --message "5-station trend + anomaly"

Now add a new window (120 days) and a new metric. Edit the config:

[config]
windows = [5, 10, 20, 60, 120]
metric  = ["trend", "anomaly", "seasonal"]

Run again and save another snapshot:

ox run -j 4
ox snapshot create v2 --message "Added 120d window + seasonal metric"

Compare the two milestones:

ox snapshot diff baseline v2
Workflow hash changed (config modified)
Added:    15 jobs (features/*_120d, indices/*_seasonal, ...)
Changed:  2 jobs (composite, report — new inputs)
Unchanged: 40 jobs

This tells you exactly what changed between analysis iterations without manually tracking file modifications.
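
One way to think about a snapshot diff is as a comparison of two maps from job name to content hash. The sketch below assumes that model. The job names, the hash strings, and the `diff_snapshots` helper are all made up for illustration, and OxyMake's snapshot format may record more than this.

```python
def diff_snapshots(old, new):
    """Classify jobs by comparing two {job_name: content_hash} maps."""
    return {
        "added": sorted(job for job in new if job not in old),
        "removed": sorted(job for job in old if job not in new),
        "changed": sorted(job for job in new if job in old and new[job] != old[job]),
        "unchanged": sorted(job for job in new if job in old and new[job] == old[job]),
    }

# Hypothetical snapshots: the hashes are placeholders, not real digests.
baseline = {
    "features-BOS-5d": "hash-a",
    "composite-trend": "hash-b",
    "report": "hash-c",
}
v2 = {
    "features-BOS-5d": "hash-a",      # same inputs, same hash
    "features-BOS-120d": "hash-d",    # new window
    "composite-trend": "hash-b2",     # new upstream inputs
    "report": "hash-c2",
}

result = diff_snapshots(baseline, v2)
print(result)
```

A job counts as changed when its name survives but its hash differs, which is how downstream jobs show up as changed when only the config was edited.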

Execution History as a Lab Notebook

Each ox run is recorded with timing, job counts, and optional notes:

ox run -j 4 --note "Baseline: 5 stations, 4 windows"
# ... iterate ...
ox run -j 4 --note "Added seasonal metric, 120d window"

Review your analysis timeline:

ox history
RUN          STARTED              DURATION   OK  FAIL  SKIP  NOTE
run-a1b2c3   2025-01-15 09:12     12.3s     42    0     0   Baseline: 5 stations, 4 windows
run-d4e5f6   2025-01-15 09:45      4.1s     15    0    40   Added seasonal metric, 120d window

Drill into a specific run:

ox history --run-id run-a1b2c3

This shows per-job wall time, peak memory, and exit codes -- useful for identifying bottlenecks as your network grows.

Scaling the Network

Add more stations by editing [config]:

[config]
stations = ["BOS", "DEN", "SEA", "AUS", "PDX", "ORD", "ATL", "LAX", "JFK", "MIA"]

Run again:

ox run -j 8

Only the new stations are computed. Everything else is cached. As the network grows from 5 to 50 to 500 stations, the same Oxymakefile works -- OxyMake expands the wildcards and parallelizes automatically.

Next Steps

ML Training Pipeline

Coming soon.

This page will show a machine learning training pipeline in OxyMake, covering:

  • Data preparation: feature extraction, train/test splitting, and normalization
  • Hyperparameter sweeps: wildcard-driven grid search across learning rates, architectures, and regularization parameters
  • GPU resource management: declaring GPU requirements per rule for SLURM/Kubernetes scheduling
  • Model evaluation: automated metric collection and comparison
  • In-memory passing: using call mode with Arrow IPC to pass DataFrames between feature computation and training without disk I/O

Growing a Workflow Organically

Coming soon.

This page will illustrate how real research workflows evolve over time, covering:

  • Starting small: a 3-rule exploratory workflow
  • Adding complexity incrementally: new rules never invalidate existing cached results thanks to content-addressable caching
  • Workflow composition: splitting large workflows across files with include directives
  • Snapshot milestones: saving and comparing workflow states as research progresses
  • Run annotations: using ox run --note to create a lightweight research lab notebook from the execution history

Agent-Driven Workflows

Coming soon.

This page will demonstrate how AI agents can drive OxyMake pipelines programmatically, covering:

  • Structured NDJSON events: parsing --json output for typed event streams
  • Programmatic gate approval: agents evaluating metrics and approving quality checkpoints via ox gate approve
  • Automated error recovery: detecting failures from JSON events, adjusting parameters, and retrying
  • Multi-agent coordination: multiple agents driving different stages of a pipeline
  • End-to-end example: a complete pipeline driven by an LLM agent without human intervention

Cloud HPC with SLURM

OxyMake's SLURM executor targets any SLURM cluster — on-prem, academic (Grid'5000, Jean Zay), or cloud. This guide works one concrete cloud example end-to-end: a Google Cloud cluster provisioned with the HPC Toolkit. The same shape applies to AWS ParallelCluster, Azure CycleCloud, or any managed SLURM-on-cloud offering — only the provisioning commands change; the OxyMake profile and run loop are identical. It covers cluster provisioning, profile configuration, SSH tunneling for remote access, and running pipelines end-to-end.

Prerequisites

  • A GCP project with billing enabled
  • gcloud CLI installed and authenticated (gcloud auth login)
  • Terraform >= 1.3
  • The Cloud HPC Toolkit (ghpc CLI)

Cluster Architecture

The HPC Toolkit deploys a standard SLURM cluster on GCP:

graph TD
    subgraph VPC["GCP VPC"]
        Login["Login Node<br/>(SSH entry)"] --> Controller["Controller<br/>(slurmctld)"]
        Controller --> C0["c2-0 node"]
        Controller --> C1["c2-1 node"]
        Controller --> CN["c2-N node"]
        NFS["Filestore (NFS) — /mnt/shared"]
    end

Key points:

  • Controller node runs slurmctld and schedules jobs
  • Compute nodes auto-scale — spin up when jobs are queued, shut down when idle
  • Filestore provides the shared NFS filesystem required by OxyMake's SLURM executor
  • Login node is your SSH entry point for running ox run

Step 1: Provision the Cluster

Create the blueprint

Create a file oxymake-cluster.yaml:

# oxymake-cluster.yaml — HPC Toolkit blueprint
blueprint_name: oxymake-slurm

vars:
  project_id: YOUR_PROJECT_ID
  deployment_name: oxymake-slurm
  region: us-central1
  zone: us-central1-a

deployment_groups:
  - group: primary
    modules:

      # Network (declared first so later modules can reference it)
      - id: network
        source: modules/network/vpc

      # Shared filesystem (required by OxyMake SLURM executor)
      - id: homefs
        source: modules/file-system/filestore
        use: [network]
        settings:
          local_mount: /mnt/shared
          size_gb: 1024

      # SLURM partition — general-purpose compute
      - id: compute_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [network, homefs]
        settings:
          partition_name: batch
          machine_type: c2-standard-8    # 8 vCPU, 32 GB
          max_count: 10                  # Auto-scales 0 → 10 nodes
          enable_placement: false

      # GPU partition (optional)
      - id: gpu_partition
        source: community/modules/compute/schedmd-slurm-gcp-v6-partition
        use: [network, homefs]
        settings:
          partition_name: gpu
          machine_type: a2-highgpu-1g    # 1× A100
          max_count: 4
          enable_placement: false

      # SLURM controller + login node
      - id: slurm_controller
        source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
        use: [network, compute_partition, gpu_partition]
        settings:
          login_node_count: 1

      # Login node
      - id: slurm_login
        source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
        use: [network, slurm_controller]
        settings:
          machine_type: e2-standard-4

Deploy

# Generate Terraform from the blueprint
ghpc create oxymake-cluster.yaml

# Deploy
ghpc deploy oxymake-slurm

# Wait for the cluster to be ready (~5 minutes)
gcloud compute ssh oxymake-slurm-login0 --zone us-central1-a -- sinfo

You should see the batch and gpu partitions in the output.

Step 2: Configure the OxyMake Profile

Add a [profile.gcloud] section to your Oxymakefile.toml:

[profile.gcloud]
executor = "slurm"
partition = "batch"
account = "default"
jobs = 100                      # SLURM handles scheduling; allow many concurrent
keep_going = true               # Don't abort the full DAG on a single failure

[profile.gcloud-gpu]
executor = "slurm"
partition = "gpu"
account = "default"
jobs = 20

Run with the profile:

ox run --profile gcloud
ox run --profile gcloud-gpu    # For GPU workloads

Profile fields map to SLURM flags:

| Profile field | SLURM flag | Notes |
|---|---|---|
| executor | -- | Selects the SLURM backend |
| partition | --partition | Target partition (batch, gpu) |
| account | --account | Billing/fairshare account |
| qos | --qos | Quality of service tier |
| jobs | -- | OxyMake concurrency (not SLURM's) |

CLI flags always override profile values: ox run --profile gcloud --partition gpu overrides the partition from batch to gpu.
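
The precedence rule and the field-to-flag mapping can be sketched in a few lines. This is an illustration under assumptions: `resolve` and `sbatch_args` are hypothetical helpers, and the way OxyMake actually assembles its sbatch invocation is not shown in this guide. The --partition, --account, and --qos flags themselves are standard sbatch options.

```python
# Profile fields that translate to sbatch flags (from the table above).
SLURM_FLAGS = {"partition": "--partition", "account": "--account", "qos": "--qos"}

def resolve(profile, cli_overrides):
    """Merge settings: any CLI value that was actually given wins over the profile."""
    merged = dict(profile)
    merged.update({key: value for key, value in cli_overrides.items() if value is not None})
    return merged

def sbatch_args(settings):
    """Build the sbatch arguments for the fields that map to SLURM flags."""
    args = []
    for field, flag in SLURM_FLAGS.items():
        if field in settings:
            args += [flag, str(settings[field])]
    return args

gcloud = {
    "executor": "slurm",
    "partition": "batch",
    "account": "default",
    "jobs": 100,
    "keep_going": True,
}

# Equivalent of: ox run --profile gcloud --partition gpu
settings = resolve(gcloud, {"partition": "gpu", "qos": None})
print(sbatch_args(settings))
```

Fields such as `executor`, `jobs`, and `keep_going` never reach sbatch: they steer OxyMake itself, which matches the rows marked "--" in the table.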

Step 3: Prepare the Cluster

SSH into the login node and set up OxyMake:

gcloud compute ssh oxymake-slurm-login0 --zone us-central1-a

On the login node:

# Install OxyMake (from prebuilt binary or cargo)
curl -fsSL https://oxymake.noogram.dev/install.sh | sh
# or, from a clone of the OxyMake repository: cargo install --path crates/ox-cli

# Clone your workflow into the shared filesystem
cd /mnt/shared
git clone https://github.com/your-org/your-pipeline.git
cd your-pipeline

# Verify SLURM is accessible
sinfo                      # Should show partitions
ox run --executor slurm --dry-run   # Should show the DAG without submitting

Important: Run ox run from the login node (or controller), not from a compute node. The state.db must be on a local filesystem — /mnt/shared is NFS, so OxyMake stores state.db in a local directory by default.

Step 4: Run a Pipeline

# Dry run — see what would be submitted
ox run --profile gcloud --dry-run

# Submit the pipeline
ox run --profile gcloud

# Monitor jobs
squeue -u $USER              # SLURM's view
ox status                    # OxyMake's view

On GCP with auto-scaling, compute nodes spin up on demand. The first run may take a few extra minutes while nodes boot. Subsequent runs are faster as nodes remain warm for the configured idle timeout (default: 5 minutes).

SSH Tunnel for Remote Access

When running OxyMake from your local machine (not SSH'd into the cluster), you can either tunnel to the SLURM CLI tools (Option A/B) or use REST mode via slurmrestd (Option C).

Option A: SSH config (login node with an external IP)

Add to your ~/.ssh/config:

Host oxymake-slurm
    HostName <login-node-external-ip>
    User your-username
    IdentityFile ~/.ssh/google_compute_engine
    # Or use gcloud's IAP tunnel:
    # ProxyCommand gcloud compute ssh oxymake-slurm-login0 --zone us-central1-a --tunnel-through-iap --plain -- -W %h:%p

Then SSH in and run:

ssh oxymake-slurm "cd /mnt/shared/your-pipeline && ox run --profile gcloud"

Option B: IAP Tunnel (no public IP required)

If your login node has no external IP (common for secure setups), use Identity-Aware Proxy:

# Direct SSH via IAP
gcloud compute ssh oxymake-slurm-login0 \
    --zone us-central1-a \
    --tunnel-through-iap

# Or set up a SOCKS proxy for port forwarding
gcloud compute ssh oxymake-slurm-login0 \
    --zone us-central1-a \
    --tunnel-through-iap \
    -- -D 1080 -N -f

# Forward the OxyMake dashboard port (if using ox dashboard)
gcloud compute ssh oxymake-slurm-login0 \
    --zone us-central1-a \
    --tunnel-through-iap \
    -- -L 8080:localhost:8080 -N -f

Option C: SSH Tunnel for slurmrestd

Forward the slurmrestd port to your workstation and use REST mode:

# Forward slurmrestd (port 6820) to localhost
gcloud compute ssh oxymake-slurm-login0 \
    --zone us-central1-a \
    --tunnel-through-iap \
    -- -L 6820:slurmctld:6820 -N -f

# Run OxyMake in REST mode via the tunnel
ox run --executor slurm --slurm-api http://localhost:6820

Note: REST mode requires slurmrestd to be running on the cluster. Set SLURM_JWT for JWT authentication if required by your cluster.

Cluster Lifecycle

Scale down

GCP auto-scaling shuts down idle nodes. To force-stop:

# Drain the batch partition so it accepts no new job submissions
scontrol update PartitionName=batch State=DRAIN

# Or destroy the cluster entirely
ghpc destroy oxymake-slurm

Cost control

| Resource | Billing | Tip |
| --- | --- | --- |
| Controller | Always on | Use e2-standard-4 (small) |
| Login node | Always on | Use e2-standard-4 (small) |
| Compute nodes | On-demand (auto-scale) | Set max_count conservatively |
| Filestore | Always on (per GB) | Delete when not in use |

For intermittent workloads, consider stopping the controller and login node when not running pipelines:

gcloud compute instances stop oxymake-slurm-controller --zone us-central1-a
gcloud compute instances stop oxymake-slurm-login0 --zone us-central1-a
# Restart when needed:
gcloud compute instances start oxymake-slurm-controller --zone us-central1-a
gcloud compute instances start oxymake-slurm-login0 --zone us-central1-a

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| sinfo shows no nodes | Cluster still provisioning | Wait 5 min, check gcloud compute instances list |
| Jobs stuck in PENDING | No nodes available / auto-scale starting | Wait for nodes to boot; check sinfo -N |
| sbatch: command not found | Not on login/controller node | SSH to the login node first |
| Permission denied on /mnt/shared | Filestore not mounted | Check mount \| grep shared; re-run sudo mount |
| state.db lock error | Running from NFS | Run ox run from local disk on the login node |
| Nodes not auto-scaling | Partition misconfigured | Check scontrol show partition batch |

Oxymakefile Format

OxyMake workflows are defined in Oxymakefile.toml, a declarative TOML file. This page is the complete format reference.

Top-Level Fields

ox_version = "0.1"           # Required. OxyMake format version.

Config Section

The [config] section defines workflow-level variables used for wildcard expansion:

[config]
samples = ["A", "B", "C"]
chromosomes = ["chr1", "chr2", "chr3"]
models = ["linear", "ridge", "lasso"]

Config values are arrays of strings. They drive wildcard expansion in rules.

Rule Definitions

Each rule is a [rule.<name>] table:

[rule.process]
input = ["data/{sample}.csv"]
output = ["results/{sample}.txt"]
shell = "python process.py {input} {output}"

Rule Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| input | Array of strings | No | Input file patterns with {wildcards} |
| output | Array of strings | Yes | Output file patterns with {wildcards} |
| shell | String | One of shell/run/script/call | Opaque shell command |
| run | String | One of shell/run/script/call | Inline script (with lang) |
| script | String | One of shell/run/script/call | Path to script file |
| call | String | One of shell/run/script/call | Python function reference |
| lang | String | With run/script | Language: python, r, julia |
| tags | Array of strings | No | Tags for filtering and grouping |
| resources | Table | No | Resource requirements |
| env | String | No | Environment to use |
| when | String | No | Conditional guard expression |
| materialize | String | No | always, auto, never, final |
| params | Table | No | Rule-specific parameters |

Execution Modes

Four modes form a spectrum from flexibility to optimizability:

shell -- Opaque shell command. Maximum flexibility, no optimization.

[rule.align]
shell = "bwa mem ref.fa {input} > {output}"

run -- Inline script with language specification.

[rule.stats]
lang = "python"
run = """
import pandas as pd
df = pd.read_csv("{input}")
df.describe().to_csv("{output}")
"""

script -- External script file.

[rule.analyze]
lang = "python"
script = "scripts/analyze.py"

call -- Pure function reference. Supports in-memory Arrow IPC passing.

[rule.features]
input = [{ path = "data/{sample}.parquet", format = "parquet" }]
output = [{ path = "features/{sample}.parquet", format = "parquet", materialize = "auto" }]
call = "pipeline.features:compute_features"
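
The call value follows the common "module:function" reference convention. As a minimal sketch of how such a string can be resolved to a Python callable, the helper below uses importlib. The name `resolve_call` is hypothetical, and how OxyMake itself resolves the reference or which arguments it passes is not described here.

```python
import importlib


def resolve_call(reference):
    """Resolve a 'package.module:function' string to a callable (illustrative only)."""
    module_name, sep, attr = reference.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {reference!r}")
    module = importlib.import_module(module_name)
    func = getattr(module, attr)
    if not callable(func):
        raise TypeError(f"{reference!r} is not callable")
    return func


# Standard-library stand-in for "pipeline.features:compute_features".
join = resolve_call("os.path:join")
print(join("features", "A.parquet"))
```

Failing early on a malformed reference keeps the error close to the rule definition rather than surfacing at execution time.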

Wildcards

Wildcards in {braces} are resolved from [config] arrays or inferred from existing files:

[config]
samples = ["A", "B"]

[rule.process]
input = ["data/{sample}.csv"]     # {sample} expanded from config.samples
output = ["results/{sample}.txt"]
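
To make the expansion concrete, here is a small Python sketch that expands a pattern over the Cartesian product of config arrays. It is an illustration under assumptions: the function `expand` is hypothetical, and the mapping from a wildcard name to its config key (sample to samples) is passed explicitly because the documentation does not spell out how OxyMake derives it.

```python
import itertools
import re


def expand(pattern, config, wildcard_to_key):
    """Expand {wildcards} in pattern over the Cartesian product of config arrays."""
    # Unique wildcard names, in order of first appearance.
    names = list(dict.fromkeys(re.findall(r"\{(\w+)\}", pattern)))
    value_lists = [config[wildcard_to_key[name]] for name in names]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in itertools.product(*value_lists)
    ]


config = {"samples": ["A", "B"]}
print(expand("data/{sample}.csv", config, {"sample": "samples"}))
# ['data/A.csv', 'data/B.csv']
```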

Resources

[rule.heavy_job]
output = ["results/big.txt"]
shell = "compute_heavy"
resources = { cpus = 4, mem_gb = 16, gpu = 1, time_min = 60 }

Conditional Guards

[rule.expensive]
output = ["results/{seed}.txt"]
shell = "compute {seed}"
when = "seed in @selected_seeds"

Guards are evaluated at DAG resolution time. Jobs whose guard is false are never created.

Include Directives

Split large workflows across files:

include = ["rules/alignment.toml", "rules/qc.toml"]

Environment Specification

[env.analysis]
type = "uv"
requirements = "requirements.txt"

[rule.analyze]
env = "analysis"

Supported environment types: system, uv, conda, docker, nix.

CLI Commands

OxyMake provides the ox command-line tool. Most commands support --json for structured NDJSON output.

Core Commands

ox init

Initialize a new OxyMake project in the current directory.

ox init

Creates a starter Oxymakefile.toml and .oxymake/ directory.

ox run

Execute the workflow, ensuring requested outputs exist.

ox run                          # Build default targets
ox run results/report.html      # Build a specific target
ox run -j 8                     # Parallel execution (8 jobs)
ox run --rule stats             # Only run jobs from a rule (exact or /regex/)
ox run --json                   # Structured NDJSON output
ox run --note "experiment v2"   # Annotate the run
ox run --no-cache               # Ignore the cache, re-run everything

Options:

  • -j N, --jobs N -- Maximum concurrent jobs (default: 1)
  • --rule RULE -- Only run jobs from this rule (exact name or /regex/)
  • -k, --keep-going -- Continue independent jobs after a failure
  • -n, --dry-run -- Show what would run without executing
  • --json -- Emit NDJSON events on stdout
  • --report-json PATH -- Write the NDJSON event stream to a file
  • --note TEXT -- Attach a note to this run
  • --no-cache -- Ignore cached outputs and re-execute
  • --executor EXEC -- Choose executor: local (default), slurm, ray

Exit codes:

  • 0 -- Success (all jobs succeeded or were cached)
  • 1 -- Runtime error or one or more jobs failed
  • 2 -- Command-line usage error
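
A wrapper script can branch on these exit codes. The sketch below is one possible way to do that from Python; the helper names `describe_exit` and `run_workflow` are hypothetical, and only the three code meanings come from the list above.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {
    0: "success (all jobs succeeded or were cached)",
    1: "runtime error or one or more jobs failed",
    2: "command-line usage error",
}


def describe_exit(code):
    """Translate an exit code into a human-readable meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_workflow(args=("ox", "run")):
    """Run the given command and return (exit code, meaning)."""
    completed = subprocess.run(list(args), check=False)
    return completed.returncode, describe_exit(completed.returncode)


print(describe_exit(1))  # runtime error or one or more jobs failed
```

Calling `run_workflow()` requires ox on the PATH, so the example only prints a lookup.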

ox plan

Show the execution plan without running anything.

ox plan                     # Show what would run (optimized)
ox plan --json              # Structured plan output
ox plan --no-optimize       # Show the raw plan (skip optimization passes)
ox plan --level rules       # Show the RuleGraph instead of the JobGraph

ox lint

Validate the Oxymakefile without executing.

ox lint                     # Check for errors
ox lint --json              # Structured diagnostics

Checks for: syntax errors, missing inputs, cycles, ambiguous rules, undefined wildcards.

Inspection Commands

ox dag

Visualize the dependency graph.

ox dag                      # Graphviz DOT output (default)
ox dag --format mermaid     # Mermaid graph syntax
ox dag --group-by rule      # Collapse nodes by field
ox dag --json               # Structured JSON

ox status

Show current execution status.

ox status                   # Summary of current state
ox status --json            # Structured status

ox logs

View job logs.

ox logs stats-alice         # Logs for a specific job
ox logs --failed            # Logs for all failed jobs

ox history

List past runs.

ox history                  # Recent runs
ox history --json           # Structured history

Management Commands

ox gate

Manage gates (human-in-the-loop checkpoints).

ox gate list                              # Show pending gates
ox gate approve qc_check                  # Approve a gate
ox gate approve qc_check --reason "ok"    # Approve with reason

ox snapshot

Manage workflow snapshots for comparison.

ox snapshot save baseline-v1        # Save current state
ox snapshot diff baseline-v1        # Compare with snapshot
ox snapshot list                    # List snapshots

ox invalidate

Invalidate cached outputs to force re-execution.

ox invalidate stats                 # Invalidate a rule
ox invalidate results/alice.txt     # Invalidate a specific output

ox clean

Remove outputs and cache.

ox clean                    # Remove all outputs
ox clean --cache            # Also remove cache
ox clean --state            # Delete a corrupt state.db (it is a regenerable cache)

ox cancel

Cancel running jobs.

ox cancel                   # Cancel all running jobs
ox cancel stats-alice       # Cancel a specific job

ox top

Live TUI dashboard for monitoring execution.

ox top                      # Interactive dashboard

Shows real-time job status, resource utilization, and DAG progress.

Global Options

Every command accepts:

| Flag | Description |
| --- | --- |
| --color <MODE> | Color output mode (auto, always, never) |
| -V, --version | Print version |
| -h, --help | Print help |

Most subcommands additionally accept --json (structured NDJSON output) and -v/-vv (increase verbosity).

ox lock

Generate or verify a reproducibility lockfile.

The ox lock command captures a cryptographic snapshot of the entire workflow — rule definitions, config values, input hashes — into an ox.lock file. Use it to detect unintended changes between runs or across machines.
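
The idea of a cryptographic snapshot can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the real ox.lock format, hash function, and field names are not documented on this page, so SHA-256 over canonical JSON and the function `workflow_digest` are stand-ins.

```python
import hashlib
import json


def workflow_digest(rules, config, input_hashes):
    """Hash rule definitions, config values, and input hashes into one digest."""
    # Canonical JSON (sorted keys, fixed separators) makes the digest
    # independent of dictionary insertion order.
    payload = json.dumps(
        {"rules": rules, "config": config, "inputs": input_hashes},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


rules = {"process": {"shell": "python process.py {input} {output}"}}
config = {"samples": ["A", "B", "C"]}
inputs = {"data/A.csv": hashlib.sha256(b"example bytes").hexdigest()}

locked = workflow_digest(rules, config, inputs)
drifted = workflow_digest(rules, {"samples": ["A", "B"]}, inputs)
print(locked == drifted)  # False
```

Any change to a rule, a config value, or an input hash changes the digest, which is the property a verify step relies on.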

Subcommands

ox lock generate

Generate an ox.lock file from the current workflow state.

ox lock generate                        # Write ox.lock next to Oxymakefile.toml
ox lock generate -o locks/my.lock       # Write to a custom path
ox lock generate -f path/Oxymakefile.toml

Options:

| Flag | Description |
| --- | --- |
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |
| -o, --output <OUTPUT> | Output lockfile path (default: ox.lock next to the Oxymakefile) |

ox lock verify

Verify the current state against an existing ox.lock.

ox lock verify                          # Verify against ox.lock
ox lock verify -l locks/my.lock         # Verify against a custom lockfile

Options:

| Flag | Description |
| --- | --- |
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |
| -l, --lockfile <LOCKFILE> | Lockfile path (default: ox.lock next to the Oxymakefile) |

Exit codes:

  • 0 — Lock matches current state
  • 1 — Mismatch detected (details printed to stderr)

Examples

# Pin the workflow before a release
ox lock generate
git add ox.lock && git commit -m "lock: pin workflow v2.1"

# CI: verify nothing drifted
ox lock verify || { echo "Workflow changed since lock!"; exit 1; }

ox test

Test and validate a workflow without executing it.

The ox test command resolves the DAG, checks for structural errors, and optionally simulates execution order — all without running any shell commands. Use it to catch misconfigurations before committing to a full run.

Usage

ox test                             # Validate entire workflow
ox test results/report.html         # Validate a specific target
ox test --dry-run                   # Simulate execution order
ox test --json                      # Output NDJSON diagnostics

Arguments

| Argument | Description |
| --- | --- |
| [TARGETS]... | Target files or patterns to test (default: all) |

Options

| Flag | Description |
| --- | --- |
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |
| -n, --dry-run | Simulate execution order without running |
| --json | Output NDJSON |

What It Checks

  • Oxymakefile parses without errors
  • All wildcards resolve against [config] values
  • Dependency graph is acyclic
  • Every input is either a source file or produced by a rule
  • Wildcard constraints are satisfied

Examples

# Quick validation in CI
ox test || exit 1

# Check a single target's dependency chain
ox test results/{sample}_stats.tsv

# Dry-run to see execution order
ox test --dry-run

See Also

  • ox lint — lighter-weight syntax check
  • ox plan — show full execution plan

ox dashboard

Web dashboard for monitoring and DAG visualization.

The ox dashboard command starts a local HTTP server that serves an interactive web UI. The dashboard reads from the OxyMake state database and provides real-time job status, DAG visualization, and run history.

Usage

ox dashboard                        # Start on http://127.0.0.1:9876
ox dashboard --port 8080            # Custom port
ox dashboard --bind 0.0.0.0         # Listen on all interfaces
ox dashboard --db path/to/state.db  # Custom state database

Options

| Flag | Description |
| --- | --- |
| --db <DB> | Path to state.db (default: .oxymake/state.db) |
| --port <PORT> | Port to listen on (default: 9876) |
| --bind <BIND> | Bind address (default: 127.0.0.1) |

Features

  • Status cards — at-a-glance counts of running, succeeded, and failed jobs
  • DAG visualization — interactive dependency graph
  • Job table — sortable list of all jobs with status and timing
  • Run history — browse past runs and their outcomes

Examples

# Start dashboard alongside a long-running workflow
ox run -j 8 &
ox dashboard
# Open http://127.0.0.1:9876 in a browser

# Expose to the local network (e.g. for a shared workstation)
ox dashboard --bind 0.0.0.0 --port 8080

ox translate

Translate a Snakefile into OxyMake TOML.

The ox translate command parses a Snakemake Snakefile and emits an equivalent Oxymakefile.toml. Use it to migrate existing Snakemake workflows to OxyMake without rewriting rules by hand.

Usage

ox translate Snakefile                       # Writes Snakefile.translated.toml
ox translate Snakefile -o Oxymakefile.toml   # Writes a custom path

When -o is omitted, the translator writes up to two files next to the input:

  • <INPUT>.translated.toml: the generated Oxymakefile
  • <INPUT>.translated.toml.escalations.toml: written only when the translator's intermediate representation (IR) contains escalations

Every run emits a one-line summary to stderr:

translated: N rules (X mechanical, Y with escalations); dropped: Z unsupported top-level constructs; includes: K files NOT followed

ox translate exits with status 2 when escalations were recorded so CI or shell scripts can gate on a clean translation. The files are still written; only the exit code changes.

Arguments

| Argument | Description |
| --- | --- |
| <SNAKEFILE> | Path to the Snakefile to translate |

Options

| Flag | Description |
| --- | --- |
| -o, --output <OUTPUT> | Write the translated TOML to this path instead of the default <INPUT>.translated.toml. The escalation file lands at <OUTPUT>.escalations.toml. |

Translation Notes

The translator handles the most common Snakemake patterns:

  • rule blocks → [[rule]] sections
  • input / output → inputs / outputs
  • expand() calls → OxyMake wildcard {sample} syntax
  • params → [rule.params]
  • shell → command

Complex Python logic inside Snakefiles (e.g., run: blocks, conditional inputs, lambda wildcards) may require manual adjustment after translation. Review the generated TOML and run ox lint to verify.

Examples

# Quick migration — produces Snakefile.translated.toml
ox translate Snakefile
ox lint -f Snakefile.translated.toml   # Verify the result
ox plan -f Snakefile.translated.toml   # Check execution plan

# Custom output path
ox translate Snakefile -o Oxymakefile.toml

# CI gate: fail the job when escalations were emitted
ox translate Snakefile || { echo "needs manual review"; exit 1; }

ox query

Query the dependency graph using Bazel-style expressions.

Usage

ox query <EXPRESSION> [OPTIONS]

Expressions

| Expression | Description |
| --- | --- |
| deps(X) | All transitive dependencies of target X |
| rdeps(X) | All targets that transitively depend on X |
| allpaths(X, Y) | All paths from X to Y in the DAG |

Options

| Flag | Description |
| --- | --- |
| --json | Output JSON instead of human-readable text |
| -f, --file <FILE> | Oxymakefile path (default: Oxymakefile.toml) |

Examples

# What does annotate depend on?
ox query 'deps(annotate)'

# What depends on the data rule? (reverse dependencies)
ox query 'rdeps(data)'

# All paths from data to annotate
ox query 'allpaths(data, annotate)'

# JSON output for programmatic use
ox query 'deps(annotate)' --json
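
The semantics of the three expressions can be sketched on a plain dictionary graph. This Python example is only an illustration of the graph operations: the function names mirror the query expressions, but the graph layout and the rule names align and qc are hypothetical, and it assumes the graph is acyclic.

```python
def deps(graph, target):
    """All transitive dependencies of target (graph maps node -> direct deps)."""
    seen, stack = set(), list(graph.get(target, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def _dependents(graph):
    """Invert the graph: map each node to the nodes that directly depend on it."""
    reverse = {}
    for node, direct in graph.items():
        for dep in direct:
            reverse.setdefault(dep, []).append(node)
    return reverse


def rdeps(graph, target):
    """All nodes that transitively depend on target."""
    return deps(_dependents(graph), target)


def allpaths(graph, start, end):
    """All paths from an upstream node to a downstream node (DAG assumed)."""
    reverse = _dependents(graph)
    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)
            return
        for nxt in reverse.get(node, ()):
            walk(nxt, path + [nxt])

    walk(start, [start])
    return paths


graph = {"annotate": ["align", "qc"], "align": ["data"], "qc": ["data"], "data": []}
print(sorted(deps(graph, "annotate")))      # ['align', 'data', 'qc']
print(sorted(rdeps(graph, "data")))         # ['align', 'annotate', 'qc']
print(allpaths(graph, "data", "annotate"))  # [['data', 'align', 'annotate'], ['data', 'qc', 'annotate']]
```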

ox export

Export an Oxymakefile to another workflow format.

Usage

ox export <FORMAT> [OPTIONS]

Formats

| Format | Description |
| --- | --- |
| snakemake | Export to Snakemake format (Snakefile + config.yaml) |

Options

| Flag | Description |
| --- | --- |
| -f, --file <FILE> | Path to the Oxymakefile (default: Oxymakefile.toml) |
| -o, --output <FILE> | Write output to a file instead of stdout |

Examples

# Export to stdout
ox export snakemake

# Export to file
ox export snakemake -o Snakefile

# Export a specific Oxymakefile
ox export snakemake -f pipelines/Oxymakefile.toml -o Snakefile

Bidirectional Translation

OxyMake supports bidirectional Snakemake translation:

  • Import: ox translate Snakefile converts Snakemake to OxyMake TOML
  • Export: ox export snakemake converts OxyMake TOML back to Snakemake

This enables zero-friction migration in both directions.

Configuration

OxyMake uses a layered configuration system. Workflow-level settings live in Oxymakefile.toml, and project-level settings live in .oxymake/config.toml.

Workflow Configuration

The [config] section in Oxymakefile.toml defines variables for wildcard expansion:

[config]
samples = ["A", "B", "C"]
models = ["linear", "ridge"]

These values drive wildcard resolution in rules.

Project Settings

The .oxymake/config.toml file (created by ox init) stores project-level defaults:

[defaults]
jobs = 4                    # Default -j value
executor = "local"          # Default executor
materialize = "always"      # Default materialization policy

[cache]
dir = ".oxymake/cache"      # Cache directory location
max_size_gb = 10            # Maximum cache size

[state]
dir = ".oxymake"            # State directory

Environment Variables

OxyMake respects the following environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| OXYMAKE_JOBS | Default parallelism | 1 |
| OXYMAKE_EXECUTOR | Default executor | local |
| OXYMAKE_CACHE_DIR | Cache directory | .oxymake/cache |
| OXYMAKE_LOG | Log level | warn |
| OX_CACHE_VALIDATION | Cache validation strategy (mtime, mtime+hash, hash) | mtime+hash |

Configuration Precedence

Settings are resolved in order (later overrides earlier):

  1. Built-in defaults
  2. User global config (~/.config/oxymake/config.toml)
  3. .oxymake/config.toml
  4. Environment variables
  5. Command-line flags
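
The layering above can be pictured as a chain of lookups where the first layer that defines a key wins. The Python sketch below uses collections.ChainMap to show the idea; the function `resolve_settings` and the sample values are hypothetical and do not reflect how OxyMake loads its configuration.

```python
from collections import ChainMap


def resolve_settings(defaults, user_global, project, env, cli):
    """Merge five configuration layers; later layers override earlier ones."""
    # ChainMap searches left to right, so the highest-precedence layer goes first.
    return dict(ChainMap(cli, env, project, user_global, defaults))


settings = resolve_settings(
    defaults={"jobs": 1, "executor": "local"},
    user_global={},
    project={"jobs": 4},   # from .oxymake/config.toml
    env={"jobs": 8},       # e.g. parsed from OXYMAKE_JOBS
    cli={},                # no -j flag given
)
print(settings["jobs"], settings["executor"])  # 8 local
```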

State Directory

The .oxymake/ directory contains:

.oxymake/
  state.db          # SQLite execution state + audit log
  cache/            # Content-addressable output cache
  config.toml       # Project settings

The state database (state.db) uses SQLite WAL mode for concurrent access. It must reside on local disk (not NFS/Lustre/GPFS).
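
For readers unfamiliar with WAL mode, the following Python sketch enables it on a throwaway SQLite file and reads from a second connection while the first is still open. The jobs table and its columns are invented for the example and are not OxyMake's schema. WAL coordinates connections through a shared-memory file next to the database, which is why SQLite's documentation advises against using it on network filesystems.

```python
import os
import sqlite3
import tempfile


def wal_demo():
    """Create a throwaway state.db in WAL mode and read it from a second connection."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "state.db")
        writer = sqlite3.connect(db_path)
        # The PRAGMA returns the journal mode actually in effect.
        mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        # Hypothetical table, for illustration only.
        writer.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT)")
        writer.execute("INSERT INTO jobs VALUES (?, ?)", ("stats-alice", "running"))
        writer.commit()
        # A second connection can read while the first stays open.
        reader = sqlite3.connect(db_path)
        rows = reader.execute("SELECT id, status FROM jobs").fetchall()
        reader.close()
        writer.close()
    return mode, rows


print(wal_demo())
```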

Expression Language

OxyMake includes a minimal expression language for conditional guards and dynamic values in workflow definitions. The language is deliberately limited: pure functions, no loops, no side effects.

Guard Expressions

The when field on a rule accepts a boolean expression:

[rule.expensive_model]
output = ["results/{seed}_{model}.txt"]
shell = "train --seed {seed} --model {model}"
when = "seed in @selected_seeds"

If the guard evaluates to false, the job is not created in the DAG.
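
The effect of a guard on job creation can be sketched as a filter over candidate wildcard bindings. This Python example handles only the `NAME in @LIST` form and is not OxyMake's evaluator; the function `guard_passes`, the regular expression, and the sample values are assumptions made for illustration.

```python
import re

# Matches "seed in @selected_seeds" or "wildcards.seed in @selected_seeds".
GUARD = re.compile(r"^\s*(?:wildcards\.)?(\w+)\s+in\s+@(\w+)\s*$")


def guard_passes(expression, wildcards, config):
    """Evaluate a membership guard against one set of wildcard values."""
    match = GUARD.match(expression)
    if match is None:
        raise ValueError(f"unsupported guard in this sketch: {expression!r}")
    name, list_name = match.groups()
    return wildcards[name] in config[list_name]


config = {"selected_seeds": ["1", "7"]}
candidates = [{"seed": s, "model": "ridge"} for s in ["1", "2", "7"]]

# Only bindings whose guard is true become jobs.
jobs = [w for w in candidates if guard_passes("seed in @selected_seeds", w, config)]
print([w["seed"] for w in jobs])  # ['1', '7']
```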

Supported Operators

Membership

when = "sample in @high_priority_samples"   # Check if wildcard is in a config list
when = "model in ['linear', 'ridge']"       # Check against inline list

Comparison

when = "wildcards.threshold >= 0.5"
when = "wildcards.replicate != 'control'"

Logical

when = "sample in @fast_samples and model == 'linear'"
when = "not (sample in @excluded)"

Variable References

Wildcards

Access wildcard values with bare names or the wildcards. prefix:

shell = "process {sample}"                   # Bare wildcard in commands
when = "wildcards.sample in @selected"       # Explicit prefix in guards

Config References

Reference config arrays with @:

when = "sample in @priority_samples"         # @name refers to config.name

Built-in Variables

| Variable | Description |
| --- | --- |
| {input} | Resolved input path(s) |
| {output} | Resolved output path(s) |
| {wildcards.NAME} | Resolved wildcard value |
| {params.NAME} | Rule parameter value |
| {rule} | Rule name |

String Interpolation

In shell, run, and script fields, {braces} perform string interpolation:

shell = "python process.py --input {input} --output {output} --sample {wildcards.sample}"
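
The placeholder syntax resembles Python's str.format, which offers a convenient way to experiment with it. The sketch below is an analogy only: OxyMake's own interpolator may differ in details, and the paths and sample value are made up for the example. It also shows that doubled braces come out as literal braces.

```python
from types import SimpleNamespace

template = "python process.py --input {input} --output {output} --sample {wildcards.sample}"

# SimpleNamespace lets {wildcards.sample} resolve through attribute access.
command = template.format(
    input="data/A.csv",
    output="results/A.txt",
    wildcards=SimpleNamespace(sample="A"),
)
print(command)
# python process.py --input data/A.csv --output results/A.txt --sample A

# Doubled braces survive formatting as literal braces.
literal = 'result = {{"key": "value"}}'.format()
print(literal)
# result = {"key": "value"}
```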

Double braces {{ and }} produce literal braces (useful in Python code):

run = """
result = {{"key": "value"}}
"""

Design Philosophy

The expression language is intentionally not Turing-complete. Complex configuration logic should happen outside the Oxymakefile:

python gen_config.py > config.toml     # Generate config externally
ox run --config config.toml            # Use generated config

This preserves static parseability: any tool can read an Oxymakefile without executing code.

Next Steps