Bioinformatics Pipeline
This cookbook walks through a multi-sample FASTQ-to-BAM-to-VCF variant calling
pipeline in OxyMake. The workflow uses sort, grep, and wc as stand-ins
for real bioinformatics tools (BWA, samtools, GATK), so you can run it on any
machine without installing anything.
The concepts transfer directly to a production pipeline: just swap the shell commands for real tool invocations.
What You Will Learn
- Wildcard-driven sample processing across multiple samples
- Named inputs for rules with multiple input files
- Tags for organizing pipeline stages
- Target-based filtering to run a subset of samples
--rulefiltering to run a subset of stages
The Complete Oxymakefile
Create a directory and save this as Oxymakefile.toml:
ox_version = "0.1"
[config]
samples = ["NA12878", "NA12891", "NA12892"]
chromosomes = ["chr1", "chr2", "chr3"]
# ── Default target ──────────────────────────────────────────────
[rule.all]
input = ["results/cohort_report.txt"]
# ── Stage 1: Generate mock FASTQ reads ─────────────────────────
[rule.simulate_reads]
output = ["fastq/{sample}_R1.fastq", "fastq/{sample}_R2.fastq"]
tags = ["stage.simulate", "fast"]
shell = """
mkdir -p fastq
for i in $(seq 1 50); do
echo "@{sample}_read${i}/1 chr$((i % 3 + 1)):$((i * 100))" >> {output[0]}
echo "ACGTACGTACGTACGT" >> {output[0]}
echo "+" >> {output[0]}
echo "IIIIIIIIIIIIIIII" >> {output[0]}
echo "@{sample}_read${i}/2 chr$((i % 3 + 1)):$((i * 100))" >> {output[1]}
echo "TGCATGCATGCATGCA" >> {output[1]}
echo "+" >> {output[1]}
echo "IIIIIIIIIIIIIIII" >> {output[1]}
done
"""
# ── Stage 2: Align reads → sorted BAM ──────────────────────────
# Stand-in: sort the FASTQ by read name to simulate alignment + sorting.
[rule.align]
input = { r1 = "fastq/{sample}_R1.fastq", r2 = "fastq/{sample}_R2.fastq" }
output = ["aligned/{sample}.bam"]
tags = ["stage.align", "compute-heavy"]
resources = { cpu = 4, mem = "8G" }
shell = """
mkdir -p aligned
echo "## BAM for {sample}" > {output}
echo "## Aligned from {input.r1} and {input.r2}" >> {output}
cat {input.r1} {input.r2} | grep "^@" | sort >> {output}
echo "## EOF" >> {output}
"""
# ── Stage 3: Call variants per chromosome ───────────────────────
# Stand-in: grep reads matching the chromosome, count them as "variants."
[rule.call_variants]
input = { bam = "aligned/{sample}.bam" }
output = ["vcf/{sample}_{chrom}.vcf"]
tags = ["stage.call", "compute-heavy"]
resources = { cpu = 2, mem = "4G" }
shell = """
mkdir -p vcf
echo "##fileformat=VCFv4.2" > {output}
echo "##source=oxymake-cookbook" >> {output}
echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> {output}
grep "{chrom}" {input.bam} | awk '{{
split($2, a, ":");
printf "%s\t%s\t.\tA\tG\t30\tPASS\tDP=20\n", a[1], a[2]
}}' >> {output}
"""
# ── Stage 4: Merge per-chromosome VCFs into one per sample ─────
[rule.merge_vcf]
input = ["vcf/{sample}_{chrom}.vcf"]
output = ["vcf/{sample}_merged.vcf"]
tags = ["stage.merge"]
shell = """
echo "##fileformat=VCFv4.2" > {output}
echo "##source=oxymake-merge" >> {output}
echo "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO" >> {output}
for f in {input}; do
grep -v "^#" "$f" >> {output}
done
sort -k1,1 -k2,2n -o {output} {output}
"""
# ── Stage 5: Per-sample QC report ──────────────────────────────
[rule.qc]
input = { bam = "aligned/{sample}.bam", vcf = "vcf/{sample}_merged.vcf" }
output = ["qc/{sample}_report.txt"]
tags = ["stage.qc", "fast"]
shell = """
mkdir -p qc
echo "=== QC Report: {sample} ===" > {output}
echo "Total reads: $(grep -c "^@" {input.bam})" >> {output}
echo "Variants called: $(grep -vc "^#" {input.vcf})" >> {output}
echo "Chromosomes: $(grep -v "^#" {input.vcf} | cut -f1 | sort -u | tr '\n' ' ')" >> {output}
"""
# ── Stage 6: Cohort report ─────────────────────────────────────
[rule.cohort_report]
input = ["qc/{sample}_report.txt"]
output = ["results/cohort_report.txt"]
tags = ["stage.report"]
shell = """
mkdir -p results
echo "=============================" > {output}
echo " Variant Calling Cohort Report" >> {output}
echo "=============================" >> {output}
echo "" >> {output}
for f in {input}; do
cat "$f" >> {output}
echo "" >> {output}
done
echo "--- Summary ---" >> {output}
echo "Samples processed: $(echo {input} | wc -w | tr -d ' ')" >> {output}
"""
Create the Project
mkdir bioinfo-pipeline && cd bioinfo-pipeline
# Save the Oxymakefile.toml above
ox init # if you want the .oxymake directory pre-created
No input data files are needed -- the simulate_reads rule generates
everything from scratch.
Run the Full Pipeline
ox plan
Plan: 5 rules, 15 jobs, 3 source files
Targets: results/cohort_report.txt
1. [simulate_reads-NA12878] rule=simulate_reads -> [fastq/NA12878_R1.fastq, fastq/NA12878_R2.fastq]
2. [simulate_reads-NA12891] rule=simulate_reads -> [fastq/NA12891_R1.fastq, fastq/NA12891_R2.fastq]
3. [align-NA12878] rule=align -> [aligned/NA12878.bam]
4. [call_variants-NA12878-chr1] rule=call_variants -> [vcf/NA12878_chr1.vcf]
...
15. [cohort_report] rule=cohort_report -> [results/cohort_report.txt]
ox run -j 4
OxyMake runs up to 4 jobs in parallel. The simulate_reads jobs run first
(no dependencies), then align, then call_variants fans out across
samples and chromosomes, and finally everything converges into the cohort
report.
Filter by Sample
Run only one sample during development by requesting its leaf target (wildcards in the target select the matching jobs):
ox run "qc/NA12878_report.txt"
This builds the pipeline for NA12878 only, skipping NA12891 and NA12892. Combined with caching, this lets you iterate on pipeline logic without waiting for all samples.
Later, run the full cohort:
ox run
NA12878 is cached. Only NA12891 and NA12892 are computed.
Filter by Rule
Run only the QC stage with --rule (exact name or /regex/; assumes
upstream outputs exist):
ox run --rule qc
View the DAG grouped by stage:
ox dag --group-by tag
Named Inputs
Several rules use named inputs for clarity. Compare:
# Positional (works but cryptic with multiple inputs)
input = ["aligned/{sample}.bam", "vcf/{sample}_merged.vcf"]
shell = "check {input[0]} {input[1]}"
# Named (self-documenting)
input = { bam = "aligned/{sample}.bam", vcf = "vcf/{sample}_merged.vcf" }
shell = "check {input.bam} {input.vcf}"
Named inputs make your workflow readable as it grows.
Adding a New Sample
Edit Oxymakefile.toml:
[config]
samples = ["NA12878", "NA12891", "NA12892", "NA12893"]
Run again:
ox run -j 4
Only the NA12893 jobs run. Everything else is cached.
Adapting to Real Tools
Replace the stand-in commands with real bioinformatics tools:
[rule.align]
input = { r1 = "fastq/{sample}_R1.fastq", r2 = "fastq/{sample}_R2.fastq" }
output = ["aligned/{sample}.bam"]
tags = ["stage.align", "compute-heavy"]
resources = { cpu = 8, mem = "32G" }
shell = """
bwa mem -t {resources.cpu} reference.fa {input.r1} {input.r2} \
| samtools sort -@ 4 -o {output}
samtools index {output}
"""
The workflow structure stays the same. Only the shell commands change.
Next Steps
- Rules and Wildcards -- wildcard expansion and constraints
- Tags and Filtering -- organizing large workflows
- Idempotent Execution -- cooperative multi-session runs