Under the hood

Mooter herds the Moos.

Two ideas make local-first routing work without trading off the answer.

Why your laptop can run Opus-grade models now

Quantization, in 30 seconds.

Full-precision AI models are huge. A 30-billion-parameter model in 32-bit floats weighs 120GB — too big for your GPU. Quantization compresses the model's numbers to 4-bit integers, shrinking it to 18GB while keeping ~98% of the quality. The same model now runs on your RTX 4090 instead of a data center. Mooter prefers quantized local models for T0 whenever quality stays above the bar — saving you money without trading off the answer.
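The size figures above follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces them. The 4.8 bits-per-weight figure for Q4_K_M is an assumption chosen for illustration (the format stores 4-bit values plus block scales, so its effective width is somewhat above 4), and the function name is hypothetical.

```javascript
// Approximate size of the weights alone: parameters × bits per weight ÷ 8.
// Uses decimal gigabytes (1 GB = 1e9 bytes), matching the figures in the text.
// KV cache and activations need extra memory on top of this.
function modelSizeGB(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8 / 1e9;
}

const PARAMS = 30e9; // 30-billion-parameter model

// FP32 stores 32 bits per weight.
const fp32GB = modelSizeGB(PARAMS, 32);

// ASSUMPTION: ~4.8 effective bits per weight for Q4_K_M (illustrative only).
const q4GB = modelSizeGB(PARAMS, 4.8);

console.log(`FP32:   ${fp32GB.toFixed(0)} GB`);
console.log(`Q4_K_M: ${q4GB.toFixed(0)} GB`);
console.log(`Weights fit a 24 GB GPU: ${q4GB < 24}`);
```

The same function works for any other model size or bit width, which makes it a quick sanity check before pulling a model.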

qwen3:30b (full precision FP32)
████████████████████  120 GB
✗ doesn't fit your GPU

qwen3:30b (quantized Q4_K_M)
████  18 GB
✓ fits 24GB GPU · ~98% quality

Quantization in mooter

T0 models (local, free)        Q4_K_M default
├─ qwen2.5-coder:7b            5 GB · code
├─ qwen3:30b                   18 GB · reasoning
├─ gemma3:12b                  7 GB · general
└─ deepseek-r1:7b              4 GB · math

T1–T3 models                   served by provider
                               quantization handled cloud-side

Quality delta (T0 quantized vs FP32):
  qwen2.5-coder    -1.8pp
  qwen3:30b        -1.2pp
  gemma3:12b       -2.4pp

Source: mooter benchmark, 34 prompts × 3 arms, blind judge

The local frontier moves fast. As of 2026, notable local-capable coding models include Qwen3-Coder-Next (~58.7% SWE-bench Verified), GLM-5 (~77.8%), DeepSeek V3.2, and Llama 4 Scout (10M-token context). Those are vendor/community-reported numbers — not mooter benchmarks; check each model card before relying on them. mooter routes T0 to whatever you've pulled: the default stays the dependable qwen2.5-coder, and if you pull a stronger model mooter uses it automatically — no config change.

Specialize the brain on your code — locally, overnight.

LoRA and DoRA, in 30 seconds.

A 7-billion-parameter model knows a lot — but it doesn't know your codebase. Re-training from scratch would take weeks and a cluster. LoRA (Low-Rank Adaptation) lets you train a tiny 'patch' — usually under 100MB — that adjusts the model toward your specific style, your conventions, your domain. DoRA is the 2024 refinement: it separates how much the patch moves a weight from which direction, which makes the adapter sharper for the same compute budget. Mooter's Wave 5 trains a DoRA r=32 adapter on your repo locally on your RTX 4090 in 3-6 hours, overnight. Activate it in your terminal. Your code never leaves your machine.
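To see why the patch is so small, count its parameters. For one weight matrix of shape dOut × dIn, LoRA trains two thin matrices, B (dOut × r) and A (r × dIn), so it adds r × (dIn + dOut) trainable values. The sketch below uses a hypothetical 4096 × 4096 layer; real layer shapes vary by model and are not taken from mooter.

```javascript
// Trainable parameters LoRA adds for one weight matrix of shape dOut × dIn:
// B is dOut × r and A is r × dIn, so the count is r * (dIn + dOut).
function loraParams(dOut, dIn, rank) {
  return rank * (dIn + dOut);
}

// HYPOTHETICAL layer shape, for illustration only.
const dOut = 4096;
const dIn = 4096;
const rank = 32; // the r=32 setting mentioned above

const full = dOut * dIn;                     // weights in the frozen matrix
const adapter = loraParams(dOut, dIn, rank); // weights in the patch
const ratio = adapter / full;

console.log(`frozen: ${full}, adapter: ${adapter}, share: ${(ratio * 100).toFixed(2)}%`);
```

At this shape the adapter trains under 2% as many values as the matrix it adjusts, which is why the result is measured in megabytes rather than gigabytes.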

┌─ Base model (frozen, 7B params, 5GB) ─┐
│   ┌──────────────────────────────┐    │
│   │ LoRA adapter (your code)     │    │
│   │ r=32 · ~80MB · trained 4h    │    │
│   └──────────────────────────────┘    │
└────────────────────────────────────────┘
         ↓
   Output specialized to your repo

🛠 Adapter Forge: Wave 5 (coming Q3 2026)

Train your code's brain.
Locally. Overnight. ToS-safe.

  ✓ Self-distillation on your repo
  ✓ DoRA r=32 + Unsloth
  ✓ Qwen3-14B base
  ✓ Eval harness vs Sonnet
  ✓ Hot-swap via vLLM
  ✓ Your code never leaves your machine

Eligibility: 30 days of mooter use + ≥200 logged decisions
Estimated time: 3–6 hours on RTX 4090
Estimated gain: +12pp quality on domain prompts

Status: in development · expected Q3 2026

How a DoRA adapter decomposes a weight

W₀ (frozen) + B · A (rank-r update, LoRA) → DoRA splits it: magnitude m and direction Ŵ, trained separately

LoRA freezes the base weight W₀ and learns a low-rank update B·A (rank r). DoRA goes one step further: it splits the updated weight W₀ + B·A into a magnitude vector m and a normalized direction, and trains the magnitude separately from the low-rank direction update, which gives sharper adapters at the same rank. Implementation reference: HuggingFace PEFT. As of 2026, fused Triton kernels (e.g. Unsloth's fused LoRA/DoRA) cut training memory and roughly double throughput vs the naïve implementation, which is what makes the overnight RTX 4090 run above feasible.
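As a rough illustration of the decomposition, the sketch below builds the merged DoRA weight W' = m · (W₀ + B·A) / ‖W₀ + B·A‖, with the norm taken per column as in the DoRA paper. It is a toy in plain JavaScript on 2 × 2 matrices, not the PEFT or Unsloth implementation, and every function name and value is made up for the example.

```javascript
// Multiply a (d × r) matrix by an (r × k) matrix.
function matmul(X, Y) {
  return X.map((row) =>
    Y[0].map((_, j) => row.reduce((sum, x, i) => sum + x * Y[i][j], 0))
  );
}

// L2 norm of each column of a (d × k) matrix, returned as an array of length k.
function columnNorms(M) {
  return M[0].map((_, j) => Math.sqrt(M.reduce((s, row) => s + row[j] * row[j], 0)));
}

// DoRA merged weight: each column of (W0 + B·A) is normalized to unit length,
// then rescaled by its own trainable magnitude m[j].
function doraWeight(W0, B, A, m) {
  const BA = matmul(B, A);
  const V = W0.map((row, i) => row.map((w, j) => w + BA[i][j])); // direction before normalizing
  const norms = columnNorms(V);
  return V.map((row) => row.map((v, j) => (m[j] * v) / norms[j]));
}

// Toy example with rank r = 1 (all values are made up).
const W0 = [[3, 0], [4, 1]];
const B = [[0], [0]];      // B starts at zero, so B·A = 0
const A = [[1, 1]];
const m = columnNorms(W0); // magnitudes initialized to the column norms of W0

const merged = doraWeight(W0, B, A, m);
console.log(merged); // same values as W0 while B is zero
```

Because B starts at zero and m starts at the column norms of W₀, the adapter initially leaves the model unchanged. Training then moves the direction (through B and A) and the magnitude (through m) independently.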

How the router decides — classify.js + the hook

Mooter is a Claude Code UserPromptSubmit hook, not a proxy. Every prompt passes through inject_context.js (the hook entry) before Claude Code sees it; the hook runs classify.js and emits a <router-hint> + a <tier-badge>. If the hook errors, Claude Code proceeds unchanged — routing never blocks you.
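The fail-open behaviour described above can be sketched as a wrapper that returns nothing whenever anything goes wrong. This is not the actual inject_context.js: the function names, the assumption that the hook input is a JSON string with a `prompt` field, and the text placed inside the two tags are all hypothetical.

```javascript
// Fail-open wrapper: any error yields an empty string, meaning "add nothing",
// so the prompt would proceed unchanged.
// ASSUMPTION: the input is a JSON string with a `prompt` field.
function runHook(rawInput, classify) {
  try {
    const { prompt } = JSON.parse(rawInput);
    if (typeof prompt !== "string") return "";
    const { tier, reason } = classify(prompt);
    return `<router-hint>${reason}</router-hint>\n<tier-badge>${tier}</tier-badge>`;
  } catch (err) {
    return ""; // never block the user on a routing failure
  }
}

// HYPOTHETICAL stand-in classifier, present only so the sketch runs.
const toyClassify = (prompt) =>
  prompt.length < 40
    ? { tier: "T0", reason: "short prompt" }
    : { tier: "T2", reason: "long prompt" };

const ok = runHook(JSON.stringify({ prompt: "rename this variable" }), toyClassify);
const broken = runHook("not json", toyClassify);
const throwing = runHook(JSON.stringify({ prompt: "hi" }), () => {
  throw new Error("boom");
});

console.log(ok);
console.log(JSON.stringify(broken), JSON.stringify(throwing));
```

Malformed input and a crashing classifier both collapse to the same empty result, so the failure path has exactly one shape to reason about.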

prompt
  │
  ▼  UserPromptSubmit hook            inject_context.js
  ▼  pattern match (4 regex banks)    patterns.js  — HIGH / MED / LOW / TRIVIAL risk
  ▼  complexity score → tier T0–T3    classify.js  — TUNED thresholds
  ▼  safety guard                     classify.js  — HIGH_RISK never downgrades (deploy/migration)
  ▼  low confidence? semantic check   arbiter.js   — Haiku arbiter (long-tail only)
  ▼  emit hint + badge                <router-hint> · <tier-badge>
  │
  ▼  Claude Code runs the chosen model

The pattern banks live in patterns.js (HIGH/MED/LOW/TRIVIAL), counted into PATTERN_COUNT (classify.js). User intent wins over the heuristic tier except when it would downgrade a HIGH_RISK prompt — mooter refuses to route a deploy or migration to a weaker model. The arbiter (a cheap Haiku call) only fires on the low-confidence long tail (~17% of prompts); the other ~83% stay on the zero-cost regex fast path. It's all open source — read tools/router/classify.js, patterns.js, arbiter.js and inject_context.js on GitHub.
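A minimal sketch of the decision flow described above, under heavy simplification. The regexes, the direct risk-to-tier mapping (standing in for the complexity score and tuned thresholds) and the rule that "no match" means low confidence are all hypothetical. The real logic lives in classify.js and patterns.js.

```javascript
// HYPOTHETICAL pattern banks; the real ones live in patterns.js.
const BANKS = {
  HIGH: [/\bdeploy\b/i, /\bmigration\b/i],
  MED: [/\brefactor\b/i, /\bdebug\b/i],
  LOW: [/\bexplain\b/i, /\bdocstring\b/i],
  TRIVIAL: [/\brename\b/i, /\btypo\b/i],
};

// HYPOTHETICAL mapping from risk level to default tier number.
const TIER_FOR_RISK = { HIGH: 3, MED: 2, LOW: 1, TRIVIAL: 0 };

// Return the highest-risk bank with a matching pattern, or null if none match.
function matchRisk(prompt) {
  for (const risk of ["HIGH", "MED", "LOW", "TRIVIAL"]) {
    if (BANKS[risk].some((re) => re.test(prompt))) return risk;
  }
  return null;
}

// Pick a tier. `requestedTier` models explicit user intent (0 to 3) or undefined.
function classify(prompt, requestedTier) {
  const risk = matchRisk(prompt);
  if (risk === null) {
    // No regex matched: treat as low confidence and defer to the arbiter.
    return { tier: null, risk, needsArbiter: true };
  }
  let tier = TIER_FOR_RISK[risk];
  if (requestedTier !== undefined) {
    const isDowngrade = requestedTier < tier;
    // Safety guard: user intent wins unless it would downgrade a HIGH-risk prompt.
    if (!(risk === "HIGH" && isDowngrade)) tier = requestedTier;
  }
  return { tier: `T${tier}`, risk, needsArbiter: false };
}

console.log(classify("fix the typo in README"));
console.log(classify("deploy the migration to prod", 0));
console.log(classify("explain this function", 3));
console.log(classify("what should we do about billing?"));
```

The second call shows the guard: the user asks for the cheapest tier, but the prompt matches the HIGH bank, so the request is ignored. The third call shows intent winning when it raises the tier instead.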

Dynamic Workflows, made visible — the herd 🐄

Anthropic shipped Dynamic Workflows in May 2026: Claude Code spawns up to 16 subagents in parallel (capped at 1000 per run) and fans your prompt across them. It's a great mental model — and mooter reuses it. The one gap Anthropic names in their own guidance is visibility: “no transparent intermediate output, making it challenging to monitor progress in real time.” The 16 agents in flight are a black box until the final answer lands.

A cloud orchestrator can't stream 16 live subagent logs without saturating your terminal and your bill. A local herd can: your GPU is right there, Q4_K_M Moos answer fast enough that the one-liner shows up during the work, and the hook owns the render moment. So mooter inverts the contract — the cheaper the work, the louder it speaks.

| Capability | Claude Dynamic Workflows | Mooter Moos 🐄 |
|---|---|---|
| Spawned per prompt | ✅ up to 16 concurrent | ✅ bounded by your hardware |
| Subagent count visible during execution | ❌ hidden until final answer | ✅ 🐄×N live in the statusline |
| Per-agent activity log | ❌ no transparent intermediate output | ✅ one line per spawn (standard verbosity) |
| Per-agent latency | only after completion | ✅ live avg + Stop digest |
| Where it runs | ☁ Anthropic cloud (Opus 4.8 orchestrator) | hybrid: orchestrator stays on Claude Code; workers can be local Moos |
| Cost per spawn | Anthropic billing | $0 for local Moos (your hardware) |
| “Peak concurrent” stat | not surfaced | ✅ Stop digest: peak concurrent: N |

Honest scope: mooter doesn't replace Dynamic Workflows — the orchestrator stays in Claude Code; Moos are the local workers it can fan to. We don't claim 1000 concurrent Moos (your effective cap is whatever your GPU holds, not Anthropic's cloud limit). We made the local side of the same idea visible — that's it.
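The live counters mentioned above (the 🐄×N count, the running average latency, and the peak-concurrent figure in the Stop digest) need very little bookkeeping. The sketch below shows one way to keep it. The class and method names are hypothetical and the output strings only approximate mooter's. This is not its actual code.

```javascript
// Tracks in-flight workers so a statusline can show a live count and a final
// digest can report peak concurrency and average latency.
// All names are HYPOTHETICAL.
class HerdTracker {
  constructor() {
    this.inFlight = new Map(); // worker id -> start time in ms
    this.latencies = [];       // completed durations in ms
    this.peak = 0;
  }
  spawn(id, nowMs) {
    this.inFlight.set(id, nowMs);
    this.peak = Math.max(this.peak, this.inFlight.size);
  }
  finish(id, nowMs) {
    const start = this.inFlight.get(id);
    if (start === undefined) return; // unknown id: ignore
    this.inFlight.delete(id);
    this.latencies.push(nowMs - start);
  }
  avgLatencyMs() {
    if (this.latencies.length === 0) return 0;
    return this.latencies.reduce((a, b) => a + b, 0) / this.latencies.length;
  }
  statusline() {
    return `🐄×${this.inFlight.size} avg ${Math.round(this.avgLatencyMs())}ms`;
  }
  digest() {
    return `peak concurrent: ${this.peak}`;
  }
}

// Timestamps are passed in explicitly so the example is deterministic.
const herd = new HerdTracker();
herd.spawn("a", 0);
herd.spawn("b", 10);
herd.spawn("c", 20);
herd.finish("a", 100); // took 100 ms
herd.finish("b", 210); // took 200 ms
console.log(herd.statusline());
herd.finish("c", 320); // took 300 ms
console.log(herd.digest());
```

In a real hook the timestamps would come from a clock such as `Date.now()`. Passing them in keeps the tracker easy to test.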

Newer, faster local backends — opt-in, never default

The local-first path with the frozen classifier is what runs out of the box. On top of it, mooter ships three performance backends you can turn on when your hardware supports them — each is opt-in, and each falls back gracefully when it can't help.

  • 3-bit KV cache (TurboQuant). Google DeepMind's TurboQuant (ICLR 2026, arXiv:2504.19874) shrinks the KV cache 3.6–5.2× (model-dependent). It's experimental and built from source — mainline llama.cpp hasn't merged it — so mooter wraps the build and stays on stock inference until you enable it.
  • Speculative decoding (EAGLE-3) via vLLM. A draft model proposes tokens the target verifies in parallel — 2–2.5× faster on a GPU. mooter checks VRAM headroom first and falls back to plain vLLM when it's short.
  • MiniMax M3, ready on day one. The weights aren't public yet (expected ~June 11, 2026). A watcher polls HuggingFace and offers a one-command Ollama install the moment they land — nothing downloads until you say so.
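To make the draft-and-verify idea in the speculative decoding bullet concrete, here is a toy greedy version. It is a generic illustration of the technique, not EAGLE-3 itself and not vLLM's API. Both "models" are made-up functions over integers, and a real engine verifies all drafted positions in one batched target pass rather than one call per position.

```javascript
// One greedy draft-and-verify step. `draftNext` and `targetNext` map a token
// sequence to the next token; both are toy stand-ins for real models.
function speculativeStep(prefix, draftNext, targetNext, k) {
  // 1. The draft model proposes k tokens autoregressively.
  const proposed = [];
  for (let i = 0; i < k; i++) {
    proposed.push(draftNext([...prefix, ...proposed]));
  }
  // 2. The target verifies: keep proposals while they match its own choice.
  const accepted = [];
  for (const tok of proposed) {
    const want = targetNext([...prefix, ...accepted]);
    if (want !== tok) {
      accepted.push(want); // first mismatch: take the target's token and stop
      return accepted;
    }
    accepted.push(tok);
  }
  // 3. All k matched: the verification pass also yields one bonus token.
  accepted.push(targetNext([...prefix, ...accepted]));
  return accepted;
}

// Toy models: the target counts up by one; the draft agrees except that it
// gets the token after every multiple of 3 wrong.
const targetNext = (seq) => seq[seq.length - 1] + 1;
const draftNext = (seq) => {
  const last = seq[seq.length - 1];
  return last % 3 === 0 ? last + 2 : last + 1;
};

let seq = [1];
let steps = 0;
while (seq.length < 10) {
  seq = seq.concat(speculativeStep(seq, draftNext, targetNext, 3));
  steps++;
}
console.log(seq, `in ${steps} verification steps`);
```

The output is exactly what the target alone would have produced, since every emitted token is one the target chose. The gain is that several tokens are committed per verification step whenever the draft guesses well.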

Honest scope: none of these are on by default, and none of them change which tier a prompt gets — classify.js still decides that, and its logic has been byte-frozen for 12 consecutive releases. They make the local side faster and lighter; the routing you trust is unchanged.