May 23, 2026 9 min EN ulk claude-code experience-report pros-cons

ulk: pros and cons

This article cuts to the chase: what ulk concretely brings, what it costs, and in which cases you’re better off skipping it. All claims are sourced from the repo (CLAUDE.md, registry.json, _shared/).

Pros

1. Shared vocabulary

On a team project, “launch Sargeras” replaces half a page of audit prompt. Every team member invokes the same agents and gets the same output structure. The project documentation (CLAUDE.md after install) lists the key agents, so even a newcomer knows where to start.

Measurable benefit: fewer ad hoc prompts, less variance between runs.

2. Reproducibility

Agents are static markdown files. The same /ulk:sargeras launched by two different people on the same repo will produce very similar outputs (up to the LLM’s non-determinism). Source: framework/agents/registry.json is auto-generated and versioned.

3. Context economy

Several explicit levers:

Systematic sub-agents for heavy explorations (hygiene Rule 3). The sub-agent returns only the summary.
Local skills (Figma, Swift, Flutter): Claude reads the SKILL.md instead of reasoning without a guide.
Official Anthropic plugins delegated (/feature-dev, /simplify, /commit): no re-implementation needed. Source: framework/agents/_shared/plugins-protocol.md.
Free local LLMs (apfel on macOS 26+, ollama gemma3:1b) for micro-tasks: commit messages, classification, secret detection. 0 Claude tokens. Source: CLAUDE.md section “CLI Tools”.

4. CLI > MCP

Explicit rule (CLAUDE.md, section “CLI Tools”): “CLI available → use it (0 tokens).” When a CLI tool exists locally (gh, vercel, neonctl, gws), it’s preferred over the corresponding network MCP. Consequence: fewer tokens spent on round-trips and faster execution.
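The rule above boils down to a simple availability check. Here is a minimal Python sketch of that decision, not ulk's actual implementation: the function name choose_backend and the "cli"/"mcp" labels are hypothetical, and only the tool names come from the article.

```python
import shutil


def choose_backend(cli_name: str, which=shutil.which) -> str:
    """Return "cli" when the tool is found on PATH, otherwise "mcp".

    `which` is injectable so the rule can be exercised without the
    real tools being installed (hypothetical design, for illustration).
    """
    return "cli" if which(cli_name) is not None else "mcp"


if __name__ == "__main__":
    # Tool names taken from the article; results depend on the local machine.
    for tool in ("gh", "vercel", "neonctl", "gws"):
        print(f"{tool}: {choose_backend(tool)}")
```

The lookup itself costs no tokens, which is the point of the rule: the fallback to MCP only happens when no local binary exists.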

5. Formalized recurring audits

Sargeras (45) — 10 axes (security, perf, architecture, tests, doc, code quality, CI/CD, a11y, costs, compliance).
ED-209 (52) — dedicated security.
Killbill (56) — cloud costs with a real killswitch.
Context Audit (55) — context health (token waste, unnecessary MCPs, bloated CLAUDE.md).

These audits can run as cloud routines without a local machine (Routine agent, 53). Recommended triggers: daily at 18:00, weekly on Monday, and check_suite.completed (failed).

6. Memory between sessions

The Knowledge Vault Loop (Lovecraft 47) captures decisions in an Obsidian-compatible vault, distributes them into CLAUDE.md, and surfaces them at every startup. It is integrated automatically: 2b3 captures, Godspeed surfaces, Gandalf audits vault health. Source: CLAUDE.md section “Knowledge Vault Loop”.

Effect: you don’t re-explain the architecture at every session.

7. Context hygiene taught

The 4 rules (/rewind, /clear, sub-agents, proactive /compact) are inherited by all agents via _shared/base-rules.md. The /context-audit skill produces a 0-100 score, and Gandalf (34) checks rule compliance. Source: framework/agents/_shared/context-hygiene-protocol.md.

This is rare: few Claude Code frameworks formalize hygiene to this level.

8. Zero lock-in

Agents = markdown with no dependencies.
Installation = file copy into ~/.claude/.
Uninstallation = ./uninstall.sh.
No service to pay, no SaaS.

9. Extensible

Global numbering preserved (next: 68). Authoring convention documented (.claude/rules/agents-authoring.md). The generator framework/cheatheet/generate-registry.cjs updates the registry automatically.

Adding your own agent takes about ten minutes.

10. Integrated third-party community

Skills bundled by default: Figma (7), Swift (7), Flutter (2), context-audit, cwb-app-icon. Official Anthropic plugins delegated instead of reinvented.

Effect: you benefit from the ecosystem.

Cons

1. Learning curve

90 agents, 12 categories, 7 phases, around twenty shared protocols (_shared/), dozens of slash-commands. Even though “10 agents are enough to get started,” the initial read of the repo is dense.

Mitigation: the root CLAUDE.md provides a shortcut, and Bruce (25) is the only name to memorize to get started.

2. Sometimes opaque names

Agent names rely on pop culture (Sargeras = Warcraft, 2b3 = French boy band, Killbill = Tarantino). The naming is internally consistent (see post #4), but:

A dev unfamiliar with these universes has to make an effort to memorize them.
On an international team, some references (2 Be 3) won’t resonate.

Mitigation: registry.json contains functional descriptions, and natural-language aliases (“omniscient audit” for Sargeras) are supported.

3. Cognitive attack surface

With 90 agents, it’s tempting to always hunt for the perfect one. “Which one should I use for this?” can become a form of procrastination. Bruce (25) is meant to arbitrate, but you can still fall into the trap.

Mitigation: stick to the top 10 (see post #4) until you’re comfortable.

4. Dependency on Claude Code

ulk only works with Claude Code (Anthropic’s CLI). If Anthropic changes the API, the skills format, or the slash-command grammar, all of ulk has to be adapted. This is a real vendor dependency.

Mitigation: since agents are standard markdown, they’re portable to other LLM tools (Cursor, Aider) with manual adaptation.

5. LLM cost not directly controlled

ulk saves context (sub-agents, CLI > MCP, local LLMs for micro-tasks), but it still makes intensive use of Claude Opus/Sonnet. A large Sargeras audit can consume several hundred thousand tokens.

Mitigation: Killbill (56) audits cloud costs (Vercel, GitHub, Neon). For Claude API costs, ULK-186 has shipped: Bruce Phase 5.1 reads api-usage.jsonl and displays a cost alert before each audit (“≈ Xk tokens, ≈ $Y”). ULK-184 (PreToolUse hook) and ULK-185 (skill /killbill api-budget) are in progress. In the meantime, track usage from the Anthropic console.
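To make the cost alert concrete, here is a minimal Python sketch of how a usage log could be turned into that “≈ Xk tokens, ≈ $Y” line. Everything about the log schema is an assumption: the field names input_tokens and output_tokens are hypothetical (the article does not document the format of api-usage.jsonl), and the per-million-token prices are caller-supplied placeholders, not Anthropic's actual rates.

```python
import json


def summarize_usage(lines, usd_per_mtok_in, usd_per_mtok_out):
    """Sum tokens over JSONL records and estimate the cost in USD.

    Assumes each record carries `input_tokens` and `output_tokens`
    (hypothetical field names); missing fields count as zero.
    """
    total_in = total_out = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        total_in += record.get("input_tokens", 0)
        total_out += record.get("output_tokens", 0)
    cost = (total_in * usd_per_mtok_in + total_out * usd_per_mtok_out) / 1_000_000
    return total_in + total_out, cost


def format_alert(total_tokens, cost_usd):
    """Render the alert in the shape quoted in the article."""
    return f"≈ {total_tokens / 1000:.0f}k tokens, ≈ ${cost_usd:.2f}"


if __name__ == "__main__":
    sample = [
        '{"input_tokens": 120000, "output_tokens": 30000}',
        '{"input_tokens": 80000, "output_tokens": 20000}',
    ]
    # Placeholder prices per million tokens; substitute your real rates.
    tokens, cost = summarize_usage(sample, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)
    print(format_alert(tokens, cost))
```

Keeping the prices as parameters avoids hard-coding rates that change over time and differ between models.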

6. Structural tests, not behavioral

Since April 2026 (Epic ULK-181→183), a golden file test system validates agent structure: presence of required frontmatter, critical H2 sections, invocation patterns. Code: framework/tests/agents-golden.test.mjs + fixtures in framework/tests/agents/<agent>.golden.md.
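A structural check of this kind can be sketched in a few lines. The example below is written in Python for illustration only: the real suite lives in agents-golden.test.mjs, and its exact rules are not shown in this article. The function name, the required keys, and the section titles used here are all hypothetical.

```python
import re


def check_agent_structure(markdown, required_keys, required_h2):
    """Return a list of structural problems found in an agent file.

    Checks only structure: a leading `---` frontmatter block, the
    presence of given keys, and the presence of given H2 headings.
    """
    problems = []

    # Frontmatter: a block delimited by `---` lines at the top of the file.
    match = re.match(r"^---\n(.*?)\n---\n", markdown, re.DOTALL)
    if match is None:
        problems.append("missing frontmatter block")
    else:
        keys = {
            line.split(":", 1)[0].strip()
            for line in match.group(1).splitlines()
            if ":" in line
        }
        for key in required_keys:
            if key not in keys:
                problems.append(f"missing frontmatter key: {key}")

    # H2 sections: lines starting with exactly two hashes and a space.
    headings = {
        m.group(1).strip()
        for m in re.finditer(r"^## (.+)$", markdown, re.MULTILINE)
    }
    for title in required_h2:
        if title not in headings:
            problems.append(f"missing H2 section: {title}")

    return problems


if __name__ == "__main__":
    agent = "---\nname: sargeras\n---\n\n## Mission\n"
    # Hypothetical requirements, chosen only to show both failure kinds.
    for problem in check_agent_structure(agent, ["name", "description"], ["Mission", "Output"]):
        print(problem)
```

Nothing in such a check calls an LLM, which is why it is free and deterministic.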

Limitation: these tests are structural and deterministic (free, fast, no API calls). They don’t test runtime behavior, so a semantic change to a prompt that degrades the quality of the LLM output can go undetected. Validating behavior still requires running the agent manually.

Mitigation: _shared/ protocols minimize duplication; golden files detect structural regressions; behavioral testing remains to be invented.

7. Limited community

Single-author project (math.drouet). No large volume of external contributors. If the author stops maintaining, the project stagnates.

Mitigation: MIT-compatible, easy to fork, self-contained content.

8. Friction on non-greenfield projects

On an existing project, ulk adds a ~/.claude/ folder (user-level) plus a few files (CLAUDE.md, docs/spec.md, docs/todo.md) if you follow the Shuri pipeline. A team that already has its own doc conventions may resist.

Mitigation: Shuri can be disabled; you can use ulk only for audits (Sargeras, ED-209, Killbill) without touching the documentation.

9. macOS-first

Several integrations are macOS-specific:

apfel (local Apple Intelligence) → macOS 26+ Apple Silicon only.
Skill cwb-app-icon → iOS/macOS (.appiconset generation).
Hooks like --with-xavier-hook tested on macOS.

Source: CLAUDE.md sections “CLI Tools” and “Community Skills”.

Mitigation: ollama gemma3:1b remains cross-platform; the vast majority of agents work on Linux/Windows.

10. No guarantee on outputs

Agents define a protocol, not a deterministic output. The LLM may produce a slightly different report on each run. For compliance or auditable use, it remains a guide, not proof.

Mitigation: reports can be archived (Sargeras writes a dated file); for formal compliance, have a human validate.

When to use ulk

Cases where the benefit/cost ratio is strongly favorable:

Solo dev juggling multiple projects — Xavier + Lovecraft transform the experience.
Team wanting a shared AI vocabulary — /ulk:bruce instead of a Notion page of prompts.
Production project with cloud costs to monitor — Killbill alone can justify the install time.
Recurring audit (security, perf, tech debt) — Sargeras + ED-209 as cloud routines.
Creating a project from scratch — Bruce → Tony → Shuri → project-decomposer saves half a day of framing.

When to skip it

You use Cursor, Copilot, or Aider rather than Claude Code. ulk is unusable as-is.
You’re a complete Claude Code beginner — learn /clear, /compact, sub-agents, skills, MCP first. ulk comes after.
You work exclusively on sensitive private code without authorized external AI. Resolve your organization’s AI usage policy first.
You need determinism (formal SOC2 audits, etc.). A classic linter remains more defensible.
Your project is a 200-line script — the install overhead is not justified.

Verdict in one sentence

ulk is a toolkit of specialized Claude Code agents, relevant once you’ve moved beyond the single script and want to formalize what you repeat every day, at the cost of a learning curve, a dependency on Claude Code, and a usage discipline that has to be maintained.

What’s next

Post #5: 3 concrete use cases with commands.
Video: full end-to-end demo.
Slides: condensed version for a public talk.