Skills as code, not docs: shipping the scaffold instead of describing it

Where this came from

At monday.com/vibe we build a coding agent that generates apps against the monday platform. The agent runs inside a controlled environment — same language, same boilerplate, same project shape, same SDK. That control is a luxury, and when we started building Vibe's internal skill system on top of it, we noticed something about how other people's skills are written.

Most agent skills today are documentation. A SKILL.md for "add S3 to your service" tells the agent which package to install, which env vars to set, which exception to retry, which IAM policy to attach. The agent reads it, internalises it, and applies it to whatever project shape happens to be in front of it. Same skill text, every project, every time.

Vibe's skills don't read like that. Because the project shape is fixed, we don't have to tell the agent how to wire S3 into a generic Express app — that's not the shape it'll land in. We ship the wiring instead. The skill installs deps, drops a configured client into a known path, registers a route, and the SKILL.md covers only the part that genuinely needs agent judgement: when to use it, which bucket policy to pick, what to assert.
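
To make the contrast concrete, here is a minimal TypeScript sketch of what "shipping the wiring" could look like when the project shape is known in advance. Everything in it is hypothetical and only for illustration: the Project and SkillPayload shapes, the applySkill helper, the src/clients/s3.ts path, the pinned version and the /api/uploads route are my assumptions, not Vibe's or skillpack's actual format.

```typescript
// Hypothetical model of a project whose shape is fixed and known.
interface Project {
  dependencies: Record<string, string>;
  files: Record<string, string>; // path -> file contents
  routes: string[];
}

// Hypothetical payload a skill ships instead of prose instructions.
interface SkillPayload {
  name: string;
  dependencies: Record<string, string>;
  files: Record<string, string>;
  routes: string[];
}

// Apply the payload without mutating the input project. It refuses to
// overwrite existing files, so a second run cannot clobber later edits.
function applySkill(project: Project, skill: SkillPayload): Project {
  for (const path of Object.keys(skill.files)) {
    if (path in project.files) {
      throw new Error(`${skill.name}: ${path} already exists`);
    }
  }
  const newRoutes = skill.routes.filter((r) => project.routes.indexOf(r) === -1);
  return {
    dependencies: { ...project.dependencies, ...skill.dependencies },
    files: { ...project.files, ...skill.files },
    routes: [...project.routes, ...newRoutes],
  };
}

// Illustrative S3 skill: path, version and route are assumptions.
const s3Skill: SkillPayload = {
  name: "s3",
  dependencies: { "@aws-sdk/client-s3": "^3.0.0" },
  files: {
    "src/clients/s3.ts": [
      'import { S3Client } from "@aws-sdk/client-s3";',
      "export const s3 = new S3Client({ region: process.env.AWS_REGION });",
    ].join("\n"),
  },
  routes: ["/api/uploads"],
};

const emptyProject: Project = { dependencies: {}, files: {}, routes: [] };
const wired = applySkill(emptyProject, s3Skill);
```

The design point is that nothing here needs the agent's attention. The SKILL.md that accompanies a payload like this only has to cover the judgement calls.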

It worked unreasonably well. Token spend per task dropped, output got more consistent, classes of bug that used to leak into production stopped doing so. So I started asking the obvious follow-up: how much of this is portable to projects where the agent doesn't run inside our controlled boilerplate?

More teams have "vibe-shaped" surface than they think

The fully-controlled environment is the strong-form version. The weaker, more common version is just internal conventions — a company's preferred Next.js setup, the way the platform team thinks rate limiting should be wired, the boilerplate everyone copy-pastes into new services. Anywhere a team has converged on a way of doing something, the skill should ship the conventions, not describe them.

That's the design surface this post explores: skills that ship the answer, paired with a scaffold that's already shaped the way the skill expects. Not as a thesis — as an open-source artifact you can run.

skillpack

skillpack is the POC I built to test the claim — small, focused, just enough surface to run the eval below. One CLI command — skillpack scaffold react remotion — drops a working project into your cwd, dependencies installed, conventions wired, footguns fixed in the template code itself, and a small set of SKILL.md files that load on demand.

This post is a follow-on to the previous one on SDKs vs MCP. That post argued: encode the API surface once in code, don't make the agent rediscover it on every step. This one extends the same argument one layer up — encode the project surface once in code, don't make the agent rediscover that either.

The rest of the post is the eval that tested it.

Headline result — three-way Remotion eval

We pitted skillpack against the two alternatives a team would actually consider: no skill at all, and the maintainer's own production skill. Same prompt: build a 10-second Remotion video, verify with install + typecheck + a successful headless render to MP4. Three trials per cell, fresh-context claude -p, claude-sonnet-4-6.

  • no_skill — empty cwd, no skill, agent designs everything from scratch.
  • remotion_skill — the Remotion team's own production skill (SKILL.md + 36-rule reference tree) installed at .claude/skills/. The best docs-as-skill you can buy.
  • skillpack: skillpack scaffold react remotion runs first (timed), then pnpm install (timed), then the agent. AGENTS.md and the skillpack-wrapped Remotion skill auto-load.
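
The pass/fail check applied to every cell (install, typecheck, headless render) can be sketched roughly as below. This is not the harness from the repo. The Step and Runner types, the use of pnpm exec for the typecheck and render steps, and the entry point, composition id and output path are all assumptions made for illustration. The runner is injected so the sketch stays free of Node-specific imports.

```typescript
// One verification step: a label plus the command line to run.
interface Step {
  label: string;
  cmd: string;
  args: string[];
}

// Injected command runner that returns the process exit code.
type Runner = (cmd: string, args: string[]) => number;

// Build the three checks. entryPoint, compositionId and output are
// hypothetical values supplied by the caller.
function verificationSteps(entryPoint: string, compositionId: string, output: string): Step[] {
  return [
    { label: "install", cmd: "pnpm", args: ["install"] },
    { label: "typecheck", cmd: "pnpm", args: ["exec", "tsc", "--noEmit"] },
    {
      label: "render",
      cmd: "pnpm",
      args: ["exec", "remotion", "render", entryPoint, compositionId, output],
    },
  ];
}

// Run the steps in order and stop at the first non-zero exit code.
function verify(steps: Step[], run: Runner): { passed: boolean; failedAt: string | null } {
  for (const step of steps) {
    if (run(step.cmd, step.args) !== 0) {
      return { passed: false, failedAt: step.label };
    }
  }
  return { passed: true, failedAt: null };
}

const steps = verificationSteps("src/index.ts", "MyVideo", "out/video.mp4");
```

Stopping at the first failure matters for the footguns discussed later: a project can pass install and typecheck and still fail only at the render step.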

The skillpack agent ran zero setup commands. The other two cells averaged 7 and 8.7 Bash calls respectively, mostly scaffolding. That's the result; the table below is the consequence.

| Cell | MP4 ✓ | 1st render | Turns | Tools | Output tokens | Cost | Agent time | Total time |
|---|---|---|---|---|---|---|---|---|
| no_skill | 100% | 100% | 13 ± 2 | 12 ± 2 | 2,794 ± 153 | $0.210 ± 0.013 | 140 s | 140 s |
| remotion_skill | 100% | 100% | 18 ± 3 | 16 ± 3 | 3,411 ± 611 | $0.262 ± 0.028 | 169 s | 170 s |
| skillpack | 100% | 100% | 13 ± 2 | 11 ± 2 | 2,452 ± 779 | $0.192 ± 0.034 | 124 s | 144 s |

Skillpack is best or tied-best on every per-agent metric: cheapest (−9% vs no_skill, −27% vs remotion_skill), fewest output tokens, fewest tool calls, fastest agent time, and level with no_skill on turns. Total wall-clock including the scaffold+install step (~17 s) is only 4 s slower than no_skill, so the setup time is almost entirely recovered in saved agent work.
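
The percentages can be recomputed from the table means. The sketch below copies those means by hand. The pctChange helper and the object layout are mine, not part of the eval harness.

```typescript
// Mean values copied from the results table above.
const cells = {
  no_skill: { cost: 0.21, outputTokens: 2794, agentSeconds: 140, totalSeconds: 140 },
  remotion_skill: { cost: 0.262, outputTokens: 3411, agentSeconds: 169, totalSeconds: 170 },
  skillpack: { cost: 0.192, outputTokens: 2452, agentSeconds: 124, totalSeconds: 144 },
};

// Signed change of `value` relative to `baseline`, in whole percent.
function pctChange(value: number, baseline: number): number {
  return Math.round(((value - baseline) / baseline) * 100);
}

// Skillpack cost relative to each alternative.
const costVsNoSkill = pctChange(cells.skillpack.cost, cells.no_skill.cost); // -9
const costVsRemotion = pctChange(cells.skillpack.cost, cells.remotion_skill.cost); // -27

// Same gap with skillpack as the baseline instead.
const remotionVsSkillpack = pctChange(cells.remotion_skill.cost, cells.skillpack.cost); // 36

// Wall-clock difference once scaffold and install are included.
const wallClockGapSeconds = cells.skillpack.totalSeconds - cells.no_skill.totalSeconds; // 4
```

The baseline matters when quoting these: skillpack being 27% cheaper than remotion_skill is the same gap as remotion_skill being 36% more expensive than skillpack.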

Two surprises in this table:

The maintainer's own skill is the most expensive cell. remotion_skill succeeds first-try but spends 36% more dollars and 39% more output tokens than skillpack (equivalently, skillpack is 27% cheaper and uses 28% fewer output tokens). On a small task, the reading overhead of a 36-file reference tree exceeds the work the skill saves. This is the part most "let's add skills to our agent" projects don't measure: a skill that ships ~30k tokens of rules every step is negative ROI below some task size.

The tool-call mix tells the whole story. no_skill does Bash=7, Write=5.3 (creating from scratch). remotion_skill does Bash=8.7, Write=3, Read=1.7, Edit=1.3, Skill=1 (still creating, plus the skill overhead). skillpack does Read=4.7, Bash=3, Edit=1.3, Skill=1, Write=1 — zero setup commands. Just reads, edits, and uses the skill. The cost number is the consequence of that decomposition, not an independent dimension.
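
As a sanity check, the per-tool averages above add up to the Tools column in the table. The sketch below does that sum. It also computes the share of calls that are Bash or Write, which is my own rough proxy for "creating from scratch" rather than a metric from the eval report. Skillpack's remaining Bash calls are not setup, so the proxy overstates its creation work.

```typescript
type ToolMix = Record<string, number>;

// Mean tool calls per trial, copied from the paragraph above.
const toolMix: Record<string, ToolMix> = {
  no_skill: { Bash: 7, Write: 5.3 },
  remotion_skill: { Bash: 8.7, Write: 3, Read: 1.7, Edit: 1.3, Skill: 1 },
  skillpack: { Read: 4.7, Bash: 3, Edit: 1.3, Skill: 1, Write: 1 },
};

// Sum of all tool calls in one cell.
function totalCalls(mix: ToolMix): number {
  let total = 0;
  for (const tool of Object.keys(mix)) {
    total += mix[tool] || 0;
  }
  return total;
}

// Fraction of calls that are Bash or Write (assumed proxy, see lead-in).
function bashWriteShare(mix: ToolMix): number {
  return ((mix["Bash"] || 0) + (mix["Write"] || 0)) / totalCalls(mix);
}

// Rounded totals line up with the table: 12, 16 and 11 tool calls.
const roundedTotals = Object.keys(toolMix).map((cell) => Math.round(totalCalls(toolMix[cell])));
```

Under that proxy, every no_skill call is Bash or Write, roughly three quarters of remotion_skill's calls are, and a little over a third of skillpack's are.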

Canonical MP4s (trial-1): no_skill.mp4 · remotion_skill.mp4 · skillpack.mp4. Full writeup, methodology, per-trial data, caveats: evals/workspaces/iteration-7/REPORT.md.

Footgun fixes, shipped as code

A Remotion 4 project that's wired almost right will typecheck happily, run in the dev server happily, and then fail at headless render with one of:

  • Visited "http://localhost:3000/index.html" but got no response (React 19 flake)
  • this file does not contain registerRoot (missing entry call)
  • MyVideo.js doesn't exist (TS .js extension imports that webpack doesn't honour)

The official Remotion skill dodges these by telling the agent to run npx create-video, which happens to pin React 18 and call registerRoot for you — sidestepping the issues without ever naming them. The fix is implicit; the next time create-video's defaults change, the skill breaks silently.

skillpack dodges them explicitly: the react/remotion scaffold's Root.tsx calls registerRoot directly and uses bare imports (commit 8a2154c). React version, renderer entrypoint, and module resolution are version-controlled in the boilerplate, not in a sentence the agent might or might not read.

The footgun lives in one place — the scaffold — and is fixed in code. Not in a SKILL.md sentence the agent has to parse correctly on every generation. Not in implicit transitive behaviour of a third-party CLI. In code.

When upstream ships a fix to one of these, it lands in the scaffold once, and every future generation gets it for free.
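
One way to keep those three properties from regressing is a small check that runs against the scaffold template itself. The sketch below is not something skillpack ships, as far as this post states. The checkScaffold function, its regexes and the allowedReactMajor parameter are my assumptions, and which React major the scaffold actually pins is not stated here, so the caller passes it in.

```typescript
interface Findings {
  ok: boolean;
  problems: string[];
}

// Hypothetical preflight check over the scaffold's entry file and package.json.
function checkScaffold(
  entrySource: string,
  packageJsonText: string,
  allowedReactMajor: number,
): Findings {
  const problems: string[] = [];

  // 1. The entry file must actually call registerRoot(...), not just import it.
  if (!/\bregisterRoot\s*\(/.test(entrySource)) {
    problems.push("entry file never calls registerRoot");
  }

  // 2. Relative imports must be bare, with no ".js" suffix.
  if (/from\s+["']\.{1,2}\/[^"']*\.js["']/.test(entrySource)) {
    problems.push("relative import uses a .js extension");
  }

  // 3. The React dependency must sit on the expected major version.
  const pkg = JSON.parse(packageJsonText) as { dependencies?: Record<string, string> };
  const range = pkg.dependencies ? pkg.dependencies["react"] : undefined;
  const major = range ? parseInt(range.replace(/^[^0-9]*/, ""), 10) : NaN;
  if (major !== allowedReactMajor) {
    problems.push("react major is not " + allowedReactMajor);
  }

  return { ok: problems.length === 0, problems };
}

// Illustrative inputs: file contents and version numbers are made up.
const goodEntry = [
  'import { registerRoot } from "remotion";',
  'import { RemotionRoot } from "./Root";',
  "registerRoot(RemotionRoot);",
].join("\n");
const goodResult = checkScaffold(goodEntry, '{"dependencies":{"react":"^18.2.0"}}', 18);
```

A check like this is regex-level and easy to fool (a commented-out registerRoot call would pass), so treat it as a tripwire for the template, not a substitute for the headless render.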

When this pattern applies

The POC tests one shape of agent task: you point an agent at one focused job and want it to ship:

  • "Generate a 15-second product reel for our launch tomorrow."
  • "Build me a 12-slide investor deck from this outline."
  • "Run the cohort retention analysis on this CSV and produce a chart."
  • "Scaffold a dashboard that polls this endpoint and graphs the latency."

For tasks like these, the bottleneck is rarely the task itself — it's the task scaffold. The agent needs to pick a framework, pin compatible versions, dodge the well-known footguns, wire the entrypoints the headless tooling expects, and then start on the actual work. By the time it has, half its turns and a third of its tokens are gone on plumbing the user didn't ask about.

A scaffolded skill pre-pays that cost as a CLI command. The agent inherits a working project and a small, tuned AGENTS.md primer, then spends its turns on the task. In the eval above that translated to −27% cost, −28% output tokens, and −27% agent wall-clock vs. the same agent starting from the best docs-as-skill alternative — with the same first-attempt render success rate.

The pattern is not worth the overhead when:

  • The user is iterating on a long-running, multi-feature codebase — scaffold-once doesn't help here; you want a long-lived CLAUDE.md/AGENTS.md instead.
  • The boilerplate you'd need doesn't exist yet and isn't worth authoring — for a one-off, just hand-bootstrap.
  • The task is dominated by genuine domain reasoning (e.g. "design our retry policy"), not by setup. Scaffolding doesn't help with judgement work.

Where this leaves things

The previous post ended with "encode the API surface once in code, don't make the agent rediscover it." This one ends with the same shape one layer up: encode the project surface once in code, don't make the agent rediscover that either.

The pattern that keeps working, across both posts: figure out the layer that's stable in your environment, encode it once as code, and let the agent spend its tokens on the part that genuinely varies. SDKs do this for API surfaces. skillpack does it for project surfaces. There's probably another layer above this — the task surface, whatever that turns out to mean. I'll write that one when I've shipped it.

skillpack on GitHub is the POC, its eval harness, and the iter-7 raw transcripts. It's a proof, not a project I have the time to maintain — I built it to test the claim above, the eval ran, the claim held up. If anyone wants to take it forward into something real, please do; open an issue and I'll happily hand over context, design notes, and the meta-skills already in the repo.
