Skills as code, not docs: shipping the scaffold instead of describing it
Where this came from
At monday.com/vibe we build a coding agent that generates apps against the monday platform. The agent runs inside a controlled environment — same language, same boilerplate, same project shape, same SDK. That control is a luxury, and when we started building Vibe's internal skill system on top of it, we noticed something about how other people's skills are written.
Most agent skills today are documentation. A SKILL.md for "add S3 to
your service" tells the agent which package to install, which env vars to
set, which exception to retry, which IAM policy to attach. The agent
reads it, internalises it, and applies it to whatever project shape
happens to be in front of it. Same skill text, every project, every
time.
Vibe's skills don't read like that. Because the project shape is fixed,
we don't have to tell the agent how to wire S3 into a generic Express
app — that's not the shape it'll land in. We ship the wiring instead.
The skill installs deps, drops a configured client into a known path,
registers a route, and the SKILL.md covers only the part that
genuinely needs agent judgement: when to use it, which bucket policy to
pick, what to assert.
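
One way to picture the difference, as a minimal sketch and not Vibe's actual skill format: the skill becomes a small manifest of concrete steps that tooling executes, and the prose shrinks to the judgement calls. Every name below (SkillManifest, the step kinds, the paths and template names) is hypothetical and only illustrates the split.

```typescript
// Hypothetical shape for a skill that ships wiring instead of describing it.
// None of these names come from Vibe or skillpack; they only illustrate the idea.
type SkillStep =
  | { kind: "install"; packages: string[] }
  | { kind: "writeFile"; path: string; template: string }
  | { kind: "registerRoute"; method: "GET" | "POST"; route: string; handler: string };

interface SkillManifest {
  name: string;
  steps: SkillStep[];       // executed by tooling, never read by the agent
  judgementNotes: string[]; // the only part that stays in SKILL.md
}

const s3Skill: SkillManifest = {
  name: "s3-storage",
  steps: [
    { kind: "install", packages: ["@aws-sdk/client-s3"] },
    { kind: "writeFile", path: "src/clients/s3.ts", template: "s3-client.ts.tpl" },
    { kind: "registerRoute", method: "POST", route: "/uploads", handler: "src/routes/uploads.ts" },
  ],
  judgementNotes: ["when to use it", "which bucket policy to pick", "what to assert"],
};

// Render each step as a one-line summary, e.g. for a dry-run log.
function describeStep(step: SkillStep): string {
  switch (step.kind) {
    case "install":
      return `install ${step.packages.join(" ")}`;
    case "writeFile":
      return `write ${step.path} from ${step.template}`;
    case "registerRoute":
      return `route ${step.method} ${step.route} -> ${step.handler}`;
  }
}

const plan: string[] = s3Skill.steps.map(describeStep);
```

The point of the split is that `steps` can only work when the project shape is known in advance, which is exactly the control a fixed boilerplate provides.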
It worked unreasonably well. Token spend per task dropped, output got more consistent, classes of bug that used to leak into production stopped doing so. So I started asking the obvious follow-up: how much of this is portable to projects where the agent doesn't run inside our controlled boilerplate?
More teams have "vibe-shaped" surface than they think
The fully-controlled environment is the strong-form version. The weaker, more common version is just internal conventions — a company's preferred Next.js setup, the way the platform team thinks rate limiting should be wired, the boilerplate everyone copy-pastes into new services. Anywhere a team has converged on a way of doing something, the skill should ship the conventions, not describe them.
That's the design surface this post explores: skills that ship the answer, paired with a scaffold that's already shaped the way the skill expects. Not as a thesis — as an open-source artifact you can run.
skillpack
skillpack is the POC I built to
test the claim — small, focused, just enough surface to run the eval
below. One CLI command — skillpack scaffold react remotion — drops a
working project into your cwd, dependencies installed, conventions
wired, footguns fixed in the template code itself, and a small set of
SKILL.md files that load on demand.
This post is a follow-on to the previous one on SDKs vs MCP. That post argued: encode the API surface once in code, don't make the agent rediscover it on every step. This one extends the same argument one layer up — encode the project surface once in code, don't make the agent rediscover that either.
The rest of the post is the eval that tested it.
Headline result — three-way Remotion eval
We pitted skillpack against the two alternatives a team would actually
consider: no skill at all, and the maintainer's own production skill.
Same prompt: build a 10-second Remotion video, verify with install +
typecheck + a successful headless render to MP4. Three trials per
cell, fresh-context claude -p, claude-sonnet-4-6.
- no_skill: empty cwd, no skill; the agent designs everything from scratch.
- remotion_skill: the Remotion team's own production skill (SKILL.md + 36-rule reference tree) installed at .claude/skills/. The best docs-as-skill you can buy.
- skillpack: the command skillpack scaffold react remotion runs first (timed), then pnpm install (timed), then the agent. AGENTS.md and the skillpack-wrapped Remotion skill auto-load.

The skillpack agent ran zero setup commands. The other two cells averaged 7 and 8.7 Bash calls respectively, mostly scaffolding. That's the result; the table below is the consequence.
| Cell | MP4 ✓ | 1st render | Turns | Tools | Output tokens | Cost | Agent time | Total time |
|---|---|---|---|---|---|---|---|---|
| no_skill | 100% | 100% | 13±2 | 12±2 | 2,794 ± 153 | $0.210 ± 0.013 | 140 s | 140 s |
| remotion_skill | 100% | 100% | 18±3 | 16±3 | 3,411 ± 611 | $0.262 ± 0.028 | 169 s | 170 s |
| skillpack | 100% | 100% | 13±2 | 11±2 | 2,452 ± 779 | $0.192 ± 0.034 | 124 s | 144 s |
Skillpack comes out ahead on every per-agent metric: cheapest (−9% vs
no_skill, −27% vs remotion_skill), fewest output tokens, fewest tool
calls, fastest agent time. Total wall-clock including the
scaffold+install step (~17 s) is only 4 s slower than no_skill, so those
17 s of setup largely pay for themselves in saved agent work. With three
trials per cell, the gap to no_skill sits inside the reported spread, so
read it as directional; the gap to remotion_skill is the larger one.
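
The percentages quoted here follow directly from the Cost and Total time columns of the table. As a quick check, here is a small sketch of that arithmetic; the variable and function names are mine and are not part of the eval harness.

```typescript
// Mean cost per cell in USD and mean total time in seconds, copied from the table.
const meanCostUsd = { no_skill: 0.210, remotion_skill: 0.262, skillpack: 0.192 };
const totalTimeSec = { no_skill: 140, remotion_skill: 170, skillpack: 144 };

// Relative change of `value` against `baseline`, as a rounded percentage.
function pctChange(value: number, baseline: number): number {
  return Math.round(((value - baseline) / baseline) * 100);
}

const costVsNoSkill = pctChange(meanCostUsd.skillpack, meanCostUsd.no_skill);             // -9
const costVsRemotionSkill = pctChange(meanCostUsd.skillpack, meanCostUsd.remotion_skill); // -27

// Wall-clock gap between skillpack (setup included) and no_skill, in seconds.
const totalTimeGapSec = totalTimeSec.skillpack - totalTimeSec.no_skill; // 4
```

Note that the direction of the comparison matters: skillpack is 27% cheaper than remotion_skill, which is not the same statement as remotion_skill being 27% more expensive than skillpack.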
Two surprises in this table:
The maintainer's own skill is the most expensive cell.
remotion_skill succeeds first-try but spends 36% more dollars and 39%
more output tokens than skillpack. On a small task, the reading
overhead of a 36-file reference tree exceeds the work the skill saves.
This is the part most "let's add skills to our agent" projects don't
measure — a skill that ships ~30k tokens of rules every step is negative
ROI below some task size.
The tool-call mix tells the whole story. no_skill does Bash=7,
Write=5.3 (creating from scratch). remotion_skill does Bash=8.7,
Write=3, Read=1.7, Edit=1.3, Skill=1 (still creating, plus the skill
overhead). skillpack does Read=4.7, Bash=3, Edit=1.3, Skill=1,
Write=1 — zero setup commands. Just reads, edits, and uses the skill.
The cost number is the consequence of that decomposition, not an
independent dimension.
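
To see how the mix adds up to the Tools column in the table, here is a small sketch that sums the per-tool means for each cell and computes how much of each cell's activity is Bash. The helper names are mine; the numbers are the ones quoted above.

```typescript
// Mean tool calls per trial, as quoted in the paragraph above.
const toolMix = {
  no_skill: { Bash: 7, Write: 5.3 },
  remotion_skill: { Bash: 8.7, Write: 3, Read: 1.7, Edit: 1.3, Skill: 1 },
  skillpack: { Read: 4.7, Bash: 3, Edit: 1.3, Skill: 1, Write: 1 },
};

// Sum a cell's per-tool means, rounded to one decimal to hide float noise.
function totalCalls(mix: Record<string, number>): number {
  const sum = Object.keys(mix).reduce((acc, tool) => acc + (mix[tool] ?? 0), 0);
  return Math.round(sum * 10) / 10;
}

// Share of a cell's calls that are Bash, as a rounded percentage.
function bashShare(mix: Record<string, number>): number {
  return Math.round(((mix["Bash"] ?? 0) / totalCalls(mix)) * 100);
}

const totals = {
  no_skill: totalCalls(toolMix.no_skill),             // 12.3
  remotion_skill: totalCalls(toolMix.remotion_skill), // 15.7
  skillpack: totalCalls(toolMix.skillpack),           // 11
};
```

The totals round to the 12, 16, and 11 reported in the Tools column, and Bash accounts for more than half of the calls in the two unscaffolded cells versus roughly a quarter for skillpack.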
Canonical MP4s (trial-1):
no_skill.mp4
· remotion_skill.mp4
· skillpack.mp4.
Full writeup, methodology, per-trial data, caveats:
evals/workspaces/iteration-7/REPORT.md.
Footgun fixes, shipped as code
A Remotion 4 project that's wired almost right will typecheck happily,
run in the dev server happily, and then fail at headless render with
one of:
- Visited "http://localhost:3000/index.html" but got no response (React 19 flake)
- this file does not contain registerRoot (missing entry call)
- MyVideo.js doesn't exist (TS .js extension imports that webpack doesn't honour)
The official Remotion skill dodges these by telling the agent to run
npx create-video, which happens to pin React 18 and call
registerRoot for you — sidestepping the issues without ever naming
them. The fix is implicit; the next time create-video's defaults
change, the skill breaks silently.
skillpack dodges them explicitly: the react/remotion scaffold's
Root.tsx calls registerRoot directly and uses bare imports
(commit 8a2154c).
React version, renderer entrypoint, and module resolution are
version-controlled in the boilerplate, not in a sentence the agent might
or might not read.
The footgun lives in one place — the scaffold — and is fixed in code.
Not in a SKILL.md sentence the agent has to parse correctly on every
generation. Not in implicit transitive behaviour of a third-party CLI.
In code.
When upstream ships a fix to one of these, you update the scaffold once and every future generation gets it for free.
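
Fixing footguns in code also means they can be guarded in code. The sketch below is a hypothetical check that a scaffold's own test suite could run against its template files; it is not part of skillpack, and the function name, the issue strings, and the "exact version pin" rule for React are my assumptions about how such a guard might look.

```typescript
// Hypothetical guard for the three footguns named above; not part of skillpack.
interface ScaffoldFiles {
  packageJson: string; // raw contents of package.json
  entrySource: string; // raw contents of the render entry file, e.g. Root.tsx
}

function findFootguns(files: ScaffoldFiles): string[] {
  const issues: string[] = [];

  // 1. React should be pinned to an exact version, not a range (assumption:
  //    an exact pin is how the template keeps the React version under control).
  const pkg = JSON.parse(files.packageJson) as { dependencies?: Record<string, string> };
  const react = pkg.dependencies?.["react"];
  if (react === undefined || !/^\d+\.\d+\.\d+$/.test(react)) {
    issues.push("react is not pinned to an exact version");
  }

  // 2. The entry file must call registerRoot itself.
  if (!/\bregisterRoot\s*\(/.test(files.entrySource)) {
    issues.push("entry file does not call registerRoot");
  }

  // 3. Relative imports must be bare, with no trailing .js extension.
  if (/from\s+["']\.{1,2}\/[^"']*\.js["']/.test(files.entrySource)) {
    issues.push("entry file imports a relative module with a .js extension");
  }

  return issues;
}
```

Called with a template whose package.json pins react to something like "18.3.1" and whose entry file calls registerRoot with bare relative imports, the function returns an empty list; any regression in the template shows up as a named issue before an agent ever touches the project.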
When this pattern applies
The POC tests one shape of agent task: you point an agent at one focused job and want it to ship:
- "Generate a 15-second product reel for our launch tomorrow."
- "Build me a 12-slide investor deck from this outline."
- "Run the cohort retention analysis on this CSV and produce a chart."
- "Scaffold a dashboard that polls this endpoint and graphs the latency."
For tasks like these, the bottleneck is rarely the task itself — it's the task scaffold. The agent needs to pick a framework, pin compatible versions, dodge the well-known footguns, wire the entrypoints the headless tooling expects, and then start on the actual work. By the time it has, half its turns and a third of its tokens are gone on plumbing the user didn't ask about.
A scaffolded skill pre-pays that cost as a CLI command. The agent
inherits a working project and a small, tuned AGENTS.md primer, then
spends its turns on the task. In the eval above that translated to
−27% cost, −28% output tokens, and −27% agent wall-clock vs. the
same agent starting from the best docs-as-skill alternative — with the
same first-attempt render success rate.
The pattern is not worth the overhead when:
- The user is iterating on a long-running, multi-feature codebase: scaffold-once doesn't help here; you want a long-lived CLAUDE.md/AGENTS.md instead.
- The boilerplate you'd need doesn't exist yet and isn't worth authoring: for a one-off, just hand-bootstrap.
- The task is dominated by genuine domain reasoning (e.g. "design our retry policy"), not by setup. Scaffolding doesn't help with judgement work.
Where this leaves things
The previous post ended with "encode the API surface once in code, don't make the agent rediscover it." This one ends with the same shape one layer up: encode the project surface once in code, don't make the agent rediscover that either.
The pattern that keeps working, across both posts: figure out the layer that's stable in your environment, encode it once as code, and let the agent spend its tokens on the part that genuinely varies. SDKs do this for API surfaces. skillpack does it for project surfaces. There's probably another layer above this — the task surface, whatever that turns out to mean. I'll write that one when I've shipped it.
skillpack on GitHub is the POC, its eval harness, and the iter-7 raw transcripts. It's a proof, not a project I have the time to maintain — I built it to test the claim above, the eval ran, the claim held up. If anyone wants to take it forward into something real, please do; open an issue and I'll happily hand over context, design notes, and the meta-skills already in the repo.