Code Mode for a complex API: why a coding agent doesn't need MCP

Cloudflare's Code Mode post argued LLMs are better at writing code against a small SDK than at picking from a menu of MCP tools. The argument is general; it gets sharper when the agent's output is itself code.

monday.com/vibe is a coding agent that generates entire apps against monday's GraphQL API. We tested it both ways — once with monday's official MCP server attached (66 tools, ~137,000 characters of tool definitions), once with our typed Board SDK — and measured what each costs.

Same model. Same four tasks. Same staging board. Identical correct outcomes.

                                       SDK setup    MCP setup
Input tokens per task (mean)           15,626       ~158,000
Model steps per task                   1.0          4.0
Wall-clock per task                    ~9.5 s       ~26 s
Cost per task @ Gemini Pro pricing     $0.025       $0.210

8.4× the inference cost per task for the same answer. monday.com/vibe generates thousands of apps every day, each requiring many code-gen steps; compounded, the gap is millions of dollars a year. And as the rest of this post argues, the inference savings are the least important benefit of the SDK side.

The Code Mode argument, sharpened by code-generation

Cloudflare's framing in two sentences: when you give an LLM 50 MCP tools, the model reads every tool description on every step, parameterizes each call carefully, and reasons in tool-vocab. When you give it an SDK and a code execution environment, the LLM works in its native medium — JavaScript or TypeScript that it's seen ten million examples of. Output quality goes up, token spend goes down.

For an agent like Vibe, the argument compounds. Vibe doesn't just use an API — it generates code that uses the API, code that ships inside the user's app. The user's app at runtime has no MCP server, no mondayClient, no agent-time tool routing. Whatever the agent decided to do, the runtime code has to be standalone JavaScript talking to monday's GraphQL endpoint directly.

So even an MCP-attached agent has to eventually write fetch + GraphQL into the generated app. MCP tools are agent-time scratchpad, not runtime infrastructure. The comparison isn't "MCP at runtime vs. SDK at runtime" — it's "MCP as an agent-time reference manual vs. SDK as an agent-time grammar." A reference manual the agent reads cover-to-cover on every step versus a grammar it just writes in.

Why monday makes the gap especially wide

monday's GraphQL has three properties no general MCP can compress away.

Per-board schemas. Every customer's board has user-defined columns, identified in GraphQL by opaque IDs like color_mm3b9bgw and numeric_jjk44p2x. They're unique per board and meaningless to read.

With the SDK, the agent writes the column it cares about by name:

board.items().withColumns(["status", "budget"]).execute();

The SDK resolves "status" to color_mm3b9bgw and "budget" to numeric_jjk44p2x under the hood. The agent's code reads like the domain.
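
Under the hood, that resolution only needs a per-board map from readable names to opaque IDs. A minimal sketch, with the map hand-written here purely for illustration (the real SDK derives it from board metadata):

// Hypothetical per-board column map, for illustration only; the real
// SDK builds its mapping from board metadata, not a hand-written literal.
const columnMap = {
  status: { id: "color_mm3b9bgw", type: "status" },
  budget: { id: "numeric_jjk44p2x", type: "numbers" },
};

// Resolve human-readable names to the opaque per-board column IDs
// the GraphQL query actually needs.
const resolveColumnIds = (names) => names.map((name) => columnMap[name].id);

resolveColumnIds(["status", "budget"]);
// → ["color_mm3b9bgw", "numeric_jjk44p2x"]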

With MCP, the agent has none of that. It has to first call get_board_info, scan the column list to figure out which opaque ID is the status column for this board, then thread that ID through every subsequent query:

column_values(ids: ["color_mm3b9bgw", "numeric_jjk44p2x"]) { value text }

Every filter, every read, every mutation, every generated app — the agent re-derives the same mapping, and the runtime code ends up unreadable to anyone trying to maintain it later.

Column-value JSON shapes. monday's column values are JSON-encoded objects whose shape varies per column type, in ways that don't follow any one rule:

  • Status: { label: "Done" } — or { index: 2 } if you want speed at the cost of needing a label-to-index map
  • People: { personsAndTeams: [{ id: 4828557, kind: "person" }] } — id must be an integer (a string id is silently rejected); kind is required; mixing in a team uses the same array with kind: "team"
  • Date: { date: "YYYY-MM-DD" } — passing a Date object or ISO timestamp silently fails
  • Dropdown: { ids: [1, 3] } or { labels: ["infra","security"] } depending on caller intent
  • Timeline: { from: "YYYY-MM-DD", to: "YYYY-MM-DD" }
  • Location: { lat: "32.0853", lng: "34.7818", address: "..." } — lat/lng as strings, address optional

The SDK takes natural values typed by column. The agent writes:

board.item().create({
  name: "Audit Q2 access logs",
  status: "In Progress",
  owner: [4828557],
  dueDate: "2026-05-22",
  timeline: { from: "2026-05-18", to: "2026-05-22" },
}).execute();

…and the SDK encodes each value into the right JSON shape internally. On the read side it's the inverse: item.status is the string "Done", item.dueDate is a Date object, item.location is { lat, lng, address } — already parsed, no JSON.parse required.
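
A read-side sketch of what generated code gets back (the accessor names mirror the column names above and are illustrative, not the SDK's exact surface):

import BoardSDK from '@api/BoardSDK.js';

const board = new BoardSDK();
const { items } = await board.items()
  .withColumns(["status", "dueDate", "location"])
  .execute();

// Values arrive already decoded: no settings_str, no JSON.parse.
const first = items[0];
console.log(first.status);                // the plain string "Done"
console.log(first.dueDate.getFullYear()); // dueDate is a real Date object
console.log(first.location.address);      // lat, lng and address already split out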

With MCP, the agent has none of that. It builds each shape by hand from the schema docs, on every mutation in every task, then JSON.parses every column_values[].value on the read side. Every shape is a place to silently get it wrong: pass "Done" instead of { label: "Done" } and the API returns 200 OK with no effect.
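
For contrast, a sketch of what the same kind of create looks like when the shapes are hand-built for raw GraphQL (the status column ID is the real one from above; the other column IDs are placeholders for this board's actual ones):

// Hand-encoding column values for a raw create_item mutation. Every
// shape has to be looked up and spelled out, then the whole object is
// JSON.stringify'd into the JSON-typed column_values argument.
const columnValues = JSON.stringify({
  color_mm3b9bgw: { label: "In Progress" },        // status wants a label object, not a bare string
  people_placeholder: {                             // integer id, kind is required
    personsAndTeams: [{ id: 4828557, kind: "person" }],
  },
  date_placeholder: { date: "2026-05-22" },         // plain YYYY-MM-DD only
  timeline_placeholder: { from: "2026-05-18", to: "2026-05-22" },
});

const res = await fetch("https://api.monday.com/v2", {
  method: "POST",
  headers: { Authorization: process.env.MONDAY_API_TOKEN, "Content-Type": "application/json" },
  body: JSON.stringify({
    query: `mutation ($cv: JSON!) {
      create_item(board_id: 5002501676, item_name: "Audit Q2 access logs", column_values: $cv) { id }
    }`,
    variables: { cv: columnValues },
  }),
}).then((r) => r.json());

console.log(res.data.create_item.id);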

Status filters take an index, not the label. This one's nasty: you write board.items().where({ status: "Done" }) in the SDK and it works. Without the SDK you have to call get_board_info, pull out settings_str, parse the JSON, build a { "0":"Active", "1":"In Progress", "2":"Done", "3":"Stuck" } map, look up the index for "Done", and pass compare_value: ["2"] to query_params. Six lines of plumbing on every filter, in every generated app.
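
The lookup half of that plumbing, sketched (the settings_str literal here is hand-written from the label map above; in real code it comes back from the column query):

// Parse settings_str, then invert the index → label map to find the
// index that query_params filters actually want.
const settingsStr = '{"labels":{"0":"Active","1":"In Progress","2":"Done","3":"Stuck"}}';

const { labels } = JSON.parse(settingsStr);
const indexFor = (label) =>
  Object.entries(labels).find(([, l]) => l === label)?.[0];

console.log(indexFor("Done")); // "2", passed as compare_value: ["2"]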

Each of these is a place the SDK has encoded the answer once and forever. MCP rediscovers them every time.

We measured it

Four tasks against a real staging board with eight columns of varied weirdness:

  • Create an item with timeline + location + dropdown + link columns set
  • Bulk multi-update items filtered by status
  • Read items where both Timeline AND Location are populated
  • Return items whose timeline overlaps a given window

Both setups got monday's full surface. The MCP setup got the real @mondaydotcomorg/agent-toolkit/mcp package — 66 tools, ~137k chars of tool definitions, the same surface a Claude Desktop user would see — spawned via stdio with a fetch override routing requests to our staging endpoint. The SDK setup got our actual production system prompt: MONDAY_SDK_TYPES_PROMPT plus the output of buildBoardPrompt(...) — the exact ~1,500-line reference the production Vibe agent receives for any given board.

Both setups were asked to produce a JavaScript module that exports a default async solve() function. The runtime contract: that function will run inside a user's monday app, with no MCP available. (We verified post-hoc that the MCP setup's submitted code is raw fetch + GraphQL — no MCP tool calls leaked into runtime code.)

The full per-task table:

Task                                  Arm   Input tokens   Output tokens   Steps   Code chars
Create item w/ weird columns          sdk   15,657         730             1       1,885
Create item w/ weird columns          mcp   76,881         973             2       2,264
Bulk multi-update                     sdk   15,605         341             1       680
Bulk multi-update                     mcp   ~190,000       1,450           5       2,502
Read items w/ Timeline AND Location   sdk   15,632         322             1       898
Read items w/ Timeline AND Location   mcp   ~200,000       1,367           5       2,387
Timeline overlap query                sdk   15,611         700             1       1,818
Timeline overlap query                mcp   ~155,000       1,162           4       2,254

Where the MCP setup's tokens go: monday's MCP server ships ~34k tokens of tool definitions on every model step. With the agent averaging 4 steps per task (it loops: discover schema, write code, validate, sometimes retry), you accumulate 150–200k input tokens before the agent even submits a solution. The SDK prompt is also heavy (~12k tokens) but the agent reads it once, and it's stable across tasks — easily cached.

The dollar math at Gemini 2.5 Pro pricing (~$1.25/M input, $10/M output):

  • SDK setup: $0.025/task — 15,626 × $1.25/M + 523 × $10/M
  • MCP setup: $0.210/task — 158,000 × $1.25/M + 1,238 × $10/M
  • 8.4× per-task multiplier, compounded across many generation steps per app, across thousands of apps a day. The annual gap runs into the millions.

What each setup actually wrote

Read off disk from the eval's submit_solution payloads. Same task: bulk multi-update of items with status "Stuck", setting them to "In Progress" with effort 1.

SDK — 24 lines, 680 chars:

import BoardSDK from '@api/BoardSDK.js';

export default async function solve() {
  const board = new BoardSDK();

  const stuckItems = await board.items()
    .withColumns(["status"])
    .where({ status: "Stuck" })
    .execute();

  const updatedIds = [];
  for (const item of stuckItems.items) {
    const r = await board.item(item.id).update({
      status: "In Progress",
      effort: 1
    }).execute();
    updatedIds.push(r.id);
  }
  return updatedIds;
}

MCP — 80 lines, 2,502 chars (excerpted):

export default async function solve() {
  const MONDAY_API_TOKEN = process.env.MONDAY_API_TOKEN;
  const BOARD_ID = "5002501676";
  const STATUS_COL = "color_mm3b9bgw";
  const EFFORT_COL = "numeric_mm3bct1p";

  // 1. Look up the index for "Stuck" from settings_str
  const boardInfo = await fetch('https://api.monday.com/v2', {
    method: 'POST',
    headers: { Authorization: MONDAY_API_TOKEN, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: `query { boards(ids: ["${BOARD_ID}"]) {
        columns(ids: ["${STATUS_COL}"]) { settings_str } } }`,
    }),
  }).then(r => r.json());
  const settings = JSON.parse(boardInfo.data.boards[0].columns[0].settings_str);
  const stuckIndex = Object.entries(settings.labels)
    .find(([, v]) => v === 'Stuck')?.[0];

  // 2. Query items with status = Stuck (using the index)
  const itemsRes = await fetch('https://api.monday.com/v2', {
    method: 'POST',
    headers: { Authorization: MONDAY_API_TOKEN, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: `query { boards(ids: ["${BOARD_ID}"]) {
        items_page(limit: 500, query_params: { rules: [
          { column_id: "${STATUS_COL}",
            compare_value: [${stuckIndex}],
            operator: any_of }
        ]}) { items { id name } } } }`,
    }),
  }).then(r => r.json());

  // 3. For each item, JSON.stringify the column_values and mutate
  const updatedIds = [];
  for (const item of itemsRes.data.boards[0].items_page.items) {
    const cv = JSON.stringify({
      [STATUS_COL]: { label: 'In Progress' },
      [EFFORT_COL]: '1',
    });
    // ... mutation fetch, JSON.stringify of column_values again, error
    // handling, push id ...
  }
  return updatedIds;
}

Same outcome. The MCP code is verbose because every weird shape monday's GraphQL accepts has to be spelled out from first principles. The SDK code is short because the SDK encodes the answers.

Multiply that line count by every mutation, every filter, every aggregate in every generated app. The chars-of-code number is a proxy for surface area — surface area where the agent can silently pass "Done" instead of { label: "Done" } and ship code that no-ops in production.

Benefits the eval doesn't measure

Three things are true of the SDK setup's code that don't show up in token counts. Each is bigger than the inference-cost story.

Future-proof generated apps. monday's GraphQL evolves — deprecations, response-shape changes, new pagination tokens, column types that gain variants. Every Vibe app generated through MCP has the API surface baked into its source code. When monday ships a change, those apps break.

SDK-generated code calls board.items().withColumns([...]). The SDK absorbs API drift internally, so the user's app keeps working. Every app ever generated through the SDK stays correct as the API moves; every MCP-generated app is frozen against the API shape it was written against.

This isn't speculative — monday's GraphQL has shifted many times. items_page rolled out and pushed items into legacy. Status filters in query_params moved from label to index. The Timeline column's read shape changed. Every one of those would have silently broken raw-GraphQL apps generated before the change.
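
Mechanically, the reason is that the read path lives in one place. A hypothetical sketch of SDK internals (not the real Board SDK) to show where a change like items → items_page lands:

// Hypothetical SDK internals. The items() read path is assembled in one
// function, so when the API moved from the legacy items field to
// items_page, the fix was a change here, not in every generated app.
function buildItemsQuery(boardId, columnIds) {
  return `query {
    boards(ids: ["${boardId}"]) {
      items_page(limit: 500) {
        items {
          id
          name
          column_values(ids: ${JSON.stringify(columnIds)}) { id text value }
        }
      }
    }
  }`;
}

// Generated apps keep calling board.items(); only this builder changed.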

Edge-case hardening, free. The production Board SDK has years of accumulated handling for cases that look like one-liners but aren't: status writes that accept { label } or { index }; people IDs that must be integers, with string IDs silently rejected; dropdown filters that use { ids } vs { labels } depending on caller intent; date columns that round-trip timezones differently from how they accept them; pagination cursors that expire under load.

The agent writing SDK calls inherits all of that. The agent writing raw GraphQL has to rediscover each edge case the first time it bites a user — and "the first time it bites a user" is the worst possible time to discover it.
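
One concrete piece of that hardening, sketched as a hypothetical helper (not the SDK's actual code): a people-column encoder that coerces IDs to the integers the API insists on and fails loudly instead of shipping a silent no-op.

// Hypothetical people-column encoder. The API silently rejects string
// ids, so coerce to integers up front and throw rather than let the
// mutation return 200 OK with no effect.
function encodePeopleValue(ids) {
  return {
    personsAndTeams: ids.map((rawId) => {
      const id = Number(rawId);
      if (!Number.isInteger(id)) {
        throw new Error(`People column expects integer ids, got: ${rawId}`);
      }
      return { id, kind: "person" };
    }),
  };
}

encodePeopleValue(["4828557"]);
// → { personsAndTeams: [{ id: 4828557, kind: "person" }] }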

Context window is finite. Every byte of MCP tool definition is a byte the model can't spend on the actual task — long code files, longer reasoning, more context about the user's request. At ~34k tokens of tool defs on every step, the MCP setup permanently runs with ~17% less effective context window than the SDK setup. That tax shows up in the eval as "more steps to converge"; it's the same tax.

Each of these compounds the dollar story. The SDK's wins aren't just a smaller bill; they're durable. Generated apps stay correct, the edge cases the SDK handles stay handled, and the agent has more room to think about what actually needs thinking.

When MCP is the right tool

None of the above is an argument against MCP in general. MCP wins decisively when:

  • The agent acts at agent-time rather than generating code — Claude Desktop summarizing your docs, an analyst chat agent triaging tickets, a CI bot opening pull requests
  • The API surface is small — a dozen tools, not 66
  • The schema isn't user-defined per customer
  • There's no downstream code-generation use case

Same monday API, different agent shape: an MCP-attached chat agent answering "how many items are in my Stuck column right now?" is exactly what MCP is for. Read at agent-time, no runtime code, no schema-per-customer, finite tool surface for that one user.

The architectural question isn't "MCP or SDK." It's "what's the right primary surface for this agent's job?" For a coding agent generating per-customer apps against a complex API, the answer is the SDK. For an interactive agent calling the same API on behalf of one user at a time, the answer is MCP.

The rest of the answer

The SDK isn't the only piece. Even as the primary surface, it can't cover every API use case Vibe-generated apps need.

We run three layers in production:

  • Board SDK — the typed grammar above; what most generated app code calls
  • Skills — a registry of domain-specific helpers any team at monday can use to extend Vibe with their own SDKs
  • MAPI subagent — the last escape hatch. When raw GraphQL is needed, the main agent delegates to a specialized text-to-GraphQL subagent optimized for that one job. The coding agent never sees raw GraphQL itself

Each layer keeps cheap-tier work out of the expensive coding-agent loop. Same pattern as the previous post on static analysis: the rule is to keep the boring, deterministic, expensive-to-discover work in cheap loops, and reserve the expensive model for actual generation.

A follow-up post will walk through how the three layers together cover the long tail of the monday API. For now: if you're building a coding agent against a complex API, start with the SDK, not MCP. The inference bill is just the start of the math.
