Agentic AI is, paradoxically, both the most over-hyped topic on board agendas today and the most underappreciated relative to the contribution it can make. I am often asked a simple question with a complicated answer: should we standardize on one powerful AI model for everything, or use many smaller models tuned to specific jobs?
The honest answer is neither extreme. The best-run companies across my think tank (the Executive Technology Board) do both — under one set of rules that keeps costs in check, protects the brand, and lets the business change direction without rebuilding everything. Here’s a practical framework I would recommend:
Start with outcomes, not models
Begin with the business result and the service levels you must hit — customer wait times, accuracy, compliance evidence, and cost per transaction. Work backward from there — this avoids technology debates and keeps decisions anchored to KPI improvements the board actually cares about.
My view is that if a task needs richer “thinking” across multiple steps (e.g., helping a customer handle billing, entitlements, and an address change), a general model often wins. If the task is narrow and repeatable (e.g., pulling fields off an invoice or sorting emails), a small, task-specific model is faster, cheaper, and more predictable.
Build for change (model mobility)
Assume your AI choices will change — because pricing, quality, and regulations will all change — and even if they don’t, your business needs will. Design the stack so you can swap models without redoing the plumbing. This isn’t just technical hygiene. It’s negotiation leverage and operational resilience. Two practical moves make that possible:
Keep a thin adapter between your applications and any given model so you can change vendors or versions with minimal rework. And put every “action” an agent can take—update a record, create a ticket, send a message—behind a simple registry with permissions and audit logs. Those two moves preserve speed today and optionality tomorrow.
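To make that concrete, here is a minimal Python sketch of the two moves, assuming a simple in-process design; the names (ModelAdapter, ActionRegistry) and the role-based permission scheme are illustrative placeholders, not any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol
import datetime


class ModelAdapter(Protocol):
    """Thin interface the application codes against; any vendor or version plugs in behind it."""
    def complete(self, prompt: str, **kwargs) -> str: ...


@dataclass
class ActionRegistry:
    """Every action an agent can take is registered with permissions and an audit trail."""
    actions: dict[str, Callable[..., object]] = field(default_factory=dict)
    permissions: dict[str, set[str]] = field(default_factory=dict)  # action name -> allowed roles
    audit_log: list[dict] = field(default_factory=list)

    def register(self, name: str, fn: Callable[..., object], roles: set[str]) -> None:
        self.actions[name] = fn
        self.permissions[name] = roles

    def invoke(self, name: str, caller_role: str, **kwargs):
        # Permission check first, then execute, then log: the agent never calls systems directly.
        if caller_role not in self.permissions.get(name, set()):
            raise PermissionError(f"{caller_role} may not invoke {name}")
        result = self.actions[name](**kwargs)
        self.audit_log.append({
            "action": name,
            "role": caller_role,
            "args": kwargs,
            "at": datetime.datetime.utcnow().isoformat(),
        })
        return result
```

Because applications only see the adapter interface and the registry, swapping a model vendor or tightening an action's permissions is a configuration change, not a rebuild.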
Ground everything in your facts and systems
Ungrounded AI is a demo, not a production system. In my mind, three things separate the two. First, answers must be tied to vetted data (product facts, policies, records) so outputs are anchored in truth. Second, actions must flow through systems of record with full traceability to protect safety and brand. Third, policy checks for privacy, regional rules, and restricted content should run automatically, in line with the workflow, not as after-the-fact reviews.
Grounded design protects the brand, reduces rework, makes compliance evidence a by-product, and turns AI from a clever tool into dependable infrastructure.
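For illustration only, a minimal sketch of what "grounded, with in-line policy checks" can mean in code; the retrieval, privacy, and residency rules below are deliberately toy placeholders standing in for real retrieval and policy engines.

```python
def retrieve_vetted_facts(question: str, knowledge_base: dict[str, str]) -> list[str]:
    """Pull only from the approved knowledge base (product facts, policies, records)."""
    return [text for key, text in knowledge_base.items() if key.lower() in question.lower()]


def passes_policy_checks(draft_answer: str, region: str) -> bool:
    """Automated, in-line checks instead of after-the-fact review (placeholder rules)."""
    contains_pii = "ssn" in draft_answer.lower()                              # toy privacy rule
    restricted = region == "EU" and "transfer data to US" in draft_answer     # toy residency rule
    return not (contains_pii or restricted)


def answer_customer(question: str, knowledge_base: dict[str, str], region: str) -> str:
    facts = retrieve_vetted_facts(question, knowledge_base)
    if not facts:
        return "Escalate to a human: no vetted source available."
    draft = f"Based on our records: {facts[0]}"   # the model call would draft from these facts
    if not passes_policy_checks(draft, region):
        return "Escalate to a human: policy check failed."
    return draft
```

The point of the sketch is the ordering: no answer leaves the workflow without a vetted source and a passed policy check, and failures escalate rather than improvise.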
Measure continuously
Accuracy isn’t a one-time certificate; it’s a cadence. Each workflow should ship with a compact golden test set of real cases that never changes, so you can benchmark apples-to-apples over time. Any prompt or model change should start as a canary on a small slice of traffic, with automatic regression checks so quality and latency don’t quietly degrade. And you should have a rollback plan that’s fast and practiced. This operating rhythm keeps surprises out of customer journeys and out of board meetings.
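A minimal sketch of that cadence in code, assuming a hypothetical invoice-extraction workflow and illustrative thresholds; real golden sets and canary routing would live in your evaluation and serving infrastructure.

```python
import random

GOLDEN_SET = [  # frozen real cases: never edited in place, so results stay comparable over time
    {"input": "Invoice #123, total $410.20", "expected_total": "410.20"},
    {"input": "Invoice #124, total $88.00", "expected_total": "88.00"},
]


def regression_check(model_fn, min_accuracy: float = 0.95) -> bool:
    """Block any prompt or model change that drops below the agreed quality bar."""
    correct = sum(1 for case in GOLDEN_SET if model_fn(case["input"]) == case["expected_total"])
    return correct / len(GOLDEN_SET) >= min_accuracy


def route_request(request, current_model, candidate_model, canary_share: float = 0.05):
    """Send a small, fixed share of traffic to the candidate; rollback is setting the share to 0."""
    model = candidate_model if random.random() < canary_share else current_model
    return model(request)
```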
A simple decision rule that reduces noise
Give teams a lightweight rule so every project doesn’t relitigate the basics. If the work is complex and conversational, start with a general model; migrate to a smaller one only when it achieves the same quality at lower cost. If the work is narrow and structured, start small; strengthen outputs with validators and simple rules, and route edge cases for human review. Respect hard constraints along the way: sub-second SLAs tend to favor smaller models or distillations; strict data-residency rules favor models that run where your data lives; high traceability needs lean toward more deterministic approaches.
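Teams can encode the rule itself so it is applied the same way on every project; the sketch below is one possible shape, with the task attributes and the sub-second threshold as stated assumptions rather than a standard.

```python
from dataclasses import dataclass


@dataclass
class Task:
    conversational: bool             # multi-step, open-ended dialogue?
    structured: bool                 # narrow, repeatable extraction or classification?
    latency_budget_ms: int           # SLA for a single response
    data_must_stay_in_region: bool   # data-residency constraint
    needs_strict_traceability: bool  # audit or regulatory traceability requirement


def choose_model_class(task: Task) -> str:
    # Hard constraints first, then the general-vs-small default.
    if task.latency_budget_ms < 1000 or task.needs_strict_traceability:
        return "small task-specific model (or distillation) with validators"
    if task.data_must_stay_in_region:
        return "model deployable where the data lives"
    if task.conversational and not task.structured:
        return "general model; migrate only once a smaller one matches quality at lower cost"
    return "small task-specific model, with edge cases routed to human review"
```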
Govern with a one-page inventory (AIBOM)
Keep a living, one-page “AI Bill of Materials” for every production workload. It lists:
- What models are in use (general vs. task-specific)
- Which data they rely on (owner, retention, consent)
- Which actions they’re allowed to take (with permissions and logs)
- What thresholds define acceptable quality and response times
- What controls are in place (privacy, security, audit)
- What the exit plan is (your right to export logs/tests and switch vendors)
This gives Procurement, Security, Data, and Engineering a single source of truth. It also turns portability and telemetry access into contract terms, not aspirations.
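One way to keep that inventory honest is to store each entry as a structured record under version control, where it can be reviewed and checked like any other artifact; the sketch below uses illustrative field names and a hypothetical billing workload.

```python
from dataclasses import dataclass


@dataclass
class AIBOMEntry:
    workload: str
    models: list[str]                     # general vs. task-specific, with versions
    data_sources: dict[str, str]          # source -> owner, retention, consent notes
    allowed_actions: list[str]            # registered actions, each with permissions and logs
    quality_thresholds: dict[str, float]  # acceptable quality and response times
    controls: list[str]                   # privacy, security, audit controls in place
    exit_plan: str                        # rights to export logs/tests and switch vendors


billing_agent = AIBOMEntry(
    workload="billing-support-agent",
    models=["general-llm-v3 (conversation)", "invoice-extractor-small-v1"],
    data_sources={"billing_records": "owner: Finance; retention: 7y; consent: contractual"},
    allowed_actions=["update_record", "create_ticket", "send_message"],
    quality_thresholds={"accuracy": 0.95, "p95_latency_ms": 800},
    controls=["PII redaction", "regional routing", "full audit logging"],
    exit_plan="Quarterly export of logs and golden tests; 90-day vendor switch plan",
)
```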
Price to “resolved outcomes,” not tokens
Boards don’t buy tokens; they buy results. Tie scale-up funding to verified movement in the metrics that move cash: touch time removed, right-first-time, cycle time, DSO/DPO, deflection rate, CSAT. Express economics as cost per resolved outcome — a solved case, a completed transaction, a deflected contact — then manage three buckets:
- Fixed costs: orchestration, data indexing, monitoring, and testing
- Variable costs: actual inference, downstream system calls, and any human QA
- Risk costs: incidents, re-tuning, and regulator evidence packs
This framing keeps investment disciplined and impact visible.
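As a worked example with invented numbers, purely to show the arithmetic:

```python
fixed_costs = 40_000      # monthly: orchestration, data indexing, monitoring, testing
variable_costs = 25_000   # monthly: inference, downstream system calls, human QA
risk_costs = 5_000        # monthly: incidents, re-tuning, regulator evidence packs

resolved_outcomes = 50_000  # solved cases + completed transactions + deflected contacts

cost_per_resolved_outcome = (fixed_costs + variable_costs + risk_costs) / resolved_outcomes
print(f"${cost_per_resolved_outcome:.2f} per resolved outcome")  # $1.40
```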
What every board should ask
AI is now on the board agenda at nearly every company I work with. Amidst the noise and clutter of Agentic AI, here are five questions that will serve every board well:
- Outcome clarity: Which KPI and SLA is each AI workflow on the hook to improve this quarter?
- Grounding: Is every workflow tied to our approved data and systems, with privacy and regional rules enforced automatically?
- Mobility: How quickly can we switch models or vendors if price, quality, or policy changes — and do we have contract rights to the telemetry and test data to make that practical?
- Quality cadence: Do we have golden test sets, canaries, and rollbacks in place — and who signs off on changes?
- Unit economics: What is our cost per resolved outcome today, and what’s the target for the next two quarters?
Bottom line: Don’t pick a “one model” or “many models” religion. Build a portfolio that earns the right to use a powerful general model where it matters and proves the case for small models where they win — under one fabric of grounding, measurement, and portability. When you do that, the AI conversation becomes straightforward: less about model hype, more about reliable improvements to the P&L.
