Beyond LLMs: Anatomy of Agentic Systems - 2026 Edition Part II

From compound AI systems to the agentic loop, MCP, APIs, frameworks, and the rise of the Claws. A technical map of what it takes to build with AI agents in 2026.

May 15, 2026

This is Part 2 of a three-part series on building real products with AI. Part 1 covered the product thinking framework. This part goes into the technical layers. Part 3 connects it all back to defensibility and the value lifecycle.

In Part 1, the argument was that models alone aren't products. So what turns a model into something useful? This is where it gets technical.

Nobody Can Agree on What an Agent Is

There's a foundational quote from a 1994 paper by Michael Wooldridge and Nicholas Jennings that remains relevant:

> "Although the term is widely used by many people working in closely related areas, it defies attempts to produce a single universally accepted definition."

That was 1994. It's still true in 2026. OpenAI has a definition, Anthropic has one, Google has one, IBM has one. They all describe some form of autonomous system trying to reach an outcome, but the definitions differ.

What proves most useful is Anthropic's distinction: an agentic system is a distributed system composed of parts that derive autonomy and lead to outcomes. This separation of system from agent matters when building these things.

Agent definitions across major AI companies

Compound AI Systems

In February 2024, Matei Zaharia and colleagues at the Berkeley AI Research group published work on compound AI systems. The core idea: what you interact with when using ChatGPT isn't just a model. It's a system with many parts.

Models

The decision engine. Strong open source options (Llama, Mistral, DeepSeek) are approaching parity with closed models (GPT, Claude). You can run a 32GB model on your laptop and have meaningful intelligence locally.

Tools and Structured Output

Models don't call APIs directly. They output text that gets formatted into decisions. They were specifically trained to produce JSON, select tool schemas, and return structured parameters. The model is a selection brain: you give it options, it picks which function to call, and your application executes it.

Context and Retrieval

Semantic embeddings and vector search changed information discovery. Two words can be completely different in keyword space but nearly identical in semantic meaning. RAG (retrieval-augmented generation) brings external knowledge into the model's context window.

Context Engineering

How you fill the model's context window matters significantly. The model is lossy for content in the middle of the window (the "needle in a haystack" problem). What goes at the beginning and end receives more attention. Context engineering, managing that input window, affects quality as much as the model itself.

Model Customization

Parameter-efficient fine-tuning (LoRA, QLoRA) lets you specialize a model without training from scratch. You trim specific layers, add new ones, and get a model specialized in your domain.

MCP

Before MCP (Model Context Protocol), every application had to write its own tool schemas. Anthropic announced MCP in November 2024 and standardized how tools are discovered and described so that any application can query a server, get well-defined schemas, and pass them to the model. Within months, over 1,000 community MCP servers appeared.

Skills

Prompts accessible by semantic search. You learn something, encode it as a skill, and next time the system queries for a relevant prompt, it finds the best match. It works like how we learn: write it down, go back to it, eventually internalize it.

A2A

In April 2025, Google introduced the Agent-to-Agent (A2A) protocol for agents to communicate with each other. It supports long-lived streaming, authentication, mTLS, and identity. Different from MCP, which focuses on tool discovery.

Operations

Tracing, evaluation, security, safety. Models make many decisions per day. You need to know which tools they chose and whether those choices were right. Evaluation is often domain experts looking at output. Then you need guardrails for when the model does something unexpected, and red-teaming to stress-test before deployment.

The Agentic Loop

At the center of every agent system is a loop.

The model receives a query. It reasons about what to do, plans steps, and decides which tools to use. The application calls the tool, interacts with the environment (an API, a service, another application), then perceives the result, was it correct?, and loops back.

Each iteration is a turn. You can stop at each turn for human review (human in the loop), or let it run autonomously (human on the loop). The level of autonomy you grant depends on trust, and trust depends on evaluation, guardrails, tracing, and safety.

Delegation works the same way in human organizations. You delegate important tasks to people you trust. Before we delegate important decisions to models, we need the same level of confidence. That confidence comes from the compound system around the model, not from the model itself.

APIs: The Foundation Layer

When GPT first launched, the only interface was a chat window. Then came the APIs, and that opened the floodgates.

The Chat Completions API, released in March 2023, was OpenAI's first developer API. Simple, clean, easy to wrap application code around. Every framework built on this.

Then in March 2025, OpenAI released the Responses API and moved complexity server-side. Instead of every framework implementing tool calling, handoffs, and context management differently, the Responses API absorbed these into a single endpoint. MCP connections, reasoning loops, human-in-the-loop configuration, vector stores, file search. All in one API.

Anthropic released the Messages API with its own approach to identity and error handling. And at the end of 2025, Google shipped the Interactions API. Their reasoning: if you win the API layer, every framework uses you, every user comes to you.

Three major APIs now govern how we interface with models. It's a real competition.

To Framework or Not to Framework

A framework wraps the API and adds opinions. LangChain gives you chains and graphs. CrewAI defines agents as members of a crew. Haystack is strong at RAG pipelines. Google ADK integrates natively with A2A. OpenAI's Agents SDK has handoffs and guardrails built in. The list goes on, there are dozens.

Here's the surprising data point: the "State of AI Agents" report by Patronus AI and Berkeley, published in late 2025, found that 85% of people building agents in production don't use frameworks. They hit the API directly or build thin wrappers. Frameworks are useful for getting started and for learning, but many production systems outgrow them.

The argument mirrors the cloud-native debate. Some people want managed services and abstractions. Others want to control their own networking and hardware. It's a matter of taste and use case. Frameworks put opinions on how the pieces fit together, and sometimes that opinion doesn't match your requirements.

The Age of the Claws

Then came the Claws.

Claude Code, Anthropic's open source CLI released in early 2025, and tools like OpenClaw showed what happens when you give a model broad autonomy with access to tools. It triggered an explosion: NanoClaw (just calls the Claude SDK), ZeroClaw (performance-focused), PicoClaw (Rust, Raspberry Pi), IronClaw. Each finds an angle.

What these tools really demonstrated is that you can build the agentic loop fast if you understand networking, models, and tools. It's wrapper code on top of an agentic loop. Building the loop is the easy part. Making it secure, stable, and deployable is where everyone gets stuck.

That's where the industry is stuck right now. Everyone wants that level of autonomy. Not everyone has solved how to make it safe for production. Sandboxing, credential management, session isolation, audit trails, the boring but critical infrastructure that turns a demo into a product.

There's another reason these tools converged on similar stacks: TypeScript became the language of choice not just because it's popular, but because it's the "English of programming languages" for AI. Models have been trained on more TypeScript than almost anything else. It's also part of the loop, the model can modify its own codebase, which is harder with compiled languages.

The Harness: What Holds It All Together

We have covered a lot of components: models, tools, context, MCP, APIs, frameworks, the Claws. But components alone are not a system. Something has to orchestrate them at runtime, deciding what context enters the window, which tools are available, when to compact history, how to enforce safety, and when to stop. That something is the harness.

The formula is simple: Agent = Model + Harness. A raw model is a function: text in, text out. The harness is what makes it an agent: state, tool execution, feedback loops, enforceable constraints.

We got here in three steps. In 2022-2024, the focus was prompt engineering, crafting the right instructions. In 2025, it shifted to context engineering, managing everything the model sees. In 2026, it became harness engineering, building the full system around the model. Each layer subsumes the previous one.

Mitchell Hashimoto named it in February 2026. His framing was simple: every time an agent makes a mistake, engineer a permanent fix into the environment so it never happens again. An AGENTS.md rule. A custom linter. A script that catches a failure mode before the model sees it. Within weeks, OpenAI and Anthropic published their own takes, and the term stuck.

Phil Schmid offered a useful analogy: the model is the CPU, the context window is the RAM, and the harness is the operating system. The OS does not do the computation, but without it, the CPU is just silicon. The harness curates context, handles the boot sequence, provides standard drivers, and manages permissions. When things go wrong, it is usually the OS, not the CPU.

The harness is not the product. The product is the whole composition: model, enablers, differentiators, constraints. But the harness is what makes the enablers work together as a coherent system rather than a bag of parts.

Harness composition: builder harness vs user harness

Birgitta Böckeler drew a useful distinction. There is a builder harness, what ships with the tool: the context compaction strategy, the permission model, the session management, the safety layers. Claude Code's ASK/ALLOW/DENY system, Codex's sandbox policies, the doom-loop detection in OpenDev. These are architectural decisions baked into the runtime.

Then there is the user harness, what you add on top for your specific project and context. AGENTS.md files. Custom tools. Linters tuned to produce error messages that models can act on. Test suites that serve as feedback sensors. This is enablement in the language of Part 1, the petals that complete the whole product.

The builder harness changes rarely and is owned by the tool vendor. The user harness changes constantly and is owned by your team. Both are composed of the same types of concerns (context, safety, tools, memory), just at different layers of ownership.

The numbers back this up. Stanford's Meta-Harness paper found that harness design alone can cause up to a 6x difference in model performance on the same benchmark. Same model weights, same inference endpoint, different harness, wildly different results. LangChain proved it practically: changing only the harness moved their coding agent from outside the top 30 to the top 5 on TerminalBench 2.0. In the Hashline experiment, the same model jumped from 6.7% to 68.3% with zero weight changes. The Claws all prove this too. They use the same models. They differ in harness design. That is why they produce different outcomes.

In the whole product framework from Part 1: the model is the core. The compound system components (tools, context, MCP, ops) are the enablers. The harness is the orchestration that turns those enablers into a working system. And the user harness, the project-specific rules and tools you build up over time, is where your team's competitive edge accumulates.

What Comes Next

We have covered the technical stack and how it composes into a working system through the harness. All of this forms the features and activities layer of the whole product model from Part 1.

But features alone don't make a defensible product. In Part 3, we'll talk about what does: moats, distribution, case studies from NVIDIA, OpenAI, and Anthropic, and how to map the entire AI product evolution onto Moore's technology lifecycle.

The Technomist

Discussion about this post

Ready for more?