#56AI Agent Platform
Long-running multi-step LLM agents — durable workflow + sandboxed execution + LLM gateway. Brain / hands / state are independently replaceable. The agent run is a workflow, not a request.
Build the platform that powers long-running multi-step LLM agents — Devin, Manus, OpenAI Codex agents, Cursor agents, Claude Code background agents, Replit Agent, Lindy. Users describe a task; the platform spawns an agent that plans, calls an LLM, calls tools, executes code in a sandbox, captures results, and iterates — sometimes for minutes, sometimes for hours, sometimes (Cursor's Solid→React migration) for **three weeks**. The architectural pressure here is **not LLM inference** (that's an external dependency) — it's **durability**, **isolation**, and **cost containment** under failure. Three load-bearing concepts: 1. **An agent run is a durable workflow, not a request.** Brain (LLM gateway) / hands (sandbox) / state (orchestrator + history) are independently replaceable. When the sandbox host dies 90 minutes into a 2-hour Devin task, the workflow survives because Temporal's history is the source of truth. (Cognition's published architecture; Temporal's Replit case study.) 2. **KV-cache-aware routing is not optional.** Manus's blog calls KV-cache hit rate "the single most important metric for a production-stage AI agent" — Anthropic's prompt-caching read price is 10% of base, so a 90% cache hit rate cuts cost by ~80%. A naïve OpenAI-protocol-compatible passthrough cannot do this; the gateway must understand prefix locality. 3. **Cost runaway is enforcement, not alerting.** The published $47K LangChain incident (four agents in an unintended infinite loop, 11 days, $47K bill) is the canonical retro. Per-tenant kill-switches must fire at minute granularity, not at the daily billing job. Replit users have reported $30/hr → $360/day before their 2025 controls landed.
Reading: Cognition — Devin's 2025 Performance Review · Cognition — What We Learned Building Cloud Agents · Manus — Context Engineering for AI Agents (KV-cache hit rate) · Anthropic — Prompt Caching docs (5min/1h TTL) · Anthropic — Postmortem of Three Recent Issues (Aug/Sep 2025) · Anthropic — Claude Code sandboxing · Temporal — Replit Agent case study · Temporal — Of course you can build dynamic AI agents · Diagrid — Checkpoints Are Not Durable Execution · E2B — Firecracker vs QEMU (125 ms boot, 4 000 microVMs/host) · Modal — Top AI Code Sandbox Products 2025 · Simon Willison — The lethal trifecta for AI agents · OWASP — Top 10 for Agentic Applications (Dec 2025) · Anthropic — EscapeRoute MCP CVEs (CVE-2025-53109/53110) · Replit — Effort-based pricing recap (July 2025 billing bug) · OpenTelemetry — GenAI semantic conventions (gen_ai.operation.name)
durable execution (Temporal-style event-sourced workflow)
brain / hands / state separation
Firecracker microVM sandboxing
fork-from-snapshot
KV-cache-aware sticky LLM routing
Anthropic prompt caching (90% read discount)
MCP (Model Context Protocol) tool gateway
OAuth 2.1 PKCE for tool capabilities
per-tenant token-budget kill-switch
cost runaway circuit-breaker
SSE token streaming with resumable offsets
human-in-the-loop signal with timeout
indirect prompt-injection defense
snapshot+restore on sandbox crash
non-determinism error / workflow versioning