Archly — System Design Mastery

#56 AI Agent Platform

Long-running multi-step LLM agents — durable workflow + sandboxed execution + LLM gateway. Brain / hands / state are independently replaceable. The agent run is a workflow, not a request.


Build the platform that powers long-running multi-step LLM agents: Devin, Manus, OpenAI Codex agents, Cursor agents, Claude Code background agents, Replit Agent, Lindy. Users describe a task; the platform spawns an agent that plans, calls an LLM, calls tools, executes code in a sandbox, captures results, and iterates, sometimes for minutes, sometimes for hours, sometimes (Cursor's Solid→React migration) for **three weeks**. The architectural pressure here is **not LLM inference** (that's an external dependency); it's **durability**, **isolation**, and **cost containment** under failure.

Three load-bearing concepts:

1. **An agent run is a durable workflow, not a request.** Brain (LLM gateway) / hands (sandbox) / state (orchestrator + history) are independently replaceable. When the sandbox host dies 90 minutes into a 2-hour Devin task, the workflow survives because Temporal's event history, not the sandbox, is the source of truth. (Cognition's published architecture; Temporal's Replit case study.)
2. **KV-cache-aware routing is not optional.** Manus's blog calls KV-cache hit rate "the single most important metric for a production-stage AI agent". Anthropic's prompt-caching read price is 10% of base, so a 90% cache hit rate cuts input-token cost by ~80% (0.9 × 0.1 + 0.1 × 1.0 = 0.19 of the uncached price, before any cache-write surcharge). A naïve OpenAI-protocol-compatible passthrough cannot do this; the gateway must understand prefix locality.
3. **Cost-runaway protection is enforcement, not alerting.** The published $47K LangChain incident (four agents in an unintended infinite loop for 11 days) is the canonical retro. Per-tenant kill-switches must fire at minute granularity, not at the daily billing job. Replit users reported $30/hr → $360/day before Replit's 2025 controls landed.
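
The cache arithmetic and the "prefix locality" requirement in the second concept can be made concrete with a small sketch. It is illustrative only: `blended_input_cost` and `pick_replica` are hypothetical names, the replica identifiers are made up, and rendezvous hashing is just one assumed way a gateway might keep requests that share a prompt prefix on the same backend. The cost function deliberately ignores cache-write surcharges and output tokens.

```python
import hashlib

# Cached input tokens billed at 10% of the base price (figure quoted in the text).
CACHE_READ_MULTIPLIER = 0.10


def blended_input_cost(hit_rate: float,
                       read_multiplier: float = CACHE_READ_MULTIPLIER) -> float:
    """Relative input-token cost versus no caching (1.0 means full price).

    Simplification: ignores cache-write surcharges and output tokens.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * read_multiplier + (1.0 - hit_rate)


def pick_replica(prompt_prefix: str, replicas: list[str]) -> str:
    """Choose a backend by rendezvous (highest-random-weight) hashing.

    Requests with the same stable prefix (system prompt + tool definitions)
    always map to the same replica, so a warm KV cache is likely to be reused.
    Removing an unrelated replica does not move this prefix elsewhere.
    """
    if not replicas:
        raise ValueError("no replicas available")

    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{replica}|{prompt_prefix}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


if __name__ == "__main__":
    saving = 1.0 - blended_input_cost(0.90)
    print(f"input-cost saving at 90% hit rate: {saving:.0%}")
    # Hypothetical replica names, for illustration only.
    print(pick_replica("SYSTEM: you are a coding agent", ["gw-a", "gw-b", "gw-c"]))
```

The design point is that the routing key is the stable prefix, not the tenant or the full prompt: anything that varies per step (timestamps, appended tool results) would have to stay out of the hashed portion for the routing to remain sticky.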

Reading: Cognition — Devin's 2025 Performance Review · Cognition — What We Learned Building Cloud Agents · Manus — Context Engineering for AI Agents (KV-cache hit rate) · Anthropic — Prompt Caching docs (5min/1h TTL) · Anthropic — Postmortem of Three Recent Issues (Aug/Sep 2025) · Anthropic — Claude Code sandboxing · Temporal — Replit Agent case study · Temporal — Of course you can build dynamic AI agents · Diagrid — Checkpoints Are Not Durable Execution · E2B — Firecracker vs QEMU (125 ms boot, 4 000 microVMs/host) · Modal — Top AI Code Sandbox Products 2025 · Simon Willison — The lethal trifecta for AI agents · OWASP — Top 10 for Agentic Applications (Dec 2025) · Anthropic — EscapeRoute MCP CVEs (CVE-2025-53109/53110) · Replit — Effort-based pricing recap (July 2025 billing bug) · OpenTelemetry — GenAI semantic conventions (gen_ai.operation.name)

durable execution (Temporal-style event-sourced workflow)

brain / hands / state separation

Firecracker microVM sandboxing

fork-from-snapshot

KV-cache-aware sticky LLM routing

Anthropic prompt caching (90% read discount)

MCP (Model Context Protocol) tool gateway

OAuth 2.1 PKCE for tool capabilities

per-tenant token-budget kill-switch

cost runaway circuit-breaker

SSE token streaming with resumable offsets

human-in-the-loop signal with timeout

indirect prompt-injection defense

snapshot+restore on sandbox crash

non-determinism error / workflow versioning
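
A minimal sketch of the per-tenant token-budget kill-switch named above, assuming the gateway calls a check before every LLM request. The class and method names (`KillSwitch`, `charge`, `reset`) and the one-minute window are assumptions for illustration, not part of any published design. The key property is that a tripped switch stays tripped until someone explicitly resets it, which is what distinguishes enforcement from alerting.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a tenant's budget is exhausted or its switch is tripped."""


@dataclass
class TenantBudget:
    limit_tokens: int          # tokens allowed per window
    window_seconds: float      # window length; 60 s gives minute granularity
    window_start: float
    used: int = 0
    killed: bool = False


class KillSwitch:
    def __init__(self, clock=time.monotonic):
        self._clock = clock    # injectable so tests can control time
        self._budgets: dict[str, TenantBudget] = {}

    def set_budget(self, tenant: str, limit_tokens: int,
                   window_seconds: float = 60.0) -> None:
        self._budgets[tenant] = TenantBudget(
            limit_tokens, window_seconds, self._clock())

    def charge(self, tenant: str, tokens: int) -> None:
        """Call before each LLM request with the estimated token count."""
        budget = self._budgets[tenant]
        if budget.killed:
            raise BudgetExceeded(f"{tenant}: kill-switch tripped")
        now = self._clock()
        if now - budget.window_start >= budget.window_seconds:
            budget.window_start, budget.used = now, 0   # start a new window
        if budget.used + tokens > budget.limit_tokens:
            budget.killed = True                        # latch: no auto-recovery
            raise BudgetExceeded(f"{tenant}: budget exceeded")
        budget.used += tokens

    def reset(self, tenant: str) -> None:
        """Explicit operator action to re-enable a tenant."""
        budget = self._budgets[tenant]
        budget.killed, budget.used = False, 0
        budget.window_start = self._clock()
```

In a real deployment the counters would live in shared storage rather than process memory, and the orchestrator would translate `BudgetExceeded` into pausing or cancelling the affected workflows; both are outside this sketch.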