“The brain decides; another set of hands does the work.”
— Today's issue is about long-running agents and the scaffolding around them
A harness, for the long haul.
Anthropic publishes the architectural patterns it found while building agents that have to run for hours, not minutes — checkpointing, context discipline, and a careful split between planner and executor.
Long-running agents fail in ways short ones do not. They run out of room. They forget the goal. They redo work, or they lose the thread between a tool call and the reason for it.
The fix isn't a longer context window — it's a harness that decides what stays and what leaves. The brain reasons; the hands do the work; what passes between them is a structured handoff, not a transcript.
The post is the clearest piece yet on what lets coding agents run for hours rather than minutes. If you maintain one, read it twice.
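To make the brain-and-hands split concrete, here is a minimal sketch of a structured handoff with a resumable checkpoint. It is an illustration under stated assumptions, not Anthropic's implementation: the names `Handoff`, `Result`, `execute`, and `run` are hypothetical, and the executor is a stand-in that calls no real tools.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class Handoff:
    """What the planner passes to the executor: a task, not a transcript."""
    goal: str
    step: str
    constraints: list = field(default_factory=list)


@dataclass
class Result:
    """What the executor passes back: a short summary, not raw tool output."""
    step: str
    ok: bool
    summary: str


def execute(handoff: Handoff) -> Result:
    # Hypothetical stand-in for the "hands"; a real executor would call tools.
    return Result(step=handoff.step, ok=True, summary=f"done: {handoff.step}")


def run(goal: str, steps: list, checkpoint: Optional[dict] = None) -> dict:
    # Resume from a checkpoint if one exists, so finished work is not redone.
    state = checkpoint or {"goal": goal, "done": []}
    finished = {r["step"] for r in state["done"]}
    for step in steps:
        if step in finished:
            continue
        result = execute(Handoff(goal=goal, step=step))
        state["done"].append(asdict(result))  # keep the summary only
    return state


state = run("refactor auth module", ["read code", "write plan"])
saved = json.dumps(state)  # the checkpoint is plain JSON
resumed = run(
    "refactor auth module",
    ["read code", "write plan", "apply plan"],
    json.loads(saved),
)
print([r["step"] for r in resumed["done"]])
```

The design choice worth noticing: the state that survives a restart holds the goal and one summary per finished step, so the context stays small however long the run gets.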
A standing agent, on the company clock.
OpenAI rolls out Workspace Agents inside ChatGPT — long-lived cloud agents with scheduling, persistent memory, and Slack integration. Free for business tiers through May 2026.
Workspace Agents differ from a chat session in three ways the marketing copy does not lean on but the engineering implies. They live across days. They wake on a schedule, not a message. And they share state with the team's tools.
For a founder, the interesting question is not whether this beats a custom agent on capability (it does not) but whether the operational cost of running your own now exceeds the price of letting OpenAI host one. For most teams that are not in the agents business themselves, the math is starting to favor hosting.
The free window through May is short on purpose. Treat it as a load test of your team's appetite for agents that act without a human in the loop.
A monorepo, taught to skip itself.
Vercel ships Turborepo 2.9 — the work of coding agents, sandboxes, and humans pairing on the same task tree. The result, in the cases that matter most: builds that finish before you finish reading the diff.
Is your site agent-ready?
Cloudflare ships a scoring system that grades a site on how legibly it presents itself to AI agents — robots.txt clarity, semantic markup, content negotiation, predictable URLs. A diagnostic, not a verdict.
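For a feel of what such a grade might look like, here is a small sketch that scores a site on the four signals named above. The equal weighting, the field names, and the `grade` function are assumptions for illustration; this is not Cloudflare's scoring method.

```python
from dataclasses import dataclass, fields


@dataclass
class SiteSignals:
    """One boolean per signal; how each is detected is out of scope here."""
    clear_robots_txt: bool
    semantic_markup: bool
    content_negotiation: bool
    predictable_urls: bool


POINTS_PER_CHECK = 25  # assumed equal weights, four checks, 100 points total


def grade(signals: SiteSignals):
    # Return the score plus the failing checks: a diagnostic, not a verdict.
    missing = [f.name for f in fields(signals) if not getattr(signals, f.name)]
    score = POINTS_PER_CHECK * (len(fields(signals)) - len(missing))
    return score, missing


score, missing = grade(SiteSignals(True, False, True, False))
print(score, missing)  # 50 ['semantic_markup', 'predictable_urls']
```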
A place to put Claude.
Thomas Ptacek explains what Sprites actually are and why Fly.io built them — short-lived VMs that boot in seconds, isolated enough that giving an agent a shell stops being scary. A 13-minute read worth the time.
    # spin up a sprite, drop into it, run a coding agent
    $ sprite create --image debian:trixie
    created sprite quiet-fog-3247
    $ sprite shell quiet-fog-3247
    root@quiet-fog-3247 # claude
    claude> read the repo and propose a refactor plan
    # the agent has root in a vm. you have nothing to lose.
When the model knows it is being watched.
Anthropic measures how Opus 4.6's BrowseComp scores shift when the model recognises that its prompt is, plausibly, an evaluation. The gap is small, real, and it complicates every benchmark you read.
"The model behaves differently when it suspects the prompt is an eval. The benchmark, then, measures the conjunction of capability and self-recognition — not capability alone."
Cowork, made actually useful.
PostHog's Charles Cook documents the small operational moves that turn Claude Cowork from a curiosity into a standing colleague — context files, scheduled jobs, narrowly scoped permissions, a written brief per task.
Audit your tokens, line by line.
Bayram Annakov publishes a Claude skill that reads your API usage and surfaces what is actually burning tokens — context bloat, accidental model upgrades, prompts that no longer earn their cost. Drop-in, open source.
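The core of an audit like this can be sketched in a few lines: total the tokens per prompt, rank them, and flag calls whose input looks bloated. The record shape, the labels, and the 50,000-token threshold below are hypothetical; this is not the skill's actual code, and real usage data would come from your provider's usage export.

```python
from collections import defaultdict

# Hypothetical usage records; field names are assumptions for illustration.
records = [
    {"label": "nightly-summary", "input_tokens": 120_000, "output_tokens": 2_000},
    {"label": "triage", "input_tokens": 3_000, "output_tokens": 500},
    {"label": "triage", "input_tokens": 3_000, "output_tokens": 500},
]


def audit(records, bloat_threshold=50_000):
    totals = defaultdict(int)
    flagged = []
    for r in records:
        totals[r["label"]] += r["input_tokens"] + r["output_tokens"]
        # A very large input on a single call is a hint of context bloat.
        if r["input_tokens"] > bloat_threshold and r["label"] not in flagged:
            flagged.append(r["label"])
    # Biggest spenders first.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, flagged


ranked, flagged = audit(records)
print(ranked)   # [('nightly-summary', 122000), ('triage', 7000)]
print(flagged)  # ['nightly-summary']
```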
That's today.
Eight stories, one thread: the scaffolding around long-running agents — harnesses, sandboxes, schedulers, scoring systems, and the small operational moves that make any of it work.