Most of the conversation about AI agents focuses on capability. What can the agent do? Can it reason through complex problems? Can it write production-quality code? Can it handle a full feature from spec to tests? These are reasonable questions. Capability matters. But they’re not the most important questions for someone trying to understand how their work actually changes.

The more important shift is temporal. Long-running agents change when work happens. Not how well it’s done, not whether human judgment is required, but the distribution of work across time. That’s the underrated change, and it compounds in ways that capability improvements don’t.

The capability framing

There’s a natural tendency to evaluate agents by asking what they can replace. Can this agent write code well enough to replace a junior developer? Can it debug independently? Can it design an architecture? This is the benchmark question: capability as a threshold, where below it you still need humans and above it you don’t.

This framing has the appeal of simplicity. You can test for it. You can point to specific outputs and evaluate their quality. You can construct a clear story about automation: as capability rises, human involvement falls. The substitution model is easy to think about.

But the substitution model misses something. Even a highly capable agent changes your work in a way the framing doesn’t capture. It doesn’t replace the work; it changes when the work is done and who is present for which parts of it. The agent isn’t a substitute for a person; it’s a way of extending the reach of a person’s judgment across time.

The persistence shift

Consider what it means to run an agent for eight hours overnight. You’ve approved a plan, the tests are locked in, the constraints are documented. The agent works through the night, commits at each stage, surfaces questions it can’t resolve. You wake up to a Git log, a set of completed tasks, and a short list of open questions requiring your attention.

What changed? Not the quality of your judgment. The decisions still required your expertise, your understanding of the system, your sense of what trade-offs were acceptable. But those decisions were made in the afternoon, not at 2am. The agent didn’t replace your judgment; it moved the exercise of that judgment to a different point in the day.

The work still happened. The implementation is real. But your presence during the implementation was no longer necessary, because the decisions that required your presence had been made, recorded, and locked before the implementation began. The agent is a way of front-loading human judgment so that it can be applied asynchronously.

This is a different kind of change than capability improvement. A more capable agent might do better work in those eight hours. A persistent agent does work in those eight hours that previously couldn’t happen at all, because you were asleep.

What extends across time

The key question isn’t “what can the agent do?” It’s “what decisions need to have been made before the agent starts?” Good async work depends on clarity at the handoff point. The spec needs to be precise enough that ambiguities don’t block implementation. The tests need to capture what done actually means. The constraints need to be explicit rather than tacit.
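
One way to make the handoff point concrete is a checklist that refuses to hand work off until the guiding artifacts actually exist. This is a minimal sketch under assumed conventions; the `Handoff` structure and its fields are hypothetical, not drawn from any real agent tooling:

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Artifacts that must exist before unattended execution begins (hypothetical)."""
    spec: str = ""                                      # precise statement of the requirement
    locked_tests: list = field(default_factory=list)    # tests that define what "done" means
    constraints: list = field(default_factory=list)     # explicit (not tacit) limits on the work

    def gaps(self) -> list:
        """Return the list of gaps that would block or distort an async run."""
        problems = []
        if not self.spec.strip():
            problems.append("spec is empty or vague")
        if not self.locked_tests:
            problems.append("no locked tests: 'done' is undefined")
        if not self.constraints:
            problems.append("constraints are tacit, not documented")
        return problems

h = Handoff(spec="Add rate limiting to the upload endpoint")
print(h.gaps())  # tests and constraints are still missing
```

The point of the sketch is the direction of the check: it validates the preparation, not the output, because by the time output exists the human may no longer be present.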

Long-running agents surface the cost of vague specifications in a way that synchronous work doesn’t. In synchronous development, a developer hits an ambiguous requirement and asks a clarifying question in thirty seconds. The cost of vagueness is low because feedback is immediate. In async work, the agent either makes an assumption (which might be wrong) or surfaces a blocker question (which pauses the work and waits for you to respond). The ambiguity that barely registers in synchronous work becomes a meaningful interruption in async work.

This means that what extends across time isn’t just execution capacity. It’s the quality of the artifacts that guided the execution: the spec, the plan, the locked test suite, the documented constraints. Work that was done at 4pm, while you were present, shapes what can happen at 2am while you’re not. The decisions you make during the synchronous phase determine the quality of the work in the async phase.

This is a different skill than the one required in purely synchronous development. It’s less about the moment of execution and more about the preparation that makes execution possible without you. Less about doing and more about enabling.

What doesn’t change

There’s a temptation, when observing what agents can do, to conclude that human judgment is becoming less important. The agent writes the code; I review it. The agent runs the tests; I check that they pass. If the agent is doing more of the work, surely it’s carrying more of the responsibility?

The experience of actually using long-running agents doesn’t match this intuition. What changes is not the importance of human judgment but the points at which it applies. You’re not present for the implementation, but you were present for the decisions that shaped the implementation. The spec you approved is the artifact of your judgment. The tests you locked are the expression of your understanding of what done means. The constraints you documented reflect your knowledge of the system.

When something goes wrong, the questions are still human questions. Did the spec capture the right requirements? Did the tests test the right things? Was the architectural approach sound? These are judgment calls, and a long-running agent doesn’t change that. It shifts the exercise of judgment toward earlier stages and away from the moment of implementation, but it doesn’t reduce the amount of judgment required.

Understanding the system remains important. Perhaps more so, because when you’re reviewing the results of overnight work, you need to evaluate not just what the agent produced but the assumptions it made along the way. Every ambiguity in a spec is a place where the agent made an interpretive choice. Every architectural decision it didn’t escalate is an assumption about what you would have chosen. Reviewing output tells you whether the code works. Reviewing assumptions tells you whether the agent understood the problem the same way you do.

This distinction matters because wrong assumptions compound. An agent that misreads a requirement early will produce internally consistent work that looks right but solves the wrong problem. The output passes tests, the code is clean, and the commit messages are sensible. But the premise was off. Catching that requires engaging with the reasoning, not just the result. It requires asking not only “does this work?” but “why did it work this way?”

The practical response is to make assumptions explicit — not just at the start, but continuously as the work unfolds. An agent that documents its assumptions to a file as it goes — what it interpreted the spec to mean, what trade-offs it chose, what it decided not to do, and what it decided later when the spec didn’t cover a case — gives you something concrete to review alongside the output. Each assumption is recorded with its impact: what changed in the implementation because of that interpretive choice. You don’t have to reverse-engineer the reasoning from the code. The reasoning is surfaced as an artifact, available for review at any point. When you find an incorrect assumption, you can trace exactly what it affected and what needs to change. The discipline of making assumptions visible is what turns review from quality control into genuine collaboration between human judgment and agent execution.
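
The assumptions file described above can be as simple as an append-only log in which each entry pairs an interpretive choice with its impact. A minimal sketch, with the file format and all names hypothetical:

```python
import json
from datetime import datetime, timezone

def log_assumption(path, assumption, impact):
    """Append one interpretive choice and what it changed (hypothetical format)."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "assumption": assumption,   # how the spec was interpreted
        "impact": impact,           # what changed in the implementation because of it
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def load_assumptions(path):
    """Read the log back for review alongside the output."""
    with open(path) as f:
        return [json.loads(line) for line in f]

log_assumption(
    "assumptions.jsonl",
    "Spec didn't say whether limits apply per user or per IP; chose per user",
    "Rate limiter keys on user_id, not request IP",
)
for entry in load_assumptions("assumptions.jsonl"):
    print(entry["assumption"], "->", entry["impact"])
```

Because each entry records its impact, finding one wrong assumption during review tells you exactly which parts of the implementation to re-examine.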

Correcting assumptions is also how the process improves over time. Each correction sharpens the artifacts: the spec gets more precise, the constraints get more explicit, the test suite captures more of what matters. The review phase isn’t just quality control on the output. It’s a feedback loop on the entire preparation-execution cycle. You can’t outsource the understanding. You can only change when you apply it.

The time structure of work

One of the things that changes with persistent agents is how you think about the structure of a working day. In synchronous development, work tends to be continuous: you sit down, you code, you stop when you need to stop. The working day is bounded by your presence. Work starts when you arrive and stops when you leave.

With long-running agents, the work structure becomes something more like a handoff pattern. There’s a setup phase, where you do the thinking and make the decisions and produce the artifacts that will guide the agent. There’s an execution phase, where the agent works and you’re doing other things. And there’s a review phase, where you return, assess what happened, and decide what comes next. Not a continuous flow, but a repeating cycle of preparation, execution, and assessment.

This doesn’t mean you work less. It means the work is differently distributed. The setup phase requires careful thinking, because vagueness at that stage propagates through the execution phase. The review phase requires genuine engagement: not just checking that the output works, but tracing the assumptions the agent made and deciding which ones need correcting. The execution phase is where the clock-hours go, but the human effort isn’t absent; it’s relocated.
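
The cycle reads naturally as a loop: prepare, run unattended, review, and fold corrections back into the artifacts before the next run. A schematic sketch, where every callable and the shape of a correction are hypothetical placeholders:

```python
def handoff_cycle(spec, constraints, run_agent, review):
    """One preparation-execution-review loop (run_agent and review are supplied by the caller)."""
    while True:
        result = run_agent(spec, constraints)   # execution phase: unattended
        corrections = review(result)            # review phase: human judgment applies here
        if not corrections:
            return result                       # the artifacts were sufficient
        # Each correction sharpens the artifacts that guide the next run.
        spec = spec + "\n" + "\n".join(
            c["spec_fix"] for c in corrections if "spec_fix" in c
        )
        constraints = constraints + [
            c["constraint"] for c in corrections if "constraint" in c
        ]
```

The structure makes the essay’s claim mechanical: human judgment appears only in `review` and in the artifacts it sharpens, never inside the execution step itself.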

For some kinds of work, this is a natural fit. Batch processing, multi-step implementation tasks, long test runs, research compilation: these are tasks where the structure of preparation, execution, and review maps well to how the work actually needs to happen. For other kinds of work, the latency introduced by the async cycle is friction. Fast iteration, exploratory coding, debugging: these depend on immediacy in a way that async patterns don’t serve as well.

Long-running agents are not a universal improvement to how software work gets done. They’re a shift in the time structure of work that suits certain kinds of tasks well and others less well. The judgment about which kind of task you’re doing, and which mode fits it, remains human.

The shape of change

What long-running agents change is not what the early framing predicted. That framing said capable enough agents would replace human tasks. The actual shift is more subtle. Agents persist across time in a way that humans can’t. They extend the reach of human decisions into periods when the human isn’t present. They change not the quality of work but the calendar of work.

What they don’t change is the need for human understanding, for sound judgment at the decision points, for careful preparation of the artifacts that guide execution. The judgment isn’t automated away. It’s relocated, applied earlier and in a more concentrated form, so that the work it guides can unfold without you.

That relocation matters. Work that took a full day when continuous presence was required might now take a morning of careful preparation, an overnight execution, and an hour of review the following day. Not less work, but differently distributed work. Not less judgment, but judgment applied at different moments.

The question for anyone trying to work well with long-running agents isn’t “what can the agent do?” It’s “what decisions need to be made before the agent starts, and how do I make them well?” The agents extend the time over which work can happen. The thinking that makes that extension valuable remains ours.