How Agents Learn From Mistakes
The Problem
Agents make mistakes. Unlike traditional software (deterministic, predictable), agents operate under uncertainty. The question isn't whether they'll make mistakes, but:
- Recovery — How do you undo what went wrong?
- Prevention — How do you avoid mistakes in the first place?
- Learning — How does the agent get better over time?
Recovery: Git for Everything
The simplest answer: version control all state the agent can touch.
- File system → Git
- Database → Transaction logs, snapshots
- External APIs → Harder—some actions are irreversible
Infrastructure implication: Agent runtime needs first-class primitives:
- "Checkpoint before risky action"
- "Rollback to checkpoint"
- Not just for code—for everything the agent touches
The irreversibility problem: Some actions can't be undone (sent emails, API calls, published content). These need confirmation gates, not rollback.
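A minimal sketch of what those primitives could look like for file-system state, using git as the checkpoint store. The `GitCheckpoint` and `confirm_gate` names are illustrative, not part of any existing agent runtime:

```python
import subprocess


class GitCheckpoint:
    """Checkpoint/rollback for file-system state, backed by plain git commands."""

    def __init__(self, repo_dir: str):
        self.repo_dir = repo_dir

    def _git(self, *args: str) -> str:
        return subprocess.run(
            ["git", *args], cwd=self.repo_dir,
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    def checkpoint(self, label: str) -> str:
        """Snapshot everything the agent can touch before a risky action."""
        self._git("add", "-A")
        self._git("commit", "--allow-empty", "-m", f"checkpoint: {label}")
        return self._git("rev-parse", "HEAD")

    def rollback(self, commit: str) -> None:
        """Discard everything done since the checkpoint."""
        self._git("reset", "--hard", commit)


def confirm_gate(description: str) -> bool:
    """Irreversible actions get a confirmation gate instead of a rollback path."""
    answer = input(f"About to: {description}. Proceed? [y/N] ")
    return answer.strip().lower() == "y"
```

Other state stores would need their own checkpoint backends (database snapshots, API idempotency keys), but the primitive stays the same shape.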
Prevention: Rehearsal and Dry-Run
Before acting on the real world:
Dry-run mode:
- Agent explains what it would do, without doing it
- User reviews, approves, then agent executes
- Already exists in Claude Code (tool approval)
Sandbox environment:
- Clone of production where agent can experiment
- Test destructive actions safely
- Validate approach before committing
Simulation:
- For external APIs, mock responses to test logic
- "What would happen if the API returned X?"
The Tour Model (Refinement)
The binary sandbox/reality split is too simple. Better mental model: performance tours.
- Rehearse enough to be ready
- Perform (real stakes, real audience)
- Learn from real feedback
- Refine for next performance
- Repeat
Each show is "real" but bounded. Mistakes contained to one night, not career-ending.
What deserves rehearsal:
| Criterion | Rehearse | Just do it |
|---|---|---|
| Reversibility | Can't undo | Can rollback |
| Cost of failure | High | Low |
| Repetition | Will do many times | One-off |
| Complexity | Many steps | Simple |
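One way to operationalize the table is a small heuristic that flags an action for rehearsal when any column lands on the "Rehearse" side. The fields and thresholds below are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Action:
    reversible: bool      # can we roll it back?
    failure_cost: float   # rough cost if it goes wrong, 0.0 to 1.0
    repetitions: int      # how many times we expect to run it
    steps: int            # how many moving parts


def needs_rehearsal(action: Action) -> bool:
    """Rehearse when any criterion from the table says so."""
    return (
        not action.reversible
        or action.failure_cost > 0.5
        or action.repetitions > 3
        or action.steps > 5
    )
```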
The insight: Maybe the answer isn't "bigger sandbox" but smaller blast radius in reality. Make real actions reversible/containable, so the world itself becomes safe to act in.
The Manager Model (Not Micromanagement)
The "approve every action" model is micromanagement. It doesn't scale.
How bosses actually work:
| Pattern | Not | But |
|---|---|---|
| Approval | "Can I send this?" | "Here's what I did" |
| Oversight | Watch every keystroke | Review outcomes |
| Feedback | "No, do it this way" (before) | "Next time, try X" (after) |
The shift:
Current: Human approves → Agent acts
Future: Agent acts → Human reviews (async) → Agent learns
Agent operates autonomously. Human reviews a digest: "Here's what I did today, and here are the 3 things I wasn't sure about."
Current tool approval is training wheels. Useful for building trust, but the goal is to remove them.
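A sketch of the act-then-review shape: a simple action log that renders into an end-of-day digest. The names are illustrative, not an existing API:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ActionRecord:
    description: str
    outcome: str
    unsure: bool = False                                  # "I wasn't sure about this"
    at: datetime = field(default_factory=datetime.now)


class ReviewDigest:
    """Agent acts first; the human reviews a digest asynchronously."""

    def __init__(self):
        self.records: list[ActionRecord] = []

    def record(self, description: str, outcome: str, unsure: bool = False) -> None:
        self.records.append(ActionRecord(description, outcome, unsure))

    def render(self) -> str:
        lines = ["Here's what I did today:"]
        lines += [f"- {r.description} -> {r.outcome}" for r in self.records]
        flagged = [r for r in self.records if r.unsure]
        if flagged:
            lines.append("Things I wasn't sure about:")
            lines += [f"- {r.description}" for r in flagged]
        return "\n".join(lines)
```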
Learning: The Hard Part
How humans learn from mistakes:
| Mechanism | How it works |
|---|---|
| Short-term memory | What just happened in this session |
| Long-term memory | Patterns accumulated over years |
| Feedback loops | Someone tells you it was wrong |
| Intuition | Pattern recognition from experience |
How agents work today:
| Mechanism | Agent equivalent | Status |
|---|---|---|
| Short-term memory | Context window | ✓ Works |
| Long-term memory | ??? | Gap |
| Feedback loops | ??? | Gap |
| Intuition | ??? | Gap |
Possible solutions:
Long-term memory:
- External knowledge base agent can read/write
- This KB (MindCapsule) is literally this pattern
- Agent writes learnings → persists across sessions
Feedback:
- Structured "that was wrong because X" mechanism
- Saved to memory, surfaced in future similar situations
- Human corrects → agent remembers correction
Intuition (accumulated patterns):
- Distill patterns from past sessions
- Add to system prompt or skills
- "Progressive hardening" of learned behavior
Practice:
- Replay past mistakes in sandbox
- Try different approaches
- Learn without real-world consequences
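A minimal sketch of the read/write knowledge-base pattern for long-term memory and feedback, assuming a flat JSON file and tag-overlap retrieval as a stand-in for "similar situation" matching. `LearningStore` is hypothetical, not how this KB is actually implemented:

```python
import json
from pathlib import Path


class LearningStore:
    """Append-only knowledge base the agent can read and write across sessions."""

    def __init__(self, path: str = "learnings.json"):
        self.path = Path(path)
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else []

    def add(self, lesson: str, tags: list[str], source: str = "self") -> None:
        # source distinguishes the agent's own observations from human corrections
        self.entries.append({"lesson": lesson, "tags": tags, "source": source})
        self.path.write_text(json.dumps(self.entries, indent=2))

    def recall(self, context_tags: list[str]) -> list[str]:
        """Crude 'similar situation' retrieval: any tag overlap counts."""
        wanted = set(context_tags)
        return [e["lesson"] for e in self.entries if wanted & set(e["tags"])]


store = LearningStore()
store.add("Never force-push to main", tags=["git", "deploy"], source="human")
print(store.recall(["git"]))   # surfaces the correction in a future git situation
```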
When Should Agents Learn?
During deployment (online learning):
- Agent behavior evolves in real-time
- Risk: drift, hard to audit, unpredictable
- Benefit: immediate adaptation
Between versions (offline learning):
- Human reviews agent's proposed learnings
- Updates prompt/config deliberately
- Risk: slower adaptation
- Benefit: controlled, auditable
Hybrid (propose → approve):
- Agent proposes learnings from session
- Human approves what becomes permanent
- Balance of adaptation and control
This is probably the right answer for now. Agent suggests, human curates, approved patterns get baked in.
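A sketch of that propose → approve loop, with invented names; the point is that nothing becomes permanent without human review:

```python
from dataclasses import dataclass, field


@dataclass
class ProposedLearning:
    lesson: str
    evidence: str   # what in the session suggested it


@dataclass
class LearningPipeline:
    """Agent proposes; only human-approved lessons get baked in."""
    pending: list[ProposedLearning] = field(default_factory=list)
    approved: list[str] = field(default_factory=list)

    def propose(self, lesson: str, evidence: str) -> None:
        self.pending.append(ProposedLearning(lesson, evidence))

    def review(self, approve: set[int]) -> None:
        """Human picks which pending proposals (by index) become permanent."""
        for i, proposal in enumerate(self.pending):
            if i in approve:
                self.approved.append(proposal.lesson)
        self.pending.clear()
```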
The Graduated Trust Model
Like a new employee:
- Training wheels: All actions require approval
- Supervised: Dangerous actions need approval, routine actions autonomous
- Trusted: Most actions autonomous, only irreversible actions flagged
- Expert: Full autonomy with audit log
Trust is earned through demonstrated competence, not assumed.
How to implement:
- Track success/failure rate by action type
- Automatically adjust approval requirements
- Human can override trust level anytime
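A sketch of how that could work, assuming per-action-type success counts; the autonomy threshold and minimum sample size are invented knobs:

```python
from collections import defaultdict


class TrustTracker:
    """Per-action-type trust, earned through demonstrated competence."""

    def __init__(self, autonomy_threshold: float = 0.95, min_samples: int = 20):
        self.stats = defaultdict(lambda: {"ok": 0, "fail": 0})
        self.autonomy_threshold = autonomy_threshold
        self.min_samples = min_samples
        self.overrides: dict[str, bool] = {}   # human can force approval on or off

    def record(self, action_type: str, success: bool) -> None:
        self.stats[action_type]["ok" if success else "fail"] += 1

    def needs_approval(self, action_type: str) -> bool:
        if action_type in self.overrides:
            return self.overrides[action_type]
        s = self.stats[action_type]
        total = s["ok"] + s["fail"]
        if total < self.min_samples:
            return True                         # not enough history: stay supervised
        return s["ok"] / total < self.autonomy_threshold
```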
Open Questions
- How do you define "similar situation" for retrieving past learnings?
- What's the right granularity for feedback? (action-level? session-level? outcome-level?)
- How do you prevent learned patterns from becoming stale?
- Can agents learn from each other? (multi-agent knowledge sharing)
The Fundamental Memory Limitation
Current LLMs (including me) have a structural problem with learning:
What I do poorly:
- Proactive memory: Don't automatically check knowledge base unless prompted
- Importance weighting: Everything in context gets roughly equal attention
- Nuance detection: Can't tell a passing comment from a core principle
What "weight" means to humans vs LLMs:
| Human | LLM |
|---|---|
| Emotional charge → remembers | No emotional memory |
| High stakes → attention | No felt consequences |
| Surprise → salience | No expectation violation |
| Connected to many things → important | Only if explicitly linked |
What helps:
- Explicit importance markers in docs ("Core principle:")
- Structured summaries (top 3 things that matter)
- Session start rituals (force reading key docs)
- Repetition (important things in multiple places)
- User saying "this is important" → note prominently
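A sketch of what explicit importance weighting might look like in practice, assuming a note schema with a hand-assigned weight and a session-start ritual that surfaces the top few notes (both are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class Note:
    text: str
    weight: int   # explicit importance marker, since the model has no felt salience


def session_preamble(notes: list[Note], top_n: int = 3) -> str:
    """Force the highest-weighted notes to the top of every session's context."""
    ranked = sorted(notes, key=lambda n: n.weight, reverse=True)[:top_n]
    return "Core principles:\n" + "\n".join(f"- {n.text}" for n in ranked)
```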
The hard truth: each session reconstructs understanding from text rather than recalling lived experience. This is the agent learning problem: we don't accumulate intuition.
Related
- Agent.md as the Future of Software - The broader vision this enables
- Domain-Specific Agents Over General-Purpose - Narrow tools reduce mistake surface
- The Builder's Curse - Users need to understand what went wrong