How Agents Learn From Mistakes

· agents, product-design

The Problem

Agents make mistakes. Unlike traditional software (deterministic, predictable), agents operate with uncertainty. The question isn't whether they'll make mistakes, but:

  1. Recovery — How do you undo what went wrong?
  2. Prevention — How do you avoid mistakes in the first place?
  3. Learning — How does the agent get better over time?

Recovery: Git for Everything

The simplest answer: version control all state the agent can touch.

  • File system → Git
  • Database → Transaction logs, snapshots
  • External APIs → Harder—some actions are irreversible

Infrastructure implication: Agent runtime needs first-class primitives:

  • "Checkpoint before risky action"
  • "Rollback to checkpoint"
  • Not just for code—for everything the agent touches

The irreversibility problem: Some actions can't be undone (sent emails, API calls, published content). These need confirmation gates, not rollback.
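
A minimal sketch of what those primitives could look like in an agent runtime, assuming a git-backed workspace; the ActionRunner and Action names, the reversible flag, and the confirmation callback are illustrative, not an existing API:

```python
import subprocess
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    description: str
    run: Callable[[], None]
    reversible: bool               # can this be rolled back after the fact?

class ActionRunner:
    """Checkpoint before risky actions; gate irreversible ones behind confirmation."""

    def __init__(self, confirm: Callable[[str], bool]):
        self.confirm = confirm     # e.g. prompt the user, or auto-approve trusted action types

    def checkpoint(self) -> str:
        # Snapshot the workspace as a git commit and return its hash.
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "--allow-empty", "-m", "agent checkpoint"], check=True)
        return subprocess.run(["git", "rev-parse", "HEAD"],
                              capture_output=True, text=True, check=True).stdout.strip()

    def rollback(self, ref: str) -> None:
        subprocess.run(["git", "reset", "--hard", ref], check=True)

    def execute(self, action: Action) -> None:
        if not action.reversible:
            # Sent emails, external API calls: confirmation gate, no rollback possible.
            if self.confirm(action.description):
                action.run()
            return
        ref = self.checkpoint()    # "checkpoint before risky action"
        try:
            action.run()
        except Exception:
            self.rollback(ref)     # "rollback to checkpoint"
            raise
```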

Prevention: Rehearsal and Dry-Run

Before acting on the real world:

Dry-run mode:

  • Agent explains what it would do, without doing it
  • User reviews, approves, then agent executes
  • Already exists in Claude Code (tool approval)

Sandbox environment:

  • Clone of production where agent can experiment
  • Test destructive actions safely
  • Validate approach before committing

Simulation:

  • For external APIs, mock responses to test logic
  • "What would happen if the API returned X?"

The Tour Model (Refinement)

The binary sandbox/reality split is too simple. Better mental model: performance tours.

  • Rehearse enough to be ready
  • Perform (real stakes, real audience)
  • Learn from real feedback
  • Refine for next performance
  • Repeat

Each show is "real" but bounded. Mistakes contained to one night, not career-ending.

What deserves rehearsal:

Criterion       | Rehearse           | Just do it
Reversibility   | Can't undo         | Can rollback
Cost of failure | High               | Low
Repetition      | Will do many times | One-off
Complexity      | Many steps         | Simple

The insight: Maybe the answer isn't "bigger sandbox" but smaller blast radius in reality. Make real actions reversible/containable, so the world itself becomes safe to act in.
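
One way to operationalize the rehearsal criteria above is a small scoring heuristic; the field names and thresholds are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class PlannedTask:
    reversible: bool       # can we roll back if it goes wrong?
    failure_cost: int      # 1 (trivial) .. 5 (severe)
    will_repeat: bool      # will the agent do this many times?
    step_count: int        # rough complexity

def should_rehearse(task: PlannedTask) -> bool:
    """Rehearse when the downside is large or the task will be repeated often;
    otherwise just do it in reality with a contained blast radius."""
    if not task.reversible and task.failure_cost >= 3:
        return True
    if task.will_repeat and task.step_count > 5:
        return True        # one rehearsal amortized over many performances
    return False

print(should_rehearse(PlannedTask(reversible=False, failure_cost=4,
                                  will_repeat=False, step_count=2)))   # True
```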

The Manager Model (Not Micromanagement)

The "approve every action" model is micromanagement. It doesn't scale.

How bosses actually work:

Pattern   | Not                           | But
Approval  | "Can I send this?"            | "Here's what I did"
Oversight | Watch every keystroke         | Review outcomes
Feedback  | "No, do it this way" (before) | "Next time, try X" (after)

The shift:

Current: Human approves → Agent acts
Future:  Agent acts → Human reviews (async) → Agent learns

Agent operates autonomously. Human reviews a digest: "Here's what I did today, and here are the 3 things I wasn't sure about."
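
A sketch of that digest as a data structure the runtime could assemble at the end of a run; the field names and the confidence threshold are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class CompletedAction:
    description: str
    outcome: str           # what actually happened
    confidence: float      # the agent's own estimate, 0..1

@dataclass
class DailyDigest:
    actions: list[CompletedAction] = field(default_factory=list)

    def needs_review(self, threshold: float = 0.7) -> list[CompletedAction]:
        # Surface only the items the agent was unsure about.
        return [a for a in self.actions if a.confidence < threshold]

digest = DailyDigest([
    CompletedAction("Renamed flaky test helpers", "merged", 0.95),
    CompletedAction("Rewrote retry logic in the sync job", "deployed", 0.55),
])
for item in digest.needs_review():
    print(f"Flagged for review: {item.description} ({item.confidence:.0%} confident)")
```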

Current tool approval is training wheels. Useful for building trust, but the goal is to remove them.

Learning: The Hard Part

How humans learn from mistakes:

Mechanism         | How it works
Short-term memory | What just happened in this session
Long-term memory  | Patterns accumulated over years
Feedback loops    | Someone tells you it was wrong
Intuition         | Pattern recognition from experience

How agents work today:

Mechanism         | Agent equivalent | Status
Short-term memory | Context window   | ✓ Works
Long-term memory  | ???              | Gap
Feedback loops    | ???              | Gap
Intuition         | ???              | Gap

Possible solutions:

Long-term memory:

  • External knowledge base agent can read/write
  • This KB (MindCapsule) is literally this pattern
  • Agent writes learnings → persists across sessions

Feedback:

  • Structured "that was wrong because X" mechanism
  • Saved to memory, surfaced in future similar situations
  • Human corrects → agent remembers correction
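
A minimal sketch of that write/recall path, assuming a JSON-lines knowledge-base file and tag-based retrieval; the schema is an assumption, and "similar situation" is reduced here to a shared tag:

```python
import json
from pathlib import Path

KB_PATH = Path("learnings.jsonl")    # hypothetical knowledge-base file

def record_correction(situation: str, mistake: str, correction: str, tags: list[str]) -> None:
    """Persist a structured 'that was wrong because X' entry across sessions."""
    entry = {"situation": situation, "mistake": mistake,
             "correction": correction, "tags": tags}
    with KB_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def recall(tag: str) -> list[dict]:
    """Surface past corrections when a similar (same-tag) situation comes up."""
    if not KB_PATH.exists():
        return []
    lines = [line for line in KB_PATH.read_text().splitlines() if line]
    return [e for e in map(json.loads, lines) if tag in e["tags"]]

record_correction(
    situation="Drafting release notes",
    mistake="Included internal ticket IDs",
    correction="Strip internal references before publishing",
    tags=["release-notes", "publishing"],
)
print(recall("publishing"))
```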

Intuition (accumulated patterns):

  • Distill patterns from past sessions
  • Add to system prompt or skills
  • "Progressive hardening" of learned behavior

Practice:

  • Replay past mistakes in sandbox
  • Try different approaches
  • Learn without real-world consequences

When Should Agents Learn?

During deployment (online learning):

  • Agent behavior evolves in real-time
  • Risk: drift, hard to audit, unpredictable
  • Benefit: immediate adaptation

Between versions (offline learning):

  • Human reviews agent's proposed learnings
  • Updates prompt/config deliberately
  • Risk: slower adaptation
  • Benefit: controlled, auditable

Hybrid (propose → approve):

  • Agent proposes learnings from session
  • Human approves what becomes permanent
  • Balance of adaptation and control

This is probably the right answer for now. Agent suggests, human curates, approved patterns get baked in.
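
A sketch of that propose-and-curate loop; the types and the approval callback are illustrative only:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedLearning:
    pattern: str     # e.g. "Run the test suite before committing"
    evidence: str    # which session or mistake it came from

def curate(proposals: list[ProposedLearning],
           approve: Callable[[ProposedLearning], bool]) -> list[str]:
    """Human curates; only approved patterns get baked into the prompt/config."""
    return [p.pattern for p in proposals if approve(p)]

proposals = [
    ProposedLearning("Run the test suite before committing", "yesterday's session"),
    ProposedLearning("Skip review for small changes", "yesterday's session"),
]
# In practice a human answers yes/no per proposal; here we approve only the first.
baked_in = curate(proposals, approve=lambda p: p is proposals[0])
print(baked_in)      # ['Run the test suite before committing']
```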

The Graduated Trust Model

Like a new employee:

  1. Training wheels: All actions require approval
  2. Supervised: Dangerous actions need approval, routine actions autonomous
  3. Trusted: Most actions autonomous, only irreversible actions flagged
  4. Expert: Full autonomy with audit log

Trust is earned through demonstrated competence, not assumed.

How to implement:

  • Track success/failure rate by action type
  • Automatically adjust approval requirements
  • Human can override trust level anytime
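
A sketch of per-action-type trust tracking under those rules; the sample-size and success-rate thresholds are invented for illustration:

```python
from collections import defaultdict

class TrustTracker:
    """Track success/failure per action type and decide whether approval is still needed."""

    def __init__(self, min_samples: int = 20, autonomy_threshold: float = 0.95):
        self.stats = defaultdict(lambda: {"ok": 0, "fail": 0})
        self.min_samples = min_samples
        self.autonomy_threshold = autonomy_threshold
        self.overrides: dict[str, bool] = {}    # human can force a trust level anytime

    def record(self, action_type: str, success: bool) -> None:
        self.stats[action_type]["ok" if success else "fail"] += 1

    def needs_approval(self, action_type: str) -> bool:
        if action_type in self.overrides:
            return self.overrides[action_type]
        s = self.stats[action_type]
        total = s["ok"] + s["fail"]
        if total < self.min_samples:
            return True                         # training wheels: not enough history yet
        return (s["ok"] / total) < self.autonomy_threshold

tracker = TrustTracker()
for _ in range(30):
    tracker.record("edit_file", success=True)
print(tracker.needs_approval("edit_file"))      # False: trust earned for this action type
print(tracker.needs_approval("send_email"))     # True: no history, still supervised
```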

Open Questions

  • How do you define "similar situation" for retrieving past learnings?
  • What's the right granularity for feedback? (action-level? session-level? outcome-level?)
  • How do you prevent learned patterns from becoming stale?
  • Can agents learn from each other? (multi-agent knowledge sharing)

The Fundamental Memory Limitation

Current LLMs (including me) have a structural problem with learning:

What I do poorly:

  • Proactive memory: Don't automatically check the knowledge base unless prompted
  • Importance weighting: Everything in context gets roughly equal attention
  • Nuance detection: Can't tell a passing comment from a core principle

What "weight" means to humans vs LLMs:

Human                                | LLM
Emotional charge → remembers         | No emotional memory
High stakes → attention              | No felt consequences
Surprise → salience                  | No expectation violation
Connected to many things → important | Only if explicitly linked

What helps:

  • Explicit importance markers in docs ("Core principle:")
  • Structured summaries (top 3 things that matter)
  • Session start rituals (force reading key docs)
  • Repetition (important things in multiple places)
  • User saying "this is important" → note prominently

The hard truth: Each session reconstructs understanding from text rather than recalling lived experience. This is the agent learning problem: we don't accumulate intuition.

Related