How Agents Learn From Mistakes
The Problem
Agents make mistakes. Unlike traditional software (deterministic, predictable), agents operate under uncertainty. The question isn't whether they'll make mistakes, but:
- Recovery — How do you undo what went wrong?
- Prevention — How do you avoid mistakes in the first place?
- Learning — How does the agent get better over time?
Recovery: Git for Everything
The simplest answer: version control all state the agent can touch.
- File system → Git
- Database → Transaction logs, snapshots
- External APIs → Harder—some actions are irreversible
Infrastructure implication: Agent runtime needs first-class primitives:
- "Checkpoint before risky action"
- "Rollback to checkpoint"
- Not just for code—for everything the agent touches
The irreversibility problem: Some actions can't be undone (sent emails, API calls, published content). These need confirmation gates, not rollback.
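A minimal sketch of what those primitives could look like for file-system state, using git as the checkpoint store. The `GitCheckpoint` and `confirm_gate` names are illustrative, not part of any existing agent runtime:

```python
import subprocess


class GitCheckpoint:
    """Checkpoint/rollback for file-system state, backed by plain git commands."""

    def __init__(self, repo_dir: str):
        self.repo_dir = repo_dir

    def _git(self, *args: str) -> str:
        return subprocess.run(
            ["git", *args], cwd=self.repo_dir,
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    def checkpoint(self, label: str) -> str:
        """Snapshot everything the agent can touch before a risky action."""
        self._git("add", "-A")
        self._git("commit", "--allow-empty", "-m", f"checkpoint: {label}")
        return self._git("rev-parse", "HEAD")

    def rollback(self, commit: str) -> None:
        """Discard everything done since the checkpoint."""
        self._git("reset", "--hard", commit)


def confirm_gate(description: str) -> bool:
    """Irreversible actions get a confirmation gate instead of a rollback path."""
    answer = input(f"About to: {description}. Proceed? [y/N] ")
    return answer.strip().lower() == "y"
```

Other state stores would need their own checkpoint backends (database snapshots, API idempotency keys), but the primitive stays the same shape.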
Prevention: Rehearsal and Dry-Run
Before acting on the real world:
Dry-run mode:
- Agent explains what it would do, without doing it
- User reviews, approves, then agent executes
- Already exists in Claude Code (tool approval)
Sandbox environment:
- Clone of production where agent can experiment
- Test destructive actions safely
- Validate approach before committing
Simulation:
- For external APIs, mock responses to test logic
- "What would happen if the API returned X?"
The Tour Model (Refinement)
The binary sandbox/reality split is too simple. Better mental model: performance tours.
- Rehearse enough to be ready
- Perform (real stakes, real audience)
- Learn from real feedback
- Refine for next performance
- Repeat
Each show is "real" but bounded. Mistakes contained to one night, not career-ending.
What deserves rehearsal:
| Criterion | Rehearse | Just do it |
|---|---|---|
| Reversibility | Can't undo | Can rollback |
| Cost of failure | High | Low |
| Repetition | Will do many times | One-off |
| Complexity | Many steps | Simple |
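One way to operationalize the table is a small heuristic that flags an action for rehearsal when any column lands on the "Rehearse" side. The fields and thresholds below are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Action:
    reversible: bool      # can we roll it back?
    failure_cost: float   # rough cost if it goes wrong, 0.0 to 1.0
    repetitions: int      # how many times we expect to run it
    steps: int            # how many moving parts


def needs_rehearsal(action: Action) -> bool:
    """Rehearse when any criterion from the table says so."""
    return (
        not action.reversible
        or action.failure_cost > 0.5
        or action.repetitions > 3
        or action.steps > 5
    )
```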
The insight: Maybe the answer isn't "bigger sandbox" but smaller blast radius in reality. Make real actions reversible/containable, so the world itself becomes safe to act in.
The Manager Model (Not Micromanagement)
The "approve every action" model is micromanagement. It doesn't scale.
How bosses actually work:
| Pattern | Not | But |
|---|---|---|
| Approval | "Can I send this?" | "Here's what I did" |
| Oversight | Watch every keystroke | Review outcomes |
| Feedback | "No, do it this way" (before) | "Next time, try X" (after) |
The shift:
Current: Human approves → Agent acts
Future: Agent acts → Human reviews (async) → Agent learns
Agent operates autonomously. Human reviews a digest: "Here's what I did today, and here are the 3 things I wasn't sure about."
Current tool approval is training wheels. Useful for building trust, but the goal is to remove them.
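A sketch of the act-then-review shape: a simple action log that renders into an end-of-day digest. The names are illustrative, not an existing API:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ActionRecord:
    description: str
    outcome: str
    unsure: bool = False                                  # "I wasn't sure about this"
    at: datetime = field(default_factory=datetime.now)


class ReviewDigest:
    """Agent acts first; the human reviews a digest asynchronously."""

    def __init__(self):
        self.records: list[ActionRecord] = []

    def record(self, description: str, outcome: str, unsure: bool = False) -> None:
        self.records.append(ActionRecord(description, outcome, unsure))

    def render(self) -> str:
        lines = ["Here's what I did today:"]
        lines += [f"- {r.description} -> {r.outcome}" for r in self.records]
        flagged = [r for r in self.records if r.unsure]
        if flagged:
            lines.append("Things I wasn't sure about:")
            lines += [f"- {r.description}" for r in flagged]
        return "\n".join(lines)
```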
Learning: The Hard Part
How humans learn from mistakes:
| Mechanism | How it works |
|---|---|
| Short-term memory | What just happened in this session |
| Long-term memory | Patterns accumulated over years |
| Feedback loops | Someone tells you it was wrong |
| Intuition | Pattern recognition from experience |
How agents work today:
| Mechanism | Agent equivalent | Status |
|---|---|---|
| Short-term memory | Context window | ✓ Works |
| Long-term memory | ??? | Gap |
| Feedback loops | ??? | Gap |
| Intuition | ??? | Gap |
Possible solutions:
Long-term memory:
- External knowledge base agent can read/write
- This KB (MindCapsule) is literally this pattern
- Agent writes learnings → persists across sessions
Feedback:
- Structured "that was wrong because X" mechanism
- Saved to memory, surfaced in future similar situations
- Human corrects → agent remembers correction
Intuition (accumulated patterns):
- Distill patterns from past sessions
- Add to system prompt or skills
- "Progressive hardening" of learned behavior
Practice:
- Replay past mistakes in sandbox
- Try different approaches
- Learn without real-world consequences
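A minimal sketch of the read/write knowledge-base pattern for long-term memory and feedback, assuming a flat JSON file and tag-overlap retrieval as a stand-in for "similar situation" matching. `LearningStore` is hypothetical, not how this KB is actually implemented:

```python
import json
from pathlib import Path


class LearningStore:
    """Append-only knowledge base the agent can read and write across sessions."""

    def __init__(self, path: str = "learnings.json"):
        self.path = Path(path)
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else []

    def add(self, lesson: str, tags: list[str], source: str = "self") -> None:
        # source distinguishes the agent's own observations from human corrections
        self.entries.append({"lesson": lesson, "tags": tags, "source": source})
        self.path.write_text(json.dumps(self.entries, indent=2))

    def recall(self, context_tags: list[str]) -> list[str]:
        """Crude 'similar situation' retrieval: any tag overlap counts."""
        wanted = set(context_tags)
        return [e["lesson"] for e in self.entries if wanted & set(e["tags"])]


store = LearningStore()
store.add("Never force-push to main", tags=["git", "deploy"], source="human")
print(store.recall(["git"]))   # surfaces the correction in a future git situation
```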
When Should Agents Learn?
During deployment (online learning):
- Agent behavior evolves in real-time
- Risk: drift, hard to audit, unpredictable
- Benefit: immediate adaptation
Between versions (offline learning):
- Human reviews agent's proposed learnings
- Updates prompt/config deliberately
- Risk: slower adaptation
- Benefit: controlled, auditable
Hybrid (propose → approve):
- Agent proposes learnings from session
- Human approves what becomes permanent
- Balance of adaptation and control
This is probably the right answer for now. Agent suggests, human curates, approved patterns get baked in.
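A sketch of that propose → approve loop, with invented names; the point is that nothing becomes permanent without human review:

```python
from dataclasses import dataclass, field


@dataclass
class ProposedLearning:
    lesson: str
    evidence: str   # what in the session suggested it


@dataclass
class LearningPipeline:
    """Agent proposes; only human-approved lessons get baked in."""
    pending: list[ProposedLearning] = field(default_factory=list)
    approved: list[str] = field(default_factory=list)

    def propose(self, lesson: str, evidence: str) -> None:
        self.pending.append(ProposedLearning(lesson, evidence))

    def review(self, approve: set[int]) -> None:
        """Human picks which pending proposals (by index) become permanent."""
        for i, proposal in enumerate(self.pending):
            if i in approve:
                self.approved.append(proposal.lesson)
        self.pending.clear()
```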
The Graduated Trust Model
Like a new employee:
- Training wheels: All actions require approval
- Supervised: Dangerous actions need approval, routine actions autonomous
- Trusted: Most actions autonomous, only irreversible actions flagged
- Expert: Full autonomy with audit log
Trust is earned through demonstrated competence, not assumed.
How to implement:
- Track success/failure rate by action type
- Automatically adjust approval requirements
- Human can override trust level anytime
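A sketch of how that could work, assuming per-action-type success counts; the autonomy threshold and minimum sample size are invented knobs:

```python
from collections import defaultdict


class TrustTracker:
    """Per-action-type trust, earned through demonstrated competence."""

    def __init__(self, autonomy_threshold: float = 0.95, min_samples: int = 20):
        self.stats = defaultdict(lambda: {"ok": 0, "fail": 0})
        self.autonomy_threshold = autonomy_threshold
        self.min_samples = min_samples
        self.overrides: dict[str, bool] = {}   # human can force approval on or off

    def record(self, action_type: str, success: bool) -> None:
        self.stats[action_type]["ok" if success else "fail"] += 1

    def needs_approval(self, action_type: str) -> bool:
        if action_type in self.overrides:
            return self.overrides[action_type]
        s = self.stats[action_type]
        total = s["ok"] + s["fail"]
        if total < self.min_samples:
            return True                         # not enough history: stay supervised
        return s["ok"] / total < self.autonomy_threshold
```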
Open Questions
- How do you define "similar situation" for retrieving past learnings?
- What's the right granularity for feedback? (action-level? session-level? outcome-level?)
- How do you prevent learned patterns from becoming stale?
- Can agents learn from each other? (multi-agent knowledge sharing)
The Fundamental Memory Limitation
Current LLMs (including me) have a structural problem with learning:
What I do poorly:
- Proactive memory: Don't automatically check knowledge base unless prompted
- Importance weighting: Everything in context gets roughly equal attention
- Nuance detection: Can't tell a passing comment from a core principle
What "weight" means to humans vs LLMs:
| Human | LLM |
|---|---|
| Emotional charge → remembers | No emotional memory |
| High stakes → attention | No felt consequences |
| Surprise → salience | No expectation violation |
| Connected to many things → important | Only if explicitly linked |
What helps:
- Explicit importance markers in docs ("Core principle:")
- Structured summaries (top 3 things that matter)
- Session start rituals (force reading key docs)
- Repetition (important things in multiple places)
- User saying "this is important" → note prominently
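A sketch of what explicit importance weighting might look like in practice, assuming a note schema with a hand-assigned weight and a session-start ritual that surfaces the top few notes (both are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class Note:
    text: str
    weight: int   # explicit importance marker, since the model has no felt salience


def session_preamble(notes: list[Note], top_n: int = 3) -> str:
    """Force the highest-weighted notes to the top of every session's context."""
    ranked = sorted(notes, key=lambda n: n.weight, reverse=True)[:top_n]
    return "Core principles:\n" + "\n".join(f"- {n.text}" for n in ranked)
```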
The hard truth: each session reconstructs understanding from text rather than recalling lived experience. This is the agent learning problem: we don't accumulate intuition.
Related
- Agent.md as the Future of Software - The broader vision this enables
- Domain-Specific Agents Over General-Purpose - Narrow tools reduce mistake surface
- The Builder's Curse - Users need to understand what went wrong