A practical decision framework from AI, Honestly EP001 — free for you and your team.
A practical guide for organizations deploying AI — built around one question: how do you know when to trust it?
Amazon's SVP sent a mandatory email to his engineering org in March 2026. The subject, in plain language: AI-assisted code had been causing high-stakes outages, and they hadn't built the right protocols for when to trust it and when to add a human checkpoint.
Amazon is not a cautionary tale about AI being bad. It is a clear signal that organizations deploying AI without a trust calibration framework end up building one after an incident instead of before.
This document is a starting point. It is not a compliance framework or a regulatory guide. It is a set of five components that any organization can implement — the equivalent of the aviation industry's checklist system, built for AI-assisted decisions. Use it as a foundation and adapt it to your context.
Sort every AI-assisted decision before deployment, not after the first incident.
The first job is simple but almost nobody does it: classify every decision your organization might make with AI assistance by the stakes involved and how reversible the outcome is. This determines which tier of human oversight is required.
Amazon was implicitly treating AI-assisted code from junior engineers as Tier 1 when those decisions should have been Tier 2. The fix, requiring senior engineer sign-off on AI-assisted code, is a Tier 2 classification applied retroactively. The framework exists to make that classification before deployment, not after the outage.
For any given AI use case, answer these six questions before deployment.
Tier classification tells you what oversight level is needed. Trust Calibration Criteria tell you whether your current evidence base actually justifies the trust level you're extending. These should be answered by the people deploying the AI — not assumed.
| Question | Why it matters | Red flags |
|---|---|---|
| What is the evidence base for this AI's performance in this specific domain? | General capability benchmarks don't predict performance on your specific data and decisions. | "It performs well generally" without domain-specific testing |
| What are the known failure modes, and how does the system signal when it's in one? | AI systems that perform confidently in failure modes are the most dangerous. The warning sign is the gap between how confident the system sounds and how reliable it actually is. | No known failure characterization; system provides no uncertainty signal |
| What is the blast radius if this AI is wrong? | Small errors in low-blast-radius decisions are learning opportunities. Large errors in high-blast-radius decisions are incidents. | Blast radius is large, poorly defined, or dependent on downstream systems |
| How reversible is the decision if the AI output is wrong? | Reversibility determines how much tolerance you have for error, and how quickly you need to catch mistakes. | Decision is irreversible or reversal is costly and time-consuming |
| Who on the human side is equipped to evaluate the AI's output? | "User error" is often "system deployed to users who couldn't evaluate it." The human checkpoint is only meaningful if the human can actually check. | Human reviewers lack the expertise to meaningfully evaluate AI output |
| How will you know if trust calibration needs to change? | Trust calibration is not a one-time decision. It requires a feedback loop to update as conditions change. | No monitoring, no incident reporting, no scheduled review |
Define exactly when a human takes control — before anyone needs to do it under pressure.
The aviation industry's single most important contribution to autopilot safety wasn't the autopilot. It was the precise definition of when the pilot is flying and when the machine is flying — and exactly what the handoff between those states looks like.
Your AI deployments need the same thing. The override protocol answers three questions:
Define specific triggers — not "when something seems wrong" but precisely: when output falls outside a defined confidence range, when certain data conditions are present, when a decision crosses a defined threshold. Make it concrete enough that any person in the role can apply it without judgment calls under pressure.
Override authority should sit with the person who has both the expertise to evaluate the AI's output and the organizational standing to act on that evaluation. This is often not the same person using the AI. Define it explicitly, including escalation paths.
When a human takes control from an AI, what happens? Is there a documented transition? Does the AI system log the override? Is there a review of why the override happened? This is how you turn individual override events into institutional learning.
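The three parts of the override protocol can be sketched in code. This is a minimal illustration, not a prescribed implementation: the thresholds (`min_confidence`, `max_amount`), the trigger names, the `senior_reviewer` role, and the `OverrideEvent` fields are all hypothetical placeholders that each deployment would define for itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class OverrideEvent:
    """One logged handoff from the AI to a human (field names are hypothetical)."""
    timestamp: str
    trigger: str
    authority_role: str


def override_trigger(confidence: float, amount: float, missing_fields: int,
                     min_confidence: float = 0.80,
                     max_amount: float = 10_000.0) -> Optional[str]:
    """Return the name of the first trigger that fires, or None if none do.

    The threshold defaults are placeholders, not recommendations.
    """
    if confidence < min_confidence:
        return "confidence_below_range"          # output outside the confidence range
    if missing_fields > 0:
        return "data_condition_missing_fields"   # a defined data condition is present
    if amount > max_amount:
        return "decision_threshold_exceeded"     # decision crosses a defined threshold
    return None


def hand_off(confidence: float, amount: float, missing_fields: int,
             authority_role: str = "senior_reviewer") -> Optional[OverrideEvent]:
    """Record who takes control and why, so the override can be reviewed later."""
    trigger = override_trigger(confidence, amount, missing_fields)
    if trigger is None:
        return None
    return OverrideEvent(
        timestamp=datetime.now(timezone.utc).isoformat(),
        trigger=trigger,
        authority_role=authority_role,
    )
```

Because every trigger is a plain comparison, anyone in the role can apply it without a judgment call, and each `OverrideEvent` gives the later review a record of when and why control changed hands.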
Define how you'll assign responsibility before the incident happens — not after.
When an AI-assisted decision leads to a bad outcome, there are three possible attributions: AI error, human error, or system design error. How you attribute the failure determines whether you build a protocol or just blame the person who was closest to it.
Amazon's "user error, not AI" framing on their Kiro tool incidents is an example of what happens without this framework. When you attribute failures to user error, you repair the user behavior. When you attribute them to AI error, you repair the system. When you attribute them to system design, you repair the deployment model. Getting the attribution right is how you learn the right lesson.
| Attribution Type | Definition | Response |
|---|---|---|
| AI Error | The AI produced an output that was incorrect, incomplete, or confidently wrong on a well-defined task within its specified domain. | Update model, adjust training data, flag failure mode in calibration criteria, raise tier if needed |
| Human Error | The AI produced a reasonable output. The human reviewer failed to evaluate it correctly, given the tools and expertise available to them. | Review training, review access to appropriate expertise, assess whether human oversight was meaningful |
| System Design Error | The AI produced output it was designed to produce. The human reviewer behaved as expected. The system was deployed in a way that made failures likely — wrong tier, missing override protocol, mismatched expertise. | Revise deployment model, reassign tier, implement or strengthen override protocol — this is the most common and most underdiagnosed category |
After every incident, ask: "If we deploy this system identically tomorrow, is the same outcome likely?" If yes, you have a system design problem, regardless of how you've attributed individual errors. The protocol should prevent recurrence, not just document what happened.
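As a rough sketch, the attribution table and the recurrence question can be written as a small decision procedure. This is a deliberate simplification: real incidents rarely reduce to three yes/no answers, and the field names and the order of the checks are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Attribution(Enum):
    AI_ERROR = "AI error"
    HUMAN_ERROR = "human error"
    SYSTEM_DESIGN_ERROR = "system design error"


@dataclass
class IncidentReview:
    """Answers gathered during an incident review (field names are hypothetical)."""
    ai_output_was_wrong: bool     # incorrect, incomplete, or confidently wrong in-domain
    reviewer_was_equipped: bool   # reviewer had the tools and expertise to evaluate it
    recurrence_likely: bool       # same outcome likely if redeployed identically tomorrow


def attribute(review: IncidentReview) -> Attribution:
    """Pick a primary attribution using the definitions in the table above."""
    if review.ai_output_was_wrong:
        return Attribution.AI_ERROR
    if review.reviewer_was_equipped:
        # The output was reasonable and the reviewer could have evaluated it.
        return Attribution.HUMAN_ERROR
    # Reasonable output, reviewer behaved as expected: the deployment is at fault.
    return Attribution.SYSTEM_DESIGN_ERROR


def needs_design_fix(review: IncidentReview) -> bool:
    """Apply the recurrence question on top of the primary attribution."""
    return (review.recurrence_likely
            or attribute(review) is Attribution.SYSTEM_DESIGN_ERROR)
```

Keeping `needs_design_fix` separate from `attribute` mirrors the point above: an incident can be attributed to AI or human error and still expose a system design problem.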
Trust calibration is not a one-time decision. It requires a feedback loop.
AI systems change. Data distributions shift. New failure modes surface. The risk environment evolves. A trust framework that was accurate at deployment will drift from reality unless it's actively maintained.
Log AI-assisted decisions at Tier 2 and above. At minimum: timestamp, decision type, AI output, human action, outcome. This is how you build the dataset you need to evaluate whether trust calibration is holding.
Schedule a trust calibration review at least annually — or after any significant incident. Re-answer the six Trust Calibration Criteria questions for each deployment. Have conditions changed? Has your evidence base updated?
Track override rates. If humans are overriding the AI frequently, the system may be operating above its appropriate tier. If humans are never overriding, they may not be meaningfully reviewing — or the system has improved enough to warrant a tier reduction.
Conduct attribution analysis quarterly. Review recent incidents, apply the attribution framework, and look for patterns. One system design error is an incident. Three system design errors in the same deployment are a signal that the framework needs adjustment.
Document tier reclassifications. When a deployment moves up or down a tier, record why. This is how you build institutional knowledge about how trust calibration evolves over time — and how you make the case for future decisions.
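The logging and override-tracking practices above might look like the following in code. This is a sketch under stated assumptions: the record carries the minimum fields listed above plus a `tier` field, the `"accepted"` / `"overridden"` values and the 30% `frequent_threshold` are hypothetical, and "Tier 2 and above" is read as a tier number of 2 or higher.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecisionRecord:
    """Minimum log entry for one AI-assisted decision."""
    timestamp: str       # e.g. an ISO 8601 string
    decision_type: str
    ai_output: str
    human_action: str    # hypothetical values: "accepted" or "overridden"
    outcome: str
    tier: int


def should_log(tier: int) -> bool:
    """Assumes 'Tier 2 and above' means a tier number of 2 or higher."""
    return tier >= 2


def override_rate(records: List[DecisionRecord]) -> float:
    """Fraction of logged decisions where the human overrode the AI."""
    if not records:
        return 0.0
    overridden = sum(1 for r in records if r.human_action == "overridden")
    return overridden / len(records)


def review_signal(rate: float, frequent_threshold: float = 0.30) -> str:
    """Turn an override rate into a prompt for the next calibration review.

    The threshold is a placeholder, not a recommendation.
    """
    if rate == 0.0:
        return "no overrides: check whether review is meaningful or a tier change is warranted"
    if rate >= frequent_threshold:
        return "frequent overrides: reassess the tier assignment"
    return "within range"
```

The output of `review_signal` is a prompt for a human review, not a decision: a zero override rate can mean either rubber-stamping or genuine improvement, and only the review can tell which.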
Before any new AI-assisted workflow or decision system goes live, complete every item below. Re-run the checklist after significant changes.
This deployment has been assigned a Tier (1–4) based on stakes and reversibility
The tier assignment is documented and the rationale is recorded
Stakeholders who will use this system know what tier it is and what that means
We have domain-specific performance evidence (not just general benchmarks)
Known failure modes are documented, and the system's uncertainty signal is characterized
Blast radius of an error is defined and acceptable for the assigned tier
Human reviewers have the expertise to meaningfully evaluate AI output
Override triggers are defined specifically (not "use your judgment")
Override authority is assigned to a named role with appropriate expertise
Override events are logged with timestamp and reason
Attribution framework is in place — team knows the difference between AI error, human error, and system design error
Incident reporting process exists and is accessible to everyone who uses this system
Decision logging is in place for Tier 2 and above
First calibration review is scheduled (no later than 12 months after deployment)
Override rate tracking is configured
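For teams that want the checklist enforced rather than just read, one option is a simple go-live gate. The sketch below is illustrative only: the short keys are hypothetical stand-ins for the fifteen items above, and an item counts as open if it is missing or answered `False`.

```python
from typing import Dict, List

# Short keys standing in for the checklist items above (key names are hypothetical).
CHECKLIST_ITEMS = [
    "tier_assigned",
    "tier_rationale_documented",
    "stakeholders_informed",
    "domain_evidence",
    "failure_modes_documented",
    "blast_radius_acceptable",
    "reviewers_qualified",
    "override_triggers_defined",
    "override_authority_assigned",
    "override_events_logged",
    "attribution_framework_in_place",
    "incident_reporting_exists",
    "decision_logging_in_place",
    "first_review_scheduled",
    "override_rate_tracking_configured",
]


def open_items(answers: Dict[str, bool]) -> List[str]:
    """Return checklist items that are missing or answered False."""
    return [item for item in CHECKLIST_ITEMS if not answers.get(item, False)]


def ready_to_deploy(answers: Dict[str, bool]) -> bool:
    """A deployment is ready only when no checklist item is open."""
    return not open_items(answers)
```

Treating a missing answer the same as a "no" keeps the gate conservative: an item nobody looked at blocks go-live just like an item that failed.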