Claude Fable 5: What Authorizes an Agent's Next Move?

This week Anthropic released Claude Fable 5 and Claude Mythos 5. Anthropic claims Mythos 5 has the strongest cybersecurity capabilities of any model in the world, and Fable 5 is the same model with safeguards bolted on for general use. Vulnerability discovery, a craft that used to take years of security expertise to develop, is now available through an API.

That is worth celebrating. It is also worth reading the fine print on how that capability is guarded, because the fine print tells you something uncomfortable about every AI guardrail you are relying on today.

Anyone can do security work now. And that's the problem.

For decades, finding and fixing vulnerabilities required domain expertise: memory layouts, injection patterns, authorization edge cases. That expertise acted as a natural gate. The people probing your systems for flaws mostly understood the blast radius of what they were doing.

That gate is gone. Developers now ask a model to audit their code, patch the findings, and ship. The research on what happens next is consistent and unflattering. Veracode's testing of over 100 models found roughly 45% of AI-generated code introduced at least one OWASP-category vulnerability.[1]

The pattern matters more than the numbers: AI raises confidence faster than it raises competence. If the same model writes the code, reviews it, and signs off on its security, it will miss the same flaws at every step. Nothing independent ever looks at the work.

Anthropic's own guardrail math

Here is the part that should reframe how you think about guardrails. Fable 5's cyber safeguards work by classifier: when a request looks like offensive security, a filter catches it and routes the response to an older model. Anthropic reports these safeguards trigger in under 5% of sessions.[2]

A classifier is a model judging a model. We have data on how that strategy performs when the stakes are real. Anthropic's own evaluation of Claude Code's auto mode, the permission classifier that decides whether an agent's action is dangerous, found a 17% false-negative rate on real overeager actions. Roughly one in six dangerous actions slips through.[3] An independent adversarial stress test put the end-to-end false-negative rate at 81%.[4]

To be clear, this is not an Anthropic failure. It is the best result one of the best-resourced labs in the world can get from probabilistic enforcement, on its own models, with full access to the internals. If that is the ceiling for probabilistic guardrails built by the model's own creators, the prompt-based allowlists and regex filters most teams hand-roll around their agents sit well below it.

Authorization can't rely solely on probabilistic judgments. It needs deterministic answers that set the bounds an agent operates within.

The loop is deterministic, even when the model isn't

This is where architecture rescues you, if you use it. Every agentic coding tool runs the same turn cycle: (1) a prompt goes in, (2) the tool sends it to the LLM, (3) the LLM returns tool calls, (4) the agent executes those calls, and (5) the results feed the next prompt.

Steps two and three are probabilistic. You cannot fully control what the model decides. But step four, the moment between "the model wants to run this" and "this runs," passes through the tool's code on every single turn. It is deterministic. It cannot be prompt-injected, cannot hallucinate, and cannot be talked out of a policy.
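
To make the position of that check concrete, here is a minimal sketch of one turn with a pre-tool-call gate. Everything in it is an assumption made for illustration: the ToolCall type, the ALLOWED table, and the fake_model and fake_execute stand-ins are hypothetical and are not taken from any real coding tool. The prefix match on arguments is deliberately naive.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


# Hypothetical policy: (tool, required argument prefix) pairs this agent may use.
# Anything not listed here is denied.
ALLOWED = {("read_file", "src/"), ("run_tests", "")}


def is_allowed(call: ToolCall) -> bool:
    """Deterministic check: same input, same answer, no model involved."""
    return any(
        call.tool == tool and call.argument.startswith(prefix)
        for tool, prefix in ALLOWED
    )


def run_turn(
    prompt: str,
    model: Callable[[str], List[ToolCall]],
    execute: Callable[[ToolCall], str],
) -> List[str]:
    """One turn: the model proposes tool calls, the gate decides, the tool runs."""
    results = []
    for call in model(prompt):        # probabilistic: whatever the model proposes
        if not is_allowed(call):      # deterministic: runs before every execution
            results.append(f"DENIED {call.tool}")
            continue
        results.append(execute(call))
    return results                    # these results feed the next prompt


def fake_model(prompt: str) -> List[ToolCall]:
    # Stand-in for the LLM. The second call imitates a prompt-injected request.
    return [ToolCall("read_file", "src/app.py"), ToolCall("shell", "rm -rf /")]


def fake_execute(call: ToolCall) -> str:
    # Stand-in for real tool execution.
    return f"OK {call.tool}"


results = run_turn("audit the code", fake_model, fake_execute)
```

Whatever the model proposes, the gate is ordinary code that runs before each execution, so the denied shell call never reaches fake_execute.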

Most security products ignore this point. Runtime guardrails sit on the LLM path and inherit its probabilistic nature. MCP gateways sit on the tool-egress path but only see traffic that routes through MCP; native shell, file, and network calls bypass them entirely. The in-loop position, pre-tool-call, inside the agentic loop itself, is the one place where enforcement is deterministic and sees every tool call the agent makes. It is also where almost nobody builds.

DIY guardrails versus encoded expertise

OWASP now ranks Excessive Agency (agents holding more functionality, permissions, and autonomy than their task requires) among the top LLM application risks.[5] The standard mitigations are the same principles the authorization field settled on long ago: minimum necessary tools, minimum necessary privileges, explicit approval for high-impact actions.

None of those principles are new, and that is the point. Deny-by-default policies, least privilege, relationship-based access control in the Zanzibar model, delegation chains, audit trails: these took the industry twenty years and several public failures to get right. A developer writing an if statement around a tool call is re-deriving that history from scratch, under deadline, for an adversary that includes prompt injection. The Replit incident, where an agent deleted a production database during an explicit code freeze, is what re-derivation failure looks like.[6]
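
As a rough illustration of how those principles combine, the sketch below applies deny-by-default, per-task least privilege, and explicit approval for high-impact actions in a single decision function. The task name, tool names, and the GRANTS and HIGH_IMPACT tables are hypothetical, chosen only for this example. A production policy engine would also cover identity, delegation, and audit, which this sketch leaves out.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_APPROVAL = "needs_approval"


# Hypothetical grants: each task gets only the tools it needs (least privilege).
GRANTS = {
    "db-migration": {"read_schema", "run_migration"},
}

# Hypothetical set of tools that always require a human sign-off.
HIGH_IMPACT = {"run_migration"}


def decide(task: str, tool: str, approved: bool = False) -> Decision:
    """Deny by default; allow only granted tools; gate high-impact ones."""
    granted = GRANTS.get(task, set())   # unknown task -> empty grant set
    if tool not in granted:
        return Decision.DENY            # nothing is allowed unless listed
    if tool in HIGH_IMPACT and not approved:
        return Decision.NEEDS_APPROVAL  # explicit approval for high impact
    return Decision.ALLOW
```

Under these assumptions, decide("db-migration", "drop_database") returns Decision.DENY because the tool was never granted. Nobody had to anticipate it and block it.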

This is the real DIY-versus-experts question. Not "can AI find vulnerabilities," but "who encoded the rules for what the agent may do about them."

Where Ory sits

Ory Agent Security enforces policy at exactly that in-loop position. Every tool call hits a policy check in Ory Keto before it executes, against deny-by-default rules evaluated deterministically on every turn. No coverage gap for non-MCP calls, no classifier deciding whether to look.

Identity comes with it. Each agent session gets its own OAuth2 identity through dynamic client registration, bound to the principal it acts for, whether that is a person, a service, or another agent. Zanzibar-style delegation tuples in Keto make the full chain queryable, so the audit answer to "who let the agent do that" is a lookup, not a forensics project. It ships as Ory Agent DX: drop-in plugins for five agentic coding tools (Claude Code, Gemini CLI, Codex, OpenCode, and OpenClaw) plus adapters for the major agent SDKs, with Ory best practices as the starting policy rather than a blank page.
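
To show why a delegation chain stored as relation tuples turns the audit question into a lookup, here is a small in-memory sketch in the Zanzibar style. It is not Ory Keto's API or data model: the RelationTuple type, the "delegator" relation, and all identifiers are assumptions made for illustration, and the sketch assumes each session has at most one delegator.

```python
from typing import List, NamedTuple


class RelationTuple(NamedTuple):
    obj: str       # the thing the relation is about
    relation: str  # e.g. "delegator" or "writer"
    subject: str   # who holds the relation


# Hypothetical tuples: alice delegated to agent-42, which delegated to subagent-7.
TUPLES = {
    RelationTuple("session:agent-42", "delegator", "user:alice"),
    RelationTuple("session:subagent-7", "delegator", "session:agent-42"),
    RelationTuple("repo:payments", "writer", "session:subagent-7"),
}


def check(obj: str, relation: str, subject: str) -> bool:
    """Direct lookup: is this exact tuple present? Absent means denied."""
    return RelationTuple(obj, relation, subject) in TUPLES


def delegation_chain(subject: str) -> List[str]:
    """Walk 'delegator' tuples from a session back to the original principal."""
    chain = [subject]
    seen = {subject}
    while True:
        parents = sorted(
            t.subject
            for t in TUPLES
            if t.obj == chain[-1] and t.relation == "delegator"
        )
        # Stop at a root principal, or if the data contains a cycle.
        if not parents or parents[0] in seen:
            return chain
        chain.append(parents[0])
        seen.add(parents[0])


chain = delegation_chain("session:subagent-7")
```

Here chain lists the sub-agent session, the agent session that delegated to it, and finally user:alice. That ordered list is the answer to "who let the agent do that".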

Every plugin is materialized from a single canonical source, so the enforcement behavior is identical no matter which coding tool your team runs. The plugins are free to download here:

Closing remarks

Guardrails still matter; they catch injection attempts early. Sandboxes still matter; they limit blast radius. But between detection and containment there has to be a layer that decides, deterministically and per action, what this specific agent acting for this specific principal (a human, a service, or another agent) is allowed to do. Fable 5 just made agents dramatically better at security work. The question it leaves open is the one Ory answers: authenticated, authorized, accountable, on every turn.