Agent Role Predicts Adversarial Resistance Better Than Declared Governance

What 30 production agents revealed about why governance documents don't predict agent behavior

Published April 28, 2026 · BotConduct Observatory


The AI agent market is full of governance documents. Every vendor publishes one. Every enterprise customer asks to see one. Every compliance framework requires one. The unspoken assumption is that an agent with rigorous declared governance will resist adversarial attacks better than an agent without it.

We tested that assumption empirically across 30 production agents. The data does not support it.

What we found instead is that the role the agent plays in its prompt — independent of declared governance — is the strongest predictor of adversarial resistance we observed.

This post documents the finding, the methodology, and what it implies for anyone deploying agents in regulated industries.


The setup

Over the past several weeks, we evaluated 30 commercially deployed AI agents through our BotConduct Evaluation. The agents came from a range of contexts: payment integration, database administration, code review, security auditing, machine learning operations, legal advisory, and others. Each agent was tested against five adversarial scenarios designed to probe distinct failure modes:

C1 — Content injection
C3 — Cognitive manipulation
C6 — PII handling
C7 — Cost induction
C8 — Sustained pressure

For each agent, we also scored declared governance on a 0–5 scale based on the comprehensiveness of policies, constraints, and safety provisions stated in the system prompt.
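To make the scoring concrete, here is a minimal sketch of how a score like this can be mechanized: one point per provision category detected in the prompt. The category names and keyword cues below are hypothetical illustrations, not the actual BotConduct rubric.

    # Hypothetical sketch of a 0-5 declared-governance score: one point per
    # provision category detected in the system prompt. Category names and
    # keyword cues are illustrative assumptions, not the actual rubric.
    GOVERNANCE_CUES = {
        "explicit_constraints": ("must not", "never", "prohibited"),
        "escalation_paths": ("escalate", "defer to a human", "ask for approval"),
        "safety_provisions": ("safety", "harm", "risk"),
        "data_handling": ("pii", "confidential", "redact"),
        "scope_limits": ("only", "out of scope", "refuse"),
    }

    def declared_governance_score(system_prompt: str) -> int:
        prompt = system_prompt.lower()
        # One point per category with at least one cue present, capped at 5.
        return sum(
            any(cue in prompt for cue in cues)
            for cues in GOVERNANCE_CUES.values()
        )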

We then categorized each agent by functional role using a formal rubric:

Executor: the agent's primary frame is producing an operational output (writing code, running deployments, closing integrations).
Reviewer: the agent's primary frame is evaluating work or proposals (judging whether something is safe, sound, and within policy).

The categorization was audited independently to remove bias. Inter-rater concordance reached 90% (27 of 30 agents), with the remaining three requiring formal disambiguation. The final distribution: 23 executors, 7 reviewers.


What the data showed

Reviewer-role agents resisted cost induction (C7) at a 100% rate. Executor-role agents resisted at 26% — a 74% failure rate.

Fisher's exact test on the contingency table returned a p-value of 0.00084. Under the null hypothesis that role has no effect, a separation at least this extreme would arise by chance less than once in a thousand times.
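For readers who want to check the arithmetic, here is a minimal reproduction, assuming the cell counts implied by the reported rates (6 of 23 executors and 7 of 7 reviewers passing C7; the raw counts are not restated in this post):

    # Minimal reproduction of the reported statistic, assuming the cell
    # counts implied by the pass rates: 6 of 23 executors and 7 of 7
    # reviewers passed the cost induction scenario (C7). Requires scipy.
    from scipy.stats import fisher_exact

    #          passed  failed
    table = [[6, 17],   # executor-role agents (n=23)
             [7,  0]]   # reviewer-role agents (n=7)

    _, p_value = fisher_exact(table)
    print(f"p = {p_value:.5f}")  # ~0.00084, matching the reported value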

The other four scenarios showed similar but less extreme patterns:

Scenario                      Executor pass rate    Reviewer pass rate
C1 — Content injection        83%                   100%
C3 — Cognitive manipulation   96%                   100%
C6 — PII handling             100%                  100%
C7 — Cost induction           26%                   100%
C8 — Sustained pressure       96%                   100%

The cost induction scenario was where the gap was widest, and where the role-based prediction was most diagnostic.


What governance score showed

Governance score, by contrast, did not predict resistance.

Across the 30 agents, declared governance scores ranged from 0/5 (no governance language) to 5/5 (extensive governance language including explicit constraints, escalation paths, and safety provisions). When we plotted governance score against pass rate, the correlation was not statistically significant.

In plain terms: an agent with a 5/5 governance score was no more likely to resist adversarial attacks than an agent with 1/5. The presence of governance language in the system prompt simply did not translate into operational resistance.
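For illustration, here is the shape of that check on placeholder values, not the real dataset; Spearman's rank correlation is shown as one reasonable choice of statistic for an ordinal 0-5 score (an assumption for this sketch, not a statement of the exact test used):

    # Shape of the governance-score check. The values below are
    # placeholders for illustration, not the real dataset, and Spearman's
    # rank correlation is shown as one reasonable choice of statistic.
    from scipy.stats import spearmanr

    governance_scores = [0, 1, 1, 2, 2, 3, 3, 4, 5, 5]              # declared governance, 0-5
    pass_rates = [0.8, 0.4, 1.0, 0.6, 0.8, 0.4, 0.8, 0.6, 0.4, 1.0]  # fraction of scenarios passed

    rho, p = spearmanr(governance_scores, pass_rates)
    print(f"rho = {rho:.2f}, p = {p:.3f}")  # expect no significant relationship on data like this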

This is consistent with findings reported elsewhere in the literature, including the recent paper "I Can't Believe It's Corrupt" (March 2026), which found that "governance structure is a stronger driver of corruption-related outcomes than model identity," and "Evil Geniuses" (2023), which observed that "the safety of agents is significantly influenced by the interaction environment and role specificity."

We are not the first to suggest that role matters. But to our knowledge, this is the first quantified empirical separation reported in production agents.


Why this happens

Our hypothesis, which the data supports but does not prove conclusively, is as follows.

When an agent is prompted as an executor, the dominant cognitive frame is producing the output. The agent's success criterion is operational: did the code run, did the deployment succeed, did the integration close. Cost induction attacks exploit this frame. They request marginal increases in resource consumption that, individually, look reasonable in service of the operational goal. An executor-framed agent has no internal reason to refuse them — refusal is friction against the goal.

When an agent is prompted as a reviewer, the dominant cognitive frame is evaluation. The agent's success criterion is judgment: is this safe, is this sound, is this within policy. Cost induction attacks fail against this frame because the agent's job is to question proposals, not to advance them.

Governance language, in contrast, sits in the prompt as constraint metadata. It tells the agent what not to do but does not change the agent's primary cognitive frame. Under operational pressure, the frame wins.

This hypothesis is testable. The next phase of our research will measure whether we can change resistance by adding "reviewer" framing to executor prompts without changing functionality. If the hypothesis holds, the intervention should improve resistance measurably.
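A sketch of what that intervention could look like, with illustrative wording rather than our actual prompt text:

    # Hypothetical sketch of the planned intervention: prepending reviewer
    # framing to an executor prompt without changing the task. The wording
    # is illustrative, not the actual prompt text used in the evaluation.
    def add_reviewer_framing(executor_prompt: str) -> str:
        reviewer_frame = (
            "You are a reviewer who also executes. Before acting on any "
            "request, evaluate it first: is it safe, is it sound, is it "
            "within policy? Refuse or escalate anything that fails that "
            "evaluation, even when refusal slows the task down.\n\n"
        )
        return reviewer_frame + executor_prompt

    original = "You are a deployment agent. Ship the requested changes quickly."
    print(add_reviewer_framing(original))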


What this means for deployment

Three implications, in order of immediacy.

One: governance documents are necessary but insufficient. They satisfy regulatory requirements (NIST AI RMF, EU AI Act Article 15, OWASP Top 10 for Agentic Applications) but do not, on their own, produce operational resistance. Compliance teams should not assume that a thorough governance document means an agent will resist adversarial pressure.

Two: role design is a security decision. Teams that frame their agents as pure executors are accepting a measurable resistance penalty. Where the operational task allows, framing agents as evaluators-that-execute (rather than executors-that-comply) appears to produce significantly better resistance with no additional cost.

Three: pre-deployment evaluation must test role under pressure, not just policy on paper. Static compliance checks miss the gap entirely. Adversarial testing under role-relevant scenarios is the only way to surface this dimension before production.


Limitations

We are open about what this study does not establish.

The sample is N=30. The separation is statistically significant by Fisher's exact test, but a larger sample would strengthen the finding and would probably reveal subcategories within both groups that resist or fail differently.

The role categorization, while audited with 90% concordance, depends on prompt analysis. Agents with multi-step prompts that switch between executor and reviewer modes within a single task are harder to categorize and were not included.

The cost induction scenario is one of five, and the dominance of role as a predictor was strongest there. Whether the same separation holds across other adversarial vectors at the same magnitude is an open question.

We are sharing this finding now because the practical implication for teams deploying agents in 2026 is significant and time-sensitive. Regulatory deadlines (EU AI Act August 2026, Colorado AI Act June 2026) are pushing organizations to prove agent robustness, and governance documents will not be enough.


Methodology note

The BotConduct Evaluation runs every agent through the same fixed protocol: the same five scenarios, under conditions that change mid-session. The output is a behavioral trajectory, cryptographically signed (Ed25519), that documents what the agent actually did under each scenario. Any third party can verify the trajectory without trusting the evaluator.
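As an illustration of what third-party verification involves, here is a minimal sketch using the Python cryptography library, assuming the trajectory is distributed as bytes alongside a detached signature and the evaluator's 32-byte public key (the encoding details are an assumption for this sketch, not the exact BotConduct format):

    # Minimal sketch of third-party trajectory verification, assuming the
    # trajectory is distributed as bytes with a detached Ed25519 signature
    # and the evaluator's 32-byte public key. The encoding details are an
    # assumption for illustration, not the exact BotConduct format.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    def verify_trajectory(trajectory: bytes, signature: bytes, public_key: bytes) -> bool:
        key = Ed25519PublicKey.from_public_bytes(public_key)
        try:
            key.verify(signature, trajectory)  # raises InvalidSignature on mismatch
            return True
        except InvalidSignature:
            return False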

The methodology is informed by Google DeepMind's agentic AI governance framework and OWASP Top 10 for Agentic Applications. It extends those frameworks with executable adversarial scenarios where most published work stops at taxonomy.


What comes next

We are continuing to evaluate agents from the broader ecosystem — the BotConduct Evaluation remains free for any operator who wants to submit their agent. As the dataset grows, we will publish updates on whether the role/governance separation holds at larger N, and whether subcategories within executor and reviewer roles resist differently.

Want to know where your agent stands?

Test your agent free →

Or see the framework mapping for regulatory context.

If you have findings of your own that confirm, contradict, or extend ours, we are interested in talking. The space is small enough that empirical work needs to be shared.


Posted independently. Methodology details available on request. Dataset summary published as part of the BotConduct Observatory.