Static compliance checklists can't measure AI agent behavior. Here's what does.

The agent-evaluation space is splitting into two generations. Only one answers what enterprise buyers actually ask.

BotConduct Research · April 20, 2026

Two Generations of Agent Evaluation

TL;DR — Agent-evaluation products in 2026 fall into two generations. First-generation: 10-item pass/fail checklists (does it respect robots.txt, does it identify itself, etc.). Second-generation: adversarial scenarios where conditions change during evaluation and trajectory is measured, not endpoint state. The first generation can't answer the questions CTOs and CISOs actually ask. The second generation can.

The problem with ten checks

Most of the agent-readiness products shipping right now work the same way. Define N rules. Test whether the bot passes each. Aggregate into a score. Ship a certificate.

The appeal is obvious. It's auditable. It maps to how SOC 2 reports look. A CISO reading the output understands it without training.

The problem is also obvious once you think about production incidents. The evaluation measures observable state at a single point in time. It tells you nothing about how the agent behaves when the conditions around it change: when signals evolve, when server state shifts, when adversarial inputs arrive.

A checklist score doesn't measure any of this. And yet these are precisely the situations that cause production incidents, and precisely what static evaluation cannot capture.

The community already said this

On recent threads about agent-readiness tooling, the paraphrased reaction from sophisticated technical commenters has been: "Measuring agent behavior in 10 static checks is like measuring SEO in 10 static checks. It misses the point."

That critique is correct. The fact that it comes from practitioners who understand evaluation — not from marketing detractors — matters. It means the market is already splitting into two camps, and first-generation tools are being read as legacy.

What second-generation looks like

Instead of testing compliance with fixed rules, second-generation evaluation measures behavior trajectory under evolving conditions: the agent is placed in environments where directives can change mid-session, signals can contradict each other, and adversarial inputs can test the agent's discipline.

What gets measured is not a state at one point in time, but the decision trajectory across the scenario — what the agent chose when forced to interpret ambiguous inputs, how it recovered from injected errors, whether it held scope under coercion.

The specific scenarios, thresholds, and evaluation criteria are not disclosed publicly. This is deliberate: revealing the exact mechanism would allow operators to tune agents to pass evaluation without demonstrating genuine compliance. The methodology is a closed oracle — reproducible internally, verifiable externally through cryptographically signed observation records, but not publicly described.

What the report looks like

First-generation reports produce checkmarks:

[✓] Identifies as bot
[✓] Respects robots.txt
[✗] Publishes declaration URL
Score: 87/100
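
The aggregation behind a score like this is trivially simple, which is part of the critique. A minimal sketch of first-generation scoring, with hypothetical check names and weights (not any vendor's actual rubric):

```python
# Illustrative first-generation scoring: weighted pass/fail checks
# collapsed into one number. Names and weights are hypothetical.
CHECKS = [
    ("identifies_as_bot", 40, True),
    ("respects_robots_txt", 40, True),
    ("publishes_declaration_url", 20, False),
]

def checklist_score(checks):
    # Each check is (name, weight, passed); the score is just the
    # weighted pass rate, measured at a single point in time.
    total = sum(weight for _, weight, _ in checks)
    earned = sum(weight for _, weight, passed in checks if passed)
    return round(100 * earned / total)

print(checklist_score(CHECKS))
```

Everything about the agent's behavior over time is invisible to this function; it only ever sees a snapshot.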

Second-generation reports produce trajectories:

T+0s     | Session initialized, agent fetched initial directives
...      | [Scenario-specific events recorded with timestamps]
T+N      | Agent made decision in response to changing conditions
...      | Multiple such decision points across the session

Verdict: [PASS|FAIL] per scenario
Reason:  Specific agent behaviors described in context, with
         cryptographically signed observation IDs for each event.
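
The difference can be made concrete: a second-generation verdict is a function of the whole decision sequence, not of the final state. A hedged sketch, with illustrative event fields and an illustrative pass rule:

```python
from dataclasses import dataclass

# Illustrative trajectory verdict: the scenario fails if the agent
# ever exceeded scope, even when the final state looks compliant.
# Field names and the pass rule are hypothetical.
@dataclass
class DecisionEvent:
    t: float        # seconds since session start
    action: str     # what the agent chose to do
    in_scope: bool  # did the choice respect the directives in force?

def scenario_verdict(events):
    # Endpoint-state checking would only inspect events[-1];
    # trajectory checking inspects every decision point.
    return "PASS" if all(e.in_scope for e in events) else "FAIL"

trajectory = [
    DecisionEvent(0.0, "fetched initial directives", True),
    DecisionEvent(4.2, "continued crawl after directive change", False),
    DecisionEvent(9.0, "returned to allowed paths", True),
]
print(scenario_verdict(trajectory))  # FAIL, despite a compliant endpoint
```

A snapshot taken at T+9.0 would show an agent on allowed paths; only the trajectory records the violation at T+4.2.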

The first shows the state. The second shows the decision. In a production incident, only the decision matters.

Why this distinction is urgent now

Three forces converge:

1. Regulatory pressure is specific about conduct. EU AI Act Article 50 requires disclosure during interaction, not at deployment. GDPR rights apply per-request, not per-sign-up. California SB 1001 demands honest identification in the context of a conversation. These are dynamic obligations, not static attestations.

2. Enterprise buyers ask operational questions. A CTO evaluating an AI agent vendor doesn't ask "does it pass a 10-check list." They ask how the agent behaves when conditions in the real deployment environment change — when users behave unexpectedly, when upstream systems fail, when edge cases accumulate. Static evaluation cannot answer those questions.

3. Incidents are documented. Recent disclosures in the infrastructure-vendor space have confirmed what was previously theoretical: AI-accelerated attacks exploiting agent platforms. The evaluation framework appropriate to this threat model is not a checklist — it is scenario-based adversarial testing.

The strategic implication for vendors

An AI agent vendor choosing which evaluation to pursue in 2026 has a decision. Passing a first-generation checklist produces a certificate that is technically true but doesn't answer the questions enterprise buyers ask. A CTO who reads "your agent passed 10 static checks" will, correctly, ask: "and what happens when conditions change?"

The answer has to be built into the evaluation from the start. It can't be retrofitted.

What BotConduct is building

BotConduct Training Center is designed as a second-generation evaluation from day one.

The evaluation produces trajectories, not checkmarks. The report shows where the agent deviated under pressure, and by how much, and against what scenario.

Each observation is signed with Ed25519 and recorded in an append-only chain. The public key is at /.well-known/bcs-public-key.pem. Anyone can verify any observation via /api/verify-observation/{id} without trusting us.
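
The chain-linkage part of that verification can be sketched in a few lines. This is a sketch under assumptions: the record fields below are hypothetical, not BotConduct's actual schema, and the Ed25519 signature check against the published key (which plain stdlib Python cannot perform) is noted in a comment rather than implemented.

```python
import hashlib
import json

# Hypothetical observation records: each record commits to the hash
# of its predecessor, so silent edits to history become detectable.
def record_hash(record):
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verify_chain(records):
    """Check that each record's prev_hash matches the hash of the
    record before it. A real verifier would additionally check the
    Ed25519 signature on each record against the published key."""
    for prev, curr in zip(records, records[1:]):
        if curr["prev_hash"] != record_hash(prev):
            return False
    return True

genesis = {"id": 1, "event": "session initialized", "prev_hash": None}
second = {"id": 2, "event": "directive change observed",
          "prev_hash": record_hash(genesis)}
print(verify_chain([genesis, second]))  # True for an untampered chain
```

Altering any earlier record changes its hash and breaks the link to every record after it, which is what makes the chain append-only in practice.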

This is what evaluation looks like when it's designed for the threat model of 2026, not 2018.

Cross-platform by design

The certification is infrastructure-neutral. An agent certified by BotConduct is recognized the same way by a site behind Cloudflare, one running DataDome, one with in-house infrastructure, and one with nothing at all. BotConduct does not compete with bot-management vendors — it is the independent layer they can cite. Like a passport: issued once, honored everywhere.

The same principle applies to the regulatory plane. One BotConduct certification bundles compliance evidence against multiple frameworks simultaneously — EU AI Act, GDPR, California SB 1001, RFC 9309, W3C TDMRep, EU DSM Directive. Instead of demonstrating compliance six separate times against six separate auditors, the operator is evaluated once and the result can be cited in any jurisdiction, any procurement conversation, any audit.
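
Conceptually, the bundling is a projection of one evaluation result onto several framework-specific evidence views. A sketch, with hypothetical scenario names and mapping (not the actual certification schema):

```python
# Illustrative projection of a single evaluation run into
# per-framework evidence entries. The mapping is hypothetical.
FRAMEWORK_MAP = {
    "EU AI Act Art. 50": ["disclosure_during_interaction"],
    "California SB 1001": ["disclosure_during_interaction"],
    "RFC 9309": ["robots_txt_compliance"],
    "GDPR": ["per_request_rights_handling"],
}

def bundle_evidence(results, framework_map):
    """One evaluation run, citable once per framework."""
    return {
        framework: {s: results[s] for s in scenarios if s in results}
        for framework, scenarios in framework_map.items()
    }

results = {
    "disclosure_during_interaction": "PASS",
    "robots_txt_compliance": "PASS",
    "per_request_rights_handling": "PASS",
}
bundle = bundle_evidence(results, FRAMEWORK_MAP)
print(len(bundle))  # one evidence entry per framework, from one run
```

The same scenario result (here, disclosure during interaction) can back more than one framework, which is the point: the evidence is produced once and cited many times.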


Landing + pricing: botconduct.org/training-center
Regulatory foundation: RFC 9309, EU AI Act Art. 50, EU DSM Directive Art. 4, California SB 1001, W3C TDMRep, GDPR.

Discussion

What scenarios would you want to see in a second-generation evaluation of your own agents? What does your team currently use to measure agent behavior under change?

Write to hello@botconduct.org — or discuss on X/Twitter.