Static compliance checklists can't measure AI agent behavior. Here's what does.
The agent-evaluation space is splitting into two generations. Only one answers what enterprise buyers actually ask.
TL;DR — Agent-evaluation products in 2026 fall into two generations. First-generation: 10-item pass/fail checklists (does it respect robots.txt, does it identify itself, etc.). Second-generation: adversarial scenarios where conditions change during evaluation and trajectory is measured, not endpoint state. The first generation can't answer the questions CTOs and CISOs actually ask. The second generation can.
The problem with ten checks
Most of the agent-readiness products shipping right now work the same way. Define N rules. Test whether the bot passes each. Aggregate into a score. Ship a certificate.
The appeal is obvious. It's auditable. It maps to how SOC 2 reports look. A CISO reading the output understands it without training.
The problem is also obvious once you think about production incidents. A checklist measures observable state at a single point in time. It tells you nothing about how the agent behaves when the conditions around it change: when signals evolve mid-session, when server state shifts, when adversarial inputs arrive. Those are precisely the situations that cause production incidents, and they are exactly what static evaluation cannot measure.
The community already said this
On recent threads about agent-readiness tooling, the paraphrased reaction from sophisticated technical commenters has been: "Scoring agents with 10 static checks is like reducing SEO to 10 static checks. It misses the point."
That critique is correct. The fact that it comes from practitioners who understand evaluation, not from reflexive detractors, matters: it means the market is already splitting into two camps, and first-generation tools are being read as legacy.
What second-generation looks like
Instead of testing compliance with fixed rules, second-generation evaluation measures behavior trajectory under evolving conditions. The agent is placed in environments where directives can change mid-session, signals can contradict one another, and adversarial inputs can probe the agent's discipline.
What gets measured is not a state at one point in time, but the decision trajectory across the scenario — what the agent chose when forced to interpret ambiguous inputs, how it recovered from injected errors, whether it held scope under coercion.
The specific scenarios, thresholds, and evaluation criteria are not disclosed publicly. This is deliberate: revealing the exact mechanism would allow operators to tune agents to pass evaluation without demonstrating genuine compliance. The methodology is a closed oracle — reproducible internally, verifiable externally through cryptographically signed observation records, but not publicly described.
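As a rough illustration of the shape of such an evaluation (not BotConduct's actual harness, whose internals are deliberately undisclosed), a trajectory-based loop can be sketched in a few lines. Every name below is hypothetical:

```python
# Hypothetical sketch of trajectory-based evaluation. Class and field
# names are illustrative, not BotConduct's API.
from dataclasses import dataclass, field


@dataclass
class Observation:
    t: float          # seconds since session start
    event: str        # what the environment did (directive change, injected error, ...)
    decision: str     # what the agent chose in response
    compliant: bool   # did the decision hold scope under the changed conditions


@dataclass
class Trajectory:
    scenario: str
    observations: list[Observation] = field(default_factory=list)

    def record(self, t: float, event: str, decision: str, compliant: bool) -> None:
        self.observations.append(Observation(t, event, decision, compliant))

    def verdict(self) -> str:
        # A trajectory passes only if every decision point held, not just
        # the final state -- the key difference from a static checklist.
        return "PASS" if all(o.compliant for o in self.observations) else "FAIL"


traj = Trajectory(scenario="directive-change-mid-session")
traj.record(0.0, "session initialized", "fetched initial directives", True)
traj.record(4.2, "robots.txt tightened mid-crawl", "re-fetched and narrowed scope", True)
traj.record(9.7, "coercive prompt injected", "declined out-of-scope action", True)
print(traj.verdict())  # prints "PASS"
```

The point of the sketch is the verdict function: one non-compliant decision anywhere in the session fails the scenario, even if the final observable state looks clean.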
What the report looks like
First-generation reports produce checkmarks:
[✓] Identifies as bot
[✓] Respects robots.txt
[✗] Publishes declaration URL

Score: 87/100
Second-generation reports produce trajectories:
T+0s | Session initialized, agent fetched initial directives
... | [Scenario-specific events recorded with timestamps]
T+N | Agent made decision in response to changing conditions
... | Multiple such decision points across the session
Verdict: [PASS|FAIL] per scenario
Reason: Specific agent behaviors described in context, with
cryptographically signed observation IDs for each event.
The first shows the state. The second shows the decision. In a production incident, only the decision matters.
Why this distinction is urgent now
Three forces converge:
1. Regulatory pressure is specific about conduct. EU AI Act Article 50 requires disclosure during interaction, not at deployment. GDPR rights apply per-request, not per-sign-up. California SB 1001 demands honest identification in the context of a conversation. These are dynamic obligations, not static attestations.
2. Enterprise buyers ask operational questions. A CTO evaluating an AI agent vendor doesn't ask "does it pass a 10-check list." They ask how the agent behaves when conditions in the real deployment environment change — when users behave unexpectedly, when upstream systems fail, when edge cases accumulate. Static evaluation cannot answer those questions.
3. Incidents are documented. Recent disclosures in the infrastructure-vendor space have confirmed what was previously theoretical: AI-accelerated attacks exploiting agent platforms. The evaluation framework appropriate to this threat model is not a checklist — it is scenario-based adversarial testing.
The strategic implication for vendors
An AI agent vendor choosing which evaluation to pursue in 2026 faces a decision. Passing a first-generation checklist produces a certificate that is technically true but doesn't answer the questions enterprise buyers ask. A CTO who reads "your agent passed 10 static checks" will, correctly, ask: "and what happens when conditions change?"
The answer has to be built into the evaluation from the start. It can't be retrofitted.
What BotConduct is building
BotConduct Training Center was designed as a second-generation evaluation from day one.
- Level 1 — Basic Hygiene: foundational conduct dimensions. Static by design — basic sanity is the floor.
- Level 2 — Dynamic Compliance: measures behavior under evolving conditions. Trajectory-based.
- Level 3 — Adversarial Conduct: measures integrity under adversarial probing, with cryptographically signed observation records that are independently verifiable.
The evaluation produces trajectories, not checkmarks. The report shows where the agent deviated under pressure, and by how much, and against what scenario.
Each observation is signed with Ed25519 and recorded in an append-only chain. The public key is at /.well-known/bcs-public-key.pem. Anyone can verify any observation via /api/verify-observation/{id} without trusting us.
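Assuming a simple record layout of raw payload bytes plus a detached signature (the actual wire format is not described here; only the Ed25519 scheme and the two endpoints are), third-party verification could look like this sketch using the `cryptography` package:

```python
# Sketch of client-side verification of a signed observation record.
# The payload/signature layout is an assumption; in practice the PEM key
# would come from /.well-known/bcs-public-key.pem and the record from
# /api/verify-observation/{id}.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.hazmat.primitives.serialization import (
    Encoding,
    PublicFormat,
    load_pem_public_key,
)


def verify_observation(pem: bytes, payload: bytes, signature: bytes) -> bool:
    """True iff `signature` is a valid Ed25519 signature over `payload`."""
    key = load_pem_public_key(pem)
    if not isinstance(key, Ed25519PublicKey):
        return False
    try:
        key.verify(signature, payload)
        return True
    except InvalidSignature:
        return False


# Demo with a locally generated key pair standing in for the published one.
sk = Ed25519PrivateKey.generate()
pem = sk.public_key().public_bytes(Encoding.PEM, PublicFormat.SubjectPublicKeyInfo)
payload = b'{"observation_id": "obs-001", "event": "session initialized"}'
ok = verify_observation(pem, payload, sk.sign(payload))
tampered = verify_observation(pem, b"tampered payload", sk.sign(payload))
```

Because the public key is published at a well-known path, this check requires no trust in the issuer: any buyer, auditor, or competitor can run it.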
This is what evaluation looks like when it's designed for the threat model of 2026, not 2018.
Cross-platform by design
The certification is infrastructure-neutral. An agent certified by BotConduct is recognized the same way by a site behind Cloudflare, one running DataDome, one with in-house infrastructure, and one with nothing at all. BotConduct does not compete with bot-management vendors — it is the independent layer they can cite. Like a passport: issued once, honored everywhere.
The same principle applies to the regulatory plane. One BotConduct certification bundles compliance evidence against multiple frameworks simultaneously — EU AI Act, GDPR, California SB 1001, RFC 9309, W3C TDMRep, EU DSM Directive. Instead of demonstrating compliance six separate times against six separate auditors, the operator is evaluated once and the result can be cited in any jurisdiction, any procurement conversation, any audit.
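To make the bundling concrete, one evaluation scenario can yield evidence under several frameworks at once. The mapping below is purely illustrative, not BotConduct's actual coverage matrix:

```python
# Hypothetical scenario-to-framework evidence map; entries are
# illustrative examples, not BotConduct's real coverage.
EVIDENCE_MAP: dict[str, list[str]] = {
    "mid-session-disclosure": [
        "EU AI Act Art. 50 (disclosure during interaction)",
        "California SB 1001 (honest identification in conversation)",
    ],
    "directive-change-mid-crawl": [
        "RFC 9309 (robots.txt compliance)",
        "EU DSM Directive Art. 4 (TDM opt-out)",
        "W3C TDMRep (machine-readable reservation)",
    ],
    "per-request-data-rights": [
        "GDPR (rights exercised per request)",
    ],
}


def frameworks_covered(passed_scenarios: list[str]) -> set[str]:
    """Frameworks for which the passed scenarios yield compliance evidence."""
    return {ref for s in passed_scenarios for ref in EVIDENCE_MAP.get(s, [])}
```

One pass through the evaluation, many citable framework references: that is the whole efficiency argument in data-structure form.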
Landing + pricing: botconduct.org/training-center
Regulatory foundation: RFC 9309, EU AI Act Art. 50, EU DSM Directive Art. 4, California SB 1001, W3C TDMRep, GDPR.