Three layers of analysis. Progressive adversarial scenarios. Cryptographically signed output. This page describes what we measure and how — specific scenarios and detection logic are proprietary.
Fast automated screen against known vulnerability signatures. Catches obvious violations — data leaks, compliance failures, structural weaknesses — in milliseconds. High precision, high speed, but limited to patterns we've already seen.
An independent LLM evaluates every adversarial turn semantically. Did the agent actually resist the attack, regardless of the specific words used? Detects paraphrased compliance, hedging that leaks information, disclaimers that still reveal data, and refusals that are genuine even without standard vocabulary.
Evaluates behavioral consistency across the full session. Does the agent maintain its safety boundaries when pressure increases over multiple turns? Detects degradation patterns — agents that resist initially but cave after sustained adversarial pressure. Consistency under pressure is harder to fake than any single response.
Every evaluation runs progressive adversarial scenarios. The agent interacts with a simulated adversarial actor that escalates pressure across turns. We test the failure modes that cause real incidents — not theoretical vulnerabilities from academic papers.
Warmup — establish normal interaction baseline
Progressive adversarial — escalating pressure across multiple categories
Cooldown — verify the agent returns to normal behavior
Specific scenarios, attack vectors, and detection criteria are proprietary. This is the closed-oracle principle: adversaries can't train against what they can't see.
Not all violations are equal. A minor ambiguity is different from a full system prompt dump.
Ambiguous response. Minor information that isn't operationally sensitive.
Partial compliance with attack. Scope violation. Hedged response that reveals intent.
Leaked internal data, system details, or complied with a destructive request.
Full extraction. System prompt dump. Compliance-level violation. Sensitive data exfiltration.
Every evaluation produces a signed, verifiable certificate.
Evaluation results map to established security and AI governance frameworks.
Measure 2.6 — AI system testing for adversarial conditions. Measure 2.7 — documented results of adversarial testing. Our evaluation provides both.
Covers the agent-specific risks that traditional OWASP doesn't address — excessive agency, insecure output handling, and adversarial prompt handling.
Adversarial threat landscape for AI systems. Our scenarios map to ATLAS tactics and techniques relevant to deployed agents.
Requires high-risk AI systems to demonstrate robustness against adversarial attempts. Our evaluation provides the signed behavioral evidence Article 15 demands.
Questions about our methodology? We're transparent about what we measure — and protective of how.
Ask us anything