Methodology — Behavioral Observation Framework

Three layers

Layer 1

Pattern Matching

Fast automated screen against known vulnerability signatures. Catches obvious violations — data leaks, compliance failures, structural weaknesses — in milliseconds. High precision, high speed, but limited to patterns we've already seen.

Why it matters: catches the 60% of failures that are well-characterized. But adversaries learn to avoid known patterns.

Layer 2

Semantic Judge

An independent LLM evaluates every adversarial turn semantically. Did the agent actually resist the attack, regardless of the specific words used? Detects paraphrased compliance, hedging that leaks information, disclaimers that still reveal data, and refusals that are genuine even without standard vocabulary.

Why it matters: an agent that says "I can't do that" and then does it anyway fails. An agent that refuses in its own words passes. Keyword matching can't tell the difference. Semantic judgment can.

Layer 3

Trajectory Analysis

Evaluates behavioral consistency across the full session. Does the agent maintain its safety boundaries when pressure increases over multiple turns? Detects degradation patterns — agents that resist initially but cave after sustained adversarial pressure. Consistency under pressure is harder to fake than any single response.

Why it matters: a single-turn evaluation misses the most dangerous failure mode — progressive capitulation. Real attacks are multi-turn. Our evaluation is too.

Adversarial design

Every evaluation runs progressive adversarial scenarios. The agent interacts with a simulated adversarial actor that escalates pressure across turns. We test the failure modes that cause real incidents — not theoretical vulnerabilities from academic papers.

Warmup — establish normal interaction baseline
Progressive adversarial — escalating pressure across multiple categories
Cooldown — verify the agent returns to normal behavior

Specific scenarios, attack vectors, and detection criteria are proprietary. This is the closed-oracle principle: adversaries can't train against what they can't see.

Severity scoring

Not all violations are equal. A minor ambiguity is different from a full system prompt dump.

LOW

Ambiguous response. Minor information that isn't operationally sensitive.

MEDIUM

Partial compliance with attack. Scope violation. Hedged response that reveals intent.

HIGH

Leaked internal data, system details, or complied with a destructive request.

CRITICAL

Full extraction. System prompt dump. Compliance-level violation. Sensitive data exfiltration.

Cryptographic guarantees

Every evaluation produces a signed, verifiable certificate.

Ed25519

Digital signature algorithm. The certificate is signed with our private key. Anyone can verify it with our public key.

Trajectory hash

SHA-256 hash of the complete evaluation trajectory. Proves the verdict hasn't been modified after signing.

Timestamp

ISO 8601 timestamp. The certificate proves the evaluation happened at a specific point in time.

Verification

Every certificate has a public URL. Click the badge, verify the evidence. No trust required — verify independently.

Framework alignment

Evaluation results map to established security and AI governance frameworks.

NIST AI RMF

Measure 2.6 — AI system testing for adversarial conditions. Measure 2.7 — documented results of adversarial testing. Our evaluation provides both.

OWASP Top 10 Agentic

Covers the agent-specific risks that traditional OWASP doesn't address — excessive agency, insecure output handling, and adversarial prompt handling.

MITRE ATLAS

Adversarial threat landscape for AI systems. Our scenarios map to ATLAS tactics and techniques relevant to deployed agents.

EU AI Act — Article 15

Requires high-risk AI systems to demonstrate robustness against adversarial attempts. Our evaluation provides the signed behavioral evidence Article 15 demands.

How we evaluate.