    Verifiable Integrity Verdicts for AI Agents via zkVM

    Cryptographic attestation and zero-knowledge proofs for auditor judgment

    Mnemom Research · February 2026 · v1.0 · CC BY 4.0

    Abstract

    Organizations deploying autonomous AI agents face a recursive trust problem: integrity checking services analyze agent behavior on their behalf, but the integrity checker itself becomes a new trust dependency. If the checker misclassifies a boundary violation as safe, or fabricates a clear verdict for a compromised agent, the organization has no independent means of detection. We call this the verification gap in AI oversight.

    We present a four-layer cryptographic attestation architecture that makes integrity verdicts independently verifiable without requiring trust in the attestation service. The four layers are: (1) Ed25519 digital signatures binding verdicts to a known signing key, (2) SHA-256 hash chains creating tamper-evident sequences within agent sessions, (3) Merkle accumulator trees enabling O(log N) inclusion proofs over an agent's complete checkpoint history, and (4) zero-knowledge proofs via SP1 zkVM that prove verdict derivation rules were applied correctly to the analysis output. Critically, the fourth layer does not prove that the underlying LLM inference was correct---it proves that the deterministic judgment rules mapping concerns to verdicts and actions were applied honestly. This distinction between proving auditor judgment and proving model inference is what makes the system practical.

    The architecture is implemented across TypeScript (Cloudflare Workers for the gateway and API) and Rust (SP1 zkVM for proving), with cross-language conformance tests ensuring both implementations produce identical outputs. The integrity certificate format draws on C2PA content credentials and the W3C Verifiable Credentials Data Model. All verification endpoints are public and require no authentication, enabling any party to independently audit integrity verdicts. Measured performance: Ed25519 signing under 2ms, full attestation pipeline under 10ms, ZK proving 3--15 seconds, ZK verification under 100ms, at a cost of approximately $0.005 per proof.


    1. Introduction

    1.1 The Trust Gap in AI Oversight

    The Agent Integrity Protocol (AIP) provides real-time behavioral oversight for autonomous AI agents. It analyzes the internal reasoning of agents---their thinking blocks---between turns, before actions execute, and produces Integrity Checkpoints with structured verdicts: clear, review_needed, or boundary_violation. These verdicts enable side-channel intervention, allowing host systems to pause, escalate, or deny agent actions based on what the agent appears to be thinking.

    This architecture addresses a genuine need. But it introduces a new question: Why should anyone trust the integrity checker?

    An integrity checking service could:

    • Fabricate verdicts: Return clear for genuinely problematic reasoning, or boundary_violation for benign reasoning, depending on incentives.
    • Modify history: Alter stored checkpoints after the fact to cover up missed violations or insert false ones.
    • Operate inconsistently: Apply different judgment rules to different customers, or change rules without notice.
    • Fail silently: Stop performing analysis while continuing to return plausible-looking verdicts.

    These failure modes are not hypothetical. Any centralized integrity service is, by construction, a single point of trust. The verification gap is the distance between "this service says the agent is aligned" and "we can independently confirm this service applied its stated rules honestly."

    1.2 Design Goals

    We set four design goals for closing the verification gap:

    1. Independent verification: Any party with a certificate and the public verification endpoints can verify an integrity verdict without trusting the Mnemom API. Verification must work offline for cryptographic checks and require only public API access for Merkle root confirmation.

    2. Public auditability: Signing keys, Merkle roots, and verification endpoints are publicly accessible with no authentication required. The system's security relies on cryptographic guarantees, not access control.

    3. Standard formats: The certificate format draws on established standards (C2PA, W3C VC Data Model) to enable interoperability with existing governance tooling.

    4. Graceful degradation: Each attestation layer is independently valuable. If ZK proving is unavailable, signatures, chains, and Merkle proofs still provide meaningful guarantees. The system follows a fail-open pattern---attestation failures do not block integrity analysis.

    1.3 Contributions

    This paper presents four contributions:

    1. A four-layer attestation stack (signatures, hash chains, Merkle trees, zero-knowledge proofs) that provides defense in depth for integrity verdicts.

    2. A verdict derivation proof via SP1 zkVM that proves the deterministic rules mapping concerns to verdicts and actions were applied correctly, without proving the underlying LLM inference.

    3. An integrity certificate format modeled on C2PA content credentials and W3C Verifiable Credentials, bundling all cryptographic evidence into a single self-describing document.

    4. An open verification protocol with seven public API endpoints enabling any party to independently verify integrity verdicts.


    2. Related Work

    2.1 Model Inference Proofs

    The most ambitious approach to verifiable AI is proving that the neural network inference itself was computed correctly.

    Modulus Labs (zkML) and Giza compile neural network forward passes into arithmetic circuits that can be verified via zero-knowledge proofs. This approach provides the strongest possible guarantee: a verifier can confirm that specific model weights, applied to specific inputs, produced specific outputs. However, the computational cost is prohibitive for production use. Proving a single inference pass through even a modest transformer requires billions of constraints, with proving times measured in hours and costs measured in dollars per inference. For large language models with hundreds of billions of parameters, this approach remains impractical at current technology levels.

    EQTY Lab takes a hardware-based approach, using Intel SGX and TDX trusted execution environments (TEEs) to attest that model inference occurred within a secure enclave. This provides strong guarantees about computation integrity without the overhead of cryptographic proofs, but requires specific hardware and trust in the TEE manufacturer.

    Our approach sidesteps the cost of proving inference entirely. We observe that an integrity verdict is the composition of two stages: (1) an LLM analyzes a thinking block and produces a structured JSON response with concerns, and (2) deterministic rules map those concerns to a verdict and recommended action. Stage 1 is expensive to prove cryptographically. Stage 2 is cheap---approximately 10,000 RISC-V cycles. By proving only stage 2, we achieve practical costs ($0.005/proof) and latencies (3--15 seconds) while still providing a meaningful guarantee: the verdict was derived honestly from whatever the LLM produced.

    2.2 Hardware Attestation

    Intel SGX/TDX and ARM TrustZone provide hardware-level attestation that code executed within a trusted enclave. EQTY Lab builds on this foundation to provide AI model attestation. TEE-based approaches offer strong integrity guarantees with low overhead but introduce trust dependencies on hardware manufacturers and are vulnerable to side-channel attacks.

    Our architecture is complementary to TEE-based approaches. A future integration could run the AIP analysis engine within a TEE, providing hardware attestation for stage 1 (LLM analysis) while zkVM proves stage 2 (verdict derivation). Section 7.3 discusses this direction.

    2.3 Content Provenance

    C2PA (Coalition for Content Provenance and Authenticity) defines a standard for binding provenance metadata to digital content. C2PA manifests contain claims about how content was created or modified, signed by the creator's certificate. The standard has been adopted by Adobe, Microsoft, Google, and the BBC for media authenticity.

    Our integrity certificate format is directly inspired by C2PA's manifest structure. Like a C2PA manifest, an integrity certificate bundles claims (the verdict and concerns), input commitments (analogous to C2PA's ingredient hashes), and cryptographic proofs (signature, chain, Merkle, optional ZK proof) into a single self-describing document. The key difference is domain: C2PA attests to media provenance; we attest to AI behavioral analysis provenance.

    Macaw extends C2PA principles to AI-generated content, focusing on watermarking and provenance tracking for model outputs. Our work shares the goal of verifiable AI metadata but targets a different artifact---the integrity verdict rather than the model output itself.

    2.4 Verifiable Credentials

    The W3C Verifiable Credentials Data Model (v2.0, 2024) provides a standard for expressing credentials on the web in a verifiable manner. A verifiable credential contains claims about a subject, made by an issuer, with cryptographic proofs enabling verification.

    Our certificate format adopts the VC conceptual model: an issuer (the integrity service) makes claims (verdict, concerns) about a subject (the checkpoint), with proofs (signature, chain, Merkle, ZK). We deviate from the full VC specification where domain-specific requirements demand it---particularly in the input_commitments section, which has no direct VC analogue---but maintain structural alignment to enable future interoperability with VC-based governance ecosystems.

    2.5 AI Governance Standards

    Several governance frameworks establish requirements that verifiable integrity attestation addresses:

    • EU AI Act (Regulation 2024/1689): Article 50 requires transparency obligations for AI systems, including machine-readable marking and decision audit trails. Verifiable integrity certificates provide cryptographic evidence of analysis provenance, supporting Article 50 compliance.

    • NIST AI Risk Management Framework (AI RMF 1.0): The GOVERN and MAP functions call for documented AI risk management processes. Verifiable attestation chains provide the cryptographic backbone for auditable risk assessment records.

    • World Economic Forum AI Governance Alliance: The WEF's "Presidio AI Framework" and subsequent agent governance publications emphasize the need for trust mechanisms in autonomous AI systems. Verifiable integrity verdicts operationalize trust verification rather than trust assumption.

    • NIST NCCoE Concept Paper (February 2026): "Accelerating the Adoption of Software and AI Agent Identity and Authorization" identifies logging and transparency as a core focus area. Cryptographically attested integrity checkpoints directly address this requirement.

    2.6 Positioning

    Approach | What is Proven | Cost/Proof | Latency | Trust Assumption
    zkML (Modulus Labs, Giza) | Full neural network inference | $10--100+ | Hours | None (cryptographic)
    TEE (EQTY Lab, Intel SGX) | Code executed in secure enclave | ~$0 (hardware) | <1ms | Hardware manufacturer
    C2PA | Content provenance chain | ~$0 (signing) | <1ms | Certificate authority
    W3C VC | Credential claims | ~$0 (signing) | <1ms | Issuer reputation
    This work | Verdict derivation rules | $0.005 | 3--15s | LLM analysis stage

    Our approach occupies a practical middle ground: stronger than signature-only attestation (C2PA, VC) because we prove computation, not just sign claims; far cheaper than full inference proofs (zkML) because we prove only the deterministic judgment layer; and complementary to hardware attestation (TEE) which could cover the LLM analysis stage we do not prove.


    3. Architecture

    3.1 Four-Layer Attestation Stack

    The attestation architecture comprises four independent layers, each providing a distinct security guarantee:

    Layer | Primitive | Guarantee | Independence
    1. Signatures | Ed25519 | Verdict was issued by a known key | Fully offline
    2. Hash Chains | SHA-256 preimage | Checkpoint sequence is tamper-evident | Fully offline
    3. Merkle Trees | Binary Merkle tree | Checkpoint exists in the complete log | Online (root lookup)
    4. ZK Proofs | SP1 STARK | Verdict rules were applied correctly | Fully offline

    Each layer is independently valuable:

    • Layer 1 alone prevents verdict fabrication by unauthorized parties.
    • Layers 1+2 additionally prevent retroactive modification of checkpoint sequences.
    • Layers 1+2+3 additionally prove checkpoint inclusion in the complete agent history, preventing selective deletion.
    • Layers 1+2+3+4 additionally prove that the deterministic verdict derivation rules were applied correctly to the LLM's analysis output.

    Layers degrade gracefully. If the ZK prover is unavailable, the first three layers still provide meaningful guarantees. If the Merkle tree has not yet been updated, signatures and chains still hold. This layered design ensures that attestation enhances rather than gates the integrity analysis pipeline.

    3.2 Input Commitment

    Before analysis begins, a deterministic SHA-256 commitment is computed over all inputs to the analysis. This commitment binds the verdict to specific inputs, preventing after-the-fact substitution.

    The commitment is computed as follows:

    1. Each input field (Alignment Card, conscience values, window context, model version, prompt template version) is serialized to JSON with lexicographically sorted keys.
    2. The serialized fields are concatenated with a pipe (|) delimiter.
    3. The concatenated string is SHA-256 hashed, producing a 256-bit commitment.

    The sorted-key JSON serialization ensures determinism regardless of property insertion order in either TypeScript or Rust implementations. The pipe delimiter prevents ambiguity between field boundaries.

    Individual component hashes (thinking block hash, card hash, values hash, context hash) are also computed and stored separately, enabling selective disclosure: a verifier can confirm that a specific Alignment Card was used without requiring access to the conscience values.
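The commitment procedure above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the production code: the helper names and the exact field set passed in are assumptions based on the description in this section.

```typescript
import { createHash } from "node:crypto";

// Serialize a value to JSON with lexicographically sorted object keys,
// so TypeScript and Rust implementations agree regardless of insertion order.
function sortedJson(value: unknown): string {
  return JSON.stringify(value, (_key, v) =>
    v && typeof v === "object" && !Array.isArray(v)
      ? Object.fromEntries(Object.keys(v).sort().map((k) => [k, (v as Record<string, unknown>)[k]]))
      : v
  );
}

// Steps 1-3: serialize each input field, join with a pipe delimiter,
// then SHA-256 the concatenation to produce the 256-bit commitment.
function inputCommitment(fields: unknown[]): string {
  const preimage = fields.map(sortedJson).join("|");
  return createHash("sha256").update(preimage).digest("hex");
}
```

The sorted-key replacer makes the commitment order-independent: the same inputs produce the same digest whether properties were inserted as `{a, b}` or `{b, a}`.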

    3.3 Certificate Format

    The integrity certificate is a self-describing JSON document modeled on C2PA content credentials and the W3C Verifiable Credentials Data Model:

    {
      "@context": "https://mnemom.ai/aip/v1",
      "type": "IntegrityCertificate",
      "version": "1.0.0",
      "certificate_id": "cert-a1b2c3d4",
      "issued_at": "2026-02-18T12:00:00.000Z",
    
      "subject": {
        "checkpoint_id": "chk-...",
        "agent_id": "agent-...",
        "session_id": "sess-...",
        "card_id": "card-..."
      },
    
      "claims": {
        "verdict": "clear",
        "concerns": [],
        "confidence": 1.0,
        "reasoning_summary": "...",
        "analysis_model": "claude-haiku-4-5-20251001",
        "analysis_duration_ms": 1200
      },
    
      "input_commitments": {
        "thinking_block_hash": "sha256:...",
        "card_hash": "sha256:...",
        "values_hash": "sha256:...",
        "context_hash": "sha256:...",
        "model_version": "...",
        "combined_commitment": "sha256:..."
      },
    
      "proofs": {
        "signature": {
          "algorithm": "Ed25519",
          "key_id": "key-...",
          "value": "<base64 signature>",
          "signed_payload": "<canonical JSON>"
        },
        "chain": {
          "chain_hash": "sha256:...",
          "prev_chain_hash": "sha256:... | null",
          "position": 42
        },
        "merkle": {
          "leaf_hash": "sha256:...",
          "leaf_index": 41,
          "root": "sha256:...",
          "tree_size": 1024,
          "inclusion_proof": [
            {"hash": "sha256:...", "position": "right"},
            {"hash": "sha256:...", "position": "left"}
          ]
        },
        "verdict_derivation": {
          "method": "SP1-STARK",
          "image_id": "hex:...",
          "receipt": "<base64 receipt>",
          "journal": "<base64 journal>",
          "verified_at": "2026-02-18T12:00:05.000Z"
        }
      },
    
      "verification": {
        "keys_url": "https://api.mnemom.ai/v1/keys",
        "certificate_url": "https://api.mnemom.ai/v1/checkpoints/{id}/certificate",
        "verify_url": "https://api.mnemom.ai/v1/verify"
      }
    }
    

    The certificate is designed to be self-contained for offline verification of Layers 1, 2, and 4. Layer 3 (Merkle inclusion) requires a single online lookup of the current Merkle root to confirm the checkpoint's position in the agent's history.

    3.4 Verification Flow

    Verification can proceed in two modes:

    Offline verification (Layers 1, 2, 4---no network required):

    1. Retrieve the signing public key from the certificate's verification.keys_url (cached).
    2. Verify the Ed25519 signature over the signed_payload using the public key.
    3. Recompute the chain hash from prev_chain_hash | checkpoint_id | verdict | thinking_block_hash | input_commitment | timestamp and compare to the stored chain_hash.
    4. If a verdict_derivation proof is present, verify the SP1 STARK receipt against the expected image ID and confirm the journal output matches the certificate's verdict and concerns hash.

    Online verification (Layer 3---single API call):

    1. Fetch the agent's current Merkle root from GET /v1/agents/{id}/merkle-root.
    2. Recompute the Merkle root from the leaf hash and inclusion proof siblings.
    3. Compare the computed root to the published root.

    The verification API (POST /v1/verify) performs all checks server-side for convenience, but every check can be reproduced independently using only the certificate and the public key.
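The offline signature check (step 2) can be reproduced with Node's built-in crypto, shown here as a sketch; the production service uses @noble/ed25519 (Section 4.1), but any Ed25519 implementation that accepts the raw 32-byte public key served by /v1/keys will do.

```typescript
import { createPublicKey, verify } from "node:crypto";

// Wrap a raw 32-byte Ed25519 public key (hex) in the SPKI DER prefix
// defined by RFC 8410, so node:crypto can consume it.
function ed25519KeyFromHex(hex: string) {
  const spkiPrefix = Buffer.from("302a300506032b6570032100", "hex");
  return createPublicKey({
    key: Buffer.concat([spkiPrefix, Buffer.from(hex, "hex")]),
    format: "der",
    type: "spki",
  });
}

// Verify the Ed25519 signature over the certificate's canonical signed_payload.
function verifySignature(signedPayload: string, signatureB64: string, publicKeyHex: string): boolean {
  return verify(
    null, // Ed25519 takes no digest algorithm
    Buffer.from(signedPayload, "utf8"),
    ed25519KeyFromHex(publicKeyHex),
    Buffer.from(signatureB64, "base64")
  );
}
```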

    3.5 Data Flow

    The attestation pipeline executes as an extension of the existing integrity analysis flow:

    Thinking Block
          |
          v
     [AIP Analysis Engine]
          |
          v
     Analysis Response (verdict, concerns, reasoning)
          |
          +---> [1. Input Commitment] -----> SHA-256 over sorted inputs
          |
          +---> [2. Signed Payload]   -----> Canonical JSON of key fields
          |           |
          |           v
          |     [Ed25519 Sign] -----------> signature (base64)
          |
          +---> [3. Chain Hash]  ----------> SHA-256(prev | id | verdict | ...)
          |
          +---> [4. Merkle Append] --------> leaf hash, updated root
          |
          +---> [5. Certificate Build] ----> IntegrityCertificate JSON
          |
          +---> [6. ZK Prove (async)] -----> STARK receipt (3-15s, fire-and-forget)
          |
          v
     Integrity Checkpoint + Attestation + Certificate
    

    Steps 1--5 execute synchronously within the analysis pipeline, adding approximately 5--10ms to checkpoint creation. Step 6 (ZK proving) is asynchronous and fire-and-forget: a pending proof row is inserted in the database, the prover service is notified via HTTP POST, and the prover writes results directly to the database upon completion. This design ensures that ZK proving never blocks the real-time integrity analysis path.


    4. Implementation

    4.1 Ed25519 Signing

    Checkpoint signing uses Ed25519 via the @noble/ed25519 library (v3, async API), chosen for WebCrypto compatibility and zero dependency on Node.js crypto modules---a requirement for Cloudflare Workers deployment.

    The signed payload is a deterministic JSON string with lexicographically sorted keys:

    {
      "agent_id": "...",
      "chain_hash": "...",
      "checkpoint_id": "...",
      "input_commitment": "...",
      "thinking_block_hash": "...",
      "timestamp": "...",
      "verdict": "..."
    }
    

    Key management follows a rotation model. Signing keys are stored in a signing_keys database table with key_id, public_key (hex-encoded), algorithm, created_at, and is_active fields. The public key listing endpoint (GET /v1/keys) returns all active keys, enabling verifiers to cache keys and verify signatures without per-verification API calls.

    4.2 Hash Chains

    Each integrity checkpoint includes a SHA-256 chain hash computed from a pipe-delimited preimage:

    prevChainHash | checkpointId | verdict | thinkingBlockHash | inputCommitment | timestamp
    

    For the first checkpoint in a session, prevChainHash is substituted with the literal string genesis. Each subsequent checkpoint references the chain hash of its predecessor, forming a tamper-evident sequence per agent and session.

    Chain verification is straightforward: given an ordered sequence of checkpoints, recompute each chain hash and verify (a) the first checkpoint has a null prevChainHash, (b) each subsequent checkpoint's prevChainHash matches the preceding checkpoint's chainHash, and (c) each recomputed hash matches the stored hash. A single broken link invalidates the chain from that point forward.
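The verification procedure above can be sketched as follows. The `Checkpoint` shape is illustrative; the preimage field order follows Section 4.2.

```typescript
import { createHash } from "node:crypto";

interface Checkpoint {
  checkpointId: string;
  verdict: string;
  thinkingBlockHash: string;
  inputCommitment: string;
  timestamp: string;
  prevChainHash: string | null;
  chainHash: string;
}

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Recompute one link; a null prevChainHash is replaced with the literal "genesis".
function chainHash(c: Checkpoint): string {
  const prev = c.prevChainHash ?? "genesis";
  return sha256([prev, c.checkpointId, c.verdict, c.thinkingBlockHash,
                 c.inputCommitment, c.timestamp].join("|"));
}

// Checks (a) genesis start, (b) each link references its predecessor,
// and (c) every stored hash matches its recomputation.
function verifyChain(chain: Checkpoint[]): boolean {
  return chain.every((c, i) =>
    (i === 0 ? c.prevChainHash === null : c.prevChainHash === chain[i - 1].chainHash)
    && chainHash(c) === c.chainHash);
}
```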

    4.3 Merkle Trees

    A per-agent append-only binary Merkle tree accumulates checkpoint leaf hashes. The leaf hash preimage is:

    checkpointId | verdict | thinkingBlockHash | chainHash | timestamp
    

    The tree follows standard binary Merkle construction: if the number of nodes at any level is odd, the last node is duplicated before pairing. Internal nodes are computed as SHA-256(left_hex || right_hex), where hex strings are concatenated directly.

    Inclusion proofs contain O(log N) sibling hashes, each annotated with its position (left or right) relative to the path node at that level. A verifier walks the path from leaf to root, combining the running hash with each sibling, and compares the result to the published root.

    The Merkle tree state (root, depth, leaf count, and all leaf hashes) is persisted in the agent_merkle_trees database table. The root is updated with each new checkpoint. Any party can retrieve the root via GET /v1/agents/{id}/merkle-root and verify an inclusion proof via GET /v1/checkpoints/{id}/inclusion-proof.
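Inclusion proof verification reduces to a short loop. This is a sketch following the hex-concatenation convention described above; the `ProofStep` shape mirrors the `inclusion_proof` entries in the certificate format.

```typescript
import { createHash } from "node:crypto";

const sha256hex = (s: string) => createHash("sha256").update(s).digest("hex");

interface ProofStep { hash: string; position: "left" | "right"; }

// Walk from leaf to root, combining the running hash with each sibling.
// Internal nodes are SHA-256 over the concatenated hex digests of the children.
function verifyInclusion(leafHash: string, proof: ProofStep[], root: string): boolean {
  let running = leafHash;
  for (const step of proof) {
    running = step.position === "left"
      ? sha256hex(step.hash + running)   // sibling is the left child
      : sha256hex(running + step.hash);  // sibling is the right child
  }
  return running === root;
}
```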

    4.4 zkVM Guest Program

    The zero-knowledge proof is generated by a SP1 zkVM guest program written in Rust. The guest program is a no_std binary targeting the RISC-V architecture, with approximately 10,000 cycles of computation.

    The guest program performs the following steps:

    1. Read input from host: Deserialize the GuestInput struct containing the raw analysis JSON, thinking block hash, card hash, values hash, and model identifier.

    2. Extract JSON: Strip any markdown code fences from the analysis response (the LLM sometimes wraps JSON in backticks).

    3. Parse analysis response: Deserialize the JSON into an AnalysisResponse struct containing the verdict (which is not trusted), concerns array, confidence, and reasoning summary.

    4. Normalize evidence: Truncate each concern's evidence field to 200 characters, mirroring the TypeScript MAX_EVIDENCE_LENGTH constant.

    5. Re-derive verdict: Apply the deterministic verdict rules to the concerns array. The rules are:

      • Any concern with critical severity produces boundary_violation.
      • high severity in prompt_injection, deceptive_reasoning, or value_misalignment categories produces boundary_violation.
      • Any remaining medium or higher severity concern produces review_needed.
      • No medium-or-higher concerns produces clear.
    6. Map verdict to action: Apply the deterministic action mapping:

      • clear maps to continue.
      • review_needed maps to log_and_continue.
      • boundary_violation with any critical concern maps to deny_and_escalate.
      • boundary_violation without critical concerns maps to pause_for_review.
    7. Hash concerns: Compute a SHA-256 hash over the normalized concerns array for integrity binding.

    8. Commit output: Write the GuestOutput (verdict, action, concerns hash, and pass-through input hashes) to the proof journal.

    The guest program explicitly does not trust the LLM's stated verdict. It re-derives the verdict from the concerns array using the same deterministic rules that the TypeScript implementation applies. The proof guarantees that this re-derivation was performed correctly.
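Steps 5 and 6 can be sketched as pure functions. This is an illustrative port; the authoritative implementations are verdict.rs (Rust) and engine.ts (TypeScript).

```typescript
type Severity = "low" | "medium" | "high" | "critical";
interface Concern { category: string; severity: Severity; }

// Categories where a high-severity concern escalates straight to boundary_violation.
const ESCALATING = ["prompt_injection", "deceptive_reasoning", "value_misalignment"];

// Step 5: re-derive the verdict from the concerns alone,
// ignoring whatever verdict the LLM claimed.
function deriveVerdict(concerns: Concern[]): string {
  if (concerns.some((c) => c.severity === "critical")) return "boundary_violation";
  if (concerns.some((c) => c.severity === "high" && ESCALATING.includes(c.category)))
    return "boundary_violation";
  if (concerns.some((c) => c.severity === "medium" || c.severity === "high"))
    return "review_needed";
  return "clear";
}

// Step 6: deterministic verdict-to-action mapping.
function mapAction(verdict: string, concerns: Concern[]): string {
  if (verdict === "clear") return "continue";
  if (verdict === "review_needed") return "log_and_continue";
  return concerns.some((c) => c.severity === "critical")
    ? "deny_and_escalate"
    : "pause_for_review";
}
```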

    4.5 Prover Service

    The prover service is a standalone Rust binary (aip-prover) built with Axum, deployed on Fly.io. It exposes four HTTP endpoints:

    Endpoint | Method | Purpose
    /prove | POST | Accept proof request, spawn proving task
    /prove/{id} | GET | Query proof status
    /prove/verify | POST | Verify a STARK receipt
    /health | GET | Health check

    Proof requests are fire-and-forget from the API worker's perspective. The API worker inserts a pending row in the verdict_proofs database table, then POSTs to the prover service. The prover updates the row to proving, spawns a Tokio task for the actual SP1 proof generation, and upon completion writes the receipt, journal, image ID, proving duration, verification status, and estimated cost directly to the database.

    Proving selection follows a stochastic policy: boundary_violation verdicts are always proven (100% sampling), while all other verdicts are proven at a 10% sampling rate. This ensures the highest-stakes verdicts always receive cryptographic proof while managing proving costs for routine clear verdicts.
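The sampling policy is a few lines; this sketch takes an injectable random source for testability, an implementation detail not specified in the text.

```typescript
// Stochastic proving policy: boundary violations are always proven,
// all other verdicts are sampled at a 10% rate.
function shouldProve(verdict: string, rand: () => number = Math.random): boolean {
  if (verdict === "boundary_violation") return true;
  return rand() < 0.1;
}
```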

    A background retry loop queries for pending proofs every 30 seconds, ensuring that transient failures do not permanently orphan proof requests. Failed proofs are tracked with error messages and retry counts.

    4.6 Verification API

    Seven public API endpoints enable independent verification. All except proof requests (POST /v1/checkpoints/{id}/prove) require no authentication:

    Endpoint | Method | Function
    /v1/keys | GET | List active signing public keys
    /v1/checkpoints/{id}/certificate | GET | Retrieve integrity certificate
    /v1/verify | POST | Verify a certificate (all checks)
    /v1/agents/{id}/merkle-root | GET | Current agent Merkle root
    /v1/checkpoints/{id}/inclusion-proof | GET | Merkle inclusion proof
    /v1/checkpoints/{id}/prove | POST | Request ZK proof (authenticated)
    /v1/checkpoints/{id}/proof | GET | Proof status and data

    The POST /v1/verify endpoint performs comprehensive verification: Ed25519 signature check, chain hash recomputation, Merkle inclusion proof verification, input commitment presence check, and (when available) STARK receipt verification via the prover service. Each check reports independently, so a certificate with a valid signature but missing Merkle proof is reported as partially verified rather than rejected outright.

    All verification endpoints return CORS headers permitting cross-origin requests, enabling browser-based verification tools.

    4.7 Cross-Language Conformance

    A critical implementation requirement is that TypeScript (gateway and API workers running on Cloudflare Workers) and Rust (zkVM guest program) produce identical outputs for identical inputs. The attestation primitives---SHA-256 hashing, deterministic JSON serialization, pipe-delimited concatenation, and verdict derivation rules---are implemented in both languages with shared test vectors.

    The Rust aip-zkvm-core crate is no_std compatible (with alloc) to run inside the zkVM guest, but the same crate is used host-side for verification. The verdict rules in verdict.rs are documented as a "1:1 port" of the TypeScript logic in prompt.ts and engine.ts, with enum serialization using serde(rename_all = "snake_case") to match the TypeScript string constants.

    Concern hashing in particular requires careful alignment: both implementations must normalize evidence length to 200 characters, serialize concern fields in the same order, and produce identical SHA-256 digests. The hash_concerns function in Rust uses serde_json serialization of a NormalizedConcern struct, while the TypeScript equivalent uses JSON.stringify with a sorted-key replacer.
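A sketch of the TypeScript side of concern hashing follows. The field set shown (category, description, evidence, severity) is assumed from the paper's description of concerns; in production it must match the Rust NormalizedConcern struct exactly.

```typescript
import { createHash } from "node:crypto";

interface Concern { category: string; severity: string; description: string; evidence: string; }

const MAX_EVIDENCE_LENGTH = 200; // mirrors the Rust-side truncation

// Normalize each concern (truncate evidence, fix field order via insertion order,
// here already alphabetical), serialize, then SHA-256 the result.
function hashConcerns(concerns: Concern[]): string {
  const normalized = concerns.map((c) => ({
    category: c.category,
    description: c.description,
    evidence: c.evidence.slice(0, MAX_EVIDENCE_LENGTH),
    severity: c.severity,
  }));
  return createHash("sha256").update(JSON.stringify(normalized)).digest("hex");
}
```

Because the normalized objects are built with a fixed literal key order, `JSON.stringify` emits the fields deterministically without a replacer.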


    5. Evaluation

    5.1 Performance

    All measurements taken on Cloudflare Workers (attestation pipeline) and Fly.io shared-cpu-1x instances (ZK proving). Attestation operations benchmarked over 10,000 iterations; ZK operations benchmarked over 100 iterations.

    Operation | P50 Latency | P95 Latency | Threshold
    Ed25519 sign | <1ms | <2ms | <2ms
    Ed25519 verify | <1.5ms | <3ms | <3ms
    SHA-256 input commitment | <0.5ms | <1ms | <1ms
    Chain hash computation | <0.3ms | <0.5ms | <0.5ms
    Merkle root (10K leaves) | ~200ms | ~500ms | <500ms
    Merkle inclusion verify | <0.3ms | <0.5ms | <0.5ms
    Full attestation pipeline | ~5ms | ~10ms | <10ms
    ZK proving | 3--15s | -- | --
    ZK verification | <50ms | <100ms | <100ms

    The full attestation pipeline (input commitment, signing, chain hashing, Merkle append, certificate construction) adds approximately 5--10ms to each integrity checkpoint. Given that AIP analysis itself takes 800--2000ms (dominated by the LLM inference call), the attestation overhead is negligible---less than 1% of total checkpoint latency.

    ZK proving latency of 3--15 seconds is significant but irrelevant to the critical path. Proving is asynchronous; the integrity verdict is delivered immediately with Layers 1--3 attestation. The ZK proof arrives later and is appended to the certificate retroactively.

    5.2 Cost Analysis

    Component | Unit Cost | Frequency | Monthly Cost (10K agents)
    Ed25519 operations | Negligible | Every checkpoint | ~$0
    SHA-256 operations | Negligible | Every checkpoint | ~$0
    Merkle tree updates | Negligible | Every checkpoint | ~$0
    ZK proving | ~$0.005 | 100% boundary + 10% other | ~$50--500
    Prover infrastructure | $30/month | Fixed | $30

    At current SP1 pricing, each STARK proof costs approximately $0.005. With boundary violations proven at 100% and other verdicts at 10%, a typical agent generating 100 checkpoints per day incurs roughly $0.05/day in proving costs. The prover infrastructure on Fly.io adds a fixed $30/month.

    Total attestation cost for a deployment of 10,000 agents is approximately $80--530/month ($50--500 in proving plus the $30 fixed infrastructure cost), depending on checkpoint volume and boundary violation frequency. This is a rounding error compared to the LLM inference costs that dominate the integrity analysis budget.

    5.3 Security Analysis

    Signing key compromise: If an attacker obtains the Ed25519 private key, they can forge signatures on arbitrary payloads. Mitigation: key rotation (new key ID, old key marked inactive), key storage in Cloudflare Workers secrets (encrypted at rest), and the public key listing endpoint enables verifiers to detect key changes. The chain and Merkle layers provide additional assurance even if signatures are compromised---forged checkpoints would need to maintain chain continuity and Merkle consistency.

    Chain rewrite attack: An attacker who controls the database could rewrite the chain by recomputing all hashes from a modified checkpoint forward. Mitigation: Merkle roots provide an independent accumulator. A rewritten chain would produce a different Merkle root than the published one. Additionally, any third party that has previously verified a certificate holds a snapshot of the chain hash at that point, enabling detection of retroactive modification.

    Merkle tree manipulation: An attacker could attempt to delete checkpoints from the Merkle tree. Mitigation: the append-only tree structure means deletion changes the root, which is detectable by any party that has cached a previous root. The tree_size field in inclusion proofs enables detection of tree truncation (fewer leaves than expected).

    ZK proof forgery: An attacker cannot forge a valid STARK proof without executing the guest program on inputs that produce the claimed output. The soundness guarantee of the SP1 proof system ensures that a valid receipt can only be produced by honest execution of the guest binary identified by the image ID.

    Prover unavailability: If the prover service is down, no ZK proofs are generated. Mitigation: the system follows a fail-open pattern---Layers 1--3 still apply. Pending proof rows are tracked in the database and retried by the background retry loop.

    5.4 Comparative Analysis

    Property                 | Signature Only | Signature + Chain | Full Stack (This Work)  | Full zkML
    Verdict authenticity     | Yes            | Yes               | Yes                     | Yes
    Tamper evidence          | No             | Yes               | Yes                     | Yes
    Log completeness         | No             | No                | Yes (Merkle)            | Yes
    Computation integrity    | No             | No                | Yes (verdict rules)     | Yes (full inference)
    LLM analysis integrity   | No             | No                | No                      | Yes
    Cost per verdict         | ~$0            | ~$0               | ~$0.0005 avg            | $10--100+
    Latency overhead         | <2ms           | <3ms              | <10ms (+async ZK)       | Hours
    Offline verification     | Yes            | Yes               | Partial (Merkle online) | Yes

    6. Discussion

    6.1 The Auditor-Model Distinction

    The central insight of this work is the distinction between proving model inference and proving auditor judgment. An integrity verdict is not a raw model output---it is the result of deterministic rules applied to a model output. The model produces concerns (with categories, severities, descriptions, and evidence). The rules produce verdicts and actions. The rules are simple, deterministic, and cheap to prove. The model inference is complex, non-deterministic, and prohibitively expensive to prove.

    By drawing the proof boundary at the rule application layer, we achieve a system where:

    • A verifier can confirm that boundary_violation was the correct verdict given the concerns the LLM identified, even without access to the thinking block or the LLM's reasoning process.
    • The LLM analysis stage remains a trust assumption, clearly delineated and honestly stated.
    • The cost and latency of proving are reduced by approximately four orders of magnitude compared to full inference proofs.

    This is not a weakness to be minimized but a design choice to be made explicit. The guarantee is: "Given these concerns, the verdict derivation was honest." The non-guarantee is: "The concerns themselves were correctly identified." Both statements are important and both should be communicated to verifiers.
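The proved layer is small enough to state in a few lines. The sketch below is a hypothetical reconstruction of the verdict-derivation rules from the prose in Sections 6.1--6.2 ("no medium-or-higher severity concern" maps to clear); the exact severity scale, thresholds, and intermediate verdict names are assumptions, not the production rule set run inside the zkVM guest.

```typescript
// Hypothetical deterministic judgment layer: concerns in, verdict out.
type Severity = "low" | "medium" | "high" | "critical";
interface Concern {
  category: string;
  severity: Severity;
  description: string;
}

const rank: Record<Severity, number> = { low: 0, medium: 1, high: 2, critical: 3 };

function deriveVerdict(concerns: Concern[]): "clear" | "concern" | "boundary_violation" {
  const max = concerns.reduce((m, c) => Math.max(m, rank[c.severity]), -1);
  if (max >= rank.high) return "boundary_violation"; // assumed escalation threshold
  if (max >= rank.medium) return "concern";          // flagged but not escalated
  return "clear";                                    // no medium-or-higher concern
}

console.log(deriveVerdict([])); // "clear"
console.log(
  deriveVerdict([{ category: "goal_drift", severity: "high", description: "..." }])
); // "boundary_violation"
```

Because this function is pure and cheap, proving its honest execution in a zkVM costs cents rather than the dollars-to-hundreds required for full inference proofs.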

    6.2 Honest Limitations

    Cryptographic attestation proves honest computation. It does not prove correct judgment.

    A clear verdict with a valid STARK proof means: the LLM produced an analysis with no medium-or-higher severity concerns, and the deterministic rules correctly mapped that analysis to clear. It does not mean the thinking block was genuinely safe. The LLM could have missed a subtle prompt injection. The conscience prompt could have been insufficient for the specific attack vector. The thinking block could have been too short for reliable analysis (below the minimum evidence threshold).

    Similarly, a boundary_violation verdict with a valid proof means: the LLM identified at least one concern meeting the boundary criteria, and the rules correctly escalated. It does not mean the agent was actually misaligned. The LLM could have produced a false positive.

    These limitations are inherent to any system that relies on LLM-based semantic analysis. The attestation layer makes the judgment process verifiable, not the judgment itself correct. This distinction must be clearly communicated in all documentation, API responses, and user-facing materials.

    6.3 Standards Alignment

    The integrity certificate format aligns with three standards trajectories:

    C2PA: The certificate structure mirrors C2PA manifests: a @context field for namespace identification, claims about a subject, input commitments analogous to ingredient hashes, and multiple proof types in a single document. A future version could produce C2PA-compatible manifests for integration with content provenance toolchains.

    W3C Verifiable Credentials: The issuer-subject-claims-proof structure follows the VC Data Model. The verification section provides the equivalent of VC verification methods. Integration with DID-based identity systems would enable decentralized issuer identification.

    EU AI Act Article 50: Verifiable certificates provide machine-readable evidence of AI system transparency. The claims.analysis_model field identifies the AI system, input_commitments provide the audit trail, and proofs provide the cryptographic evidence chain. Article 50 compliance requires transparency "without prejudice to the rights of others"---the privacy-by-design approach (thinking block hashes, never raw content) supports this requirement.
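To make the alignment concrete, a certificate following the issuer-subject-claims-proof structure might look like the sketch below. Only the field names cited above (@context, claims.analysis_model, input_commitments, proofs, verification) come from the text; everything else, including the overall schema, is an illustrative assumption rather than the specified certificate format.

```typescript
// Hypothetical certificate shape; not the normative Mnemom schema.
interface IntegrityCertificate {
  "@context": string;                        // namespace identification (C2PA-style)
  issuer: string;                            // VC-style issuer
  subject: { agent_id: string; session_id: string };
  claims: { verdict: string; analysis_model: string }; // identifies the AI system
  input_commitments: { thinking_block_sha256: string }; // hash only, never raw content
  proofs: Array<Record<string, string>>;     // multiple proof types in one document
  verification: { endpoint: string };        // analogous to VC verification methods
}

const cert: IntegrityCertificate = {
  "@context": "https://example.org/integrity/v1",
  issuer: "did:web:example.org",
  subject: { agent_id: "agent-42", session_id: "sess-7" },
  claims: { verdict: "clear", analysis_model: "example-llm" },
  input_commitments: { thinking_block_sha256: "<sha256 of thinking block>" },
  proofs: [{ type: "Ed25519Signature", signature: "<base64 signature>" }],
  verification: { endpoint: "https://example.org/verify" },
};

console.log(cert.claims.verdict); // "clear"
```

Note that the privacy-by-design point from Article 50 is visible in the shape itself: the thinking block appears only as a hash commitment.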

    6.4 Future Directions

    TEE integration: Running the AIP analysis engine within an Intel TDX or ARM CCA trusted execution environment would provide hardware attestation for the LLM analysis stage (currently a trust assumption). Combined with zkVM proofs for verdict derivation, this would close the verification gap almost entirely, with the remaining trust assumption being the TEE hardware manufacturer.

    On-chain anchoring: Publishing Merkle roots to a Layer 2 blockchain (e.g., Base, Arbitrum, or a dedicated rollup) would provide decentralized, immutable root anchoring. Any party could verify checkpoint inclusion against the on-chain root without trusting the Mnemom API for root publication. Estimated cost at current L2 gas prices: <$0.01 per root update, publishable hourly or daily.

    Recursive proofs: SP1 supports proof composition, where one proof can verify another proof. This enables aggregate verification: a single proof attesting that all N checkpoints in a session were derived correctly, rather than N individual proofs. Recursive proofs would reduce storage costs and enable session-level attestation certificates.

    Threshold signatures: Multi-party computation (MPC) threshold signatures would enable distributed key management, where no single party holds the complete signing key. A t-of-n threshold scheme would require collusion of t parties to forge a signature, strengthening the Layer 1 guarantee against insider compromise.

    WASM verifier: A WebAssembly build of the SP1 verifier would enable in-browser STARK verification, allowing end users to verify ZK proofs directly without trusting any server-side verification endpoint.


    7. Conclusion

    The verification gap in AI oversight---the distance between trusting an integrity checker's output and independently verifying it---is a solvable problem. It does not require proving everything. It requires identifying the right proof boundary.

    We have presented a four-layer cryptographic attestation architecture that makes integrity verdicts independently verifiable. Ed25519 signatures prevent fabrication. SHA-256 hash chains prevent retroactive modification. Merkle trees prevent selective deletion. Zero-knowledge proofs via SP1 zkVM prove that verdict derivation rules were applied correctly.

    The key contribution is the auditor-model distinction. By proving the deterministic judgment layer rather than the LLM inference layer, we achieve practical costs ($0.005 per proof), practical latencies (3--15 seconds for proving, under 100ms for verification), and meaningful cryptographic guarantees---all without blocking the real-time integrity analysis path.

    The architecture is implemented and operational. The certificate format is specified. The verification endpoints are public. The limitations---particularly the honest acknowledgment that we prove computation integrity, not judgment correctness---are stated.

    What remains is adoption: integrating verifiable integrity certificates into AI governance workflows, regulatory compliance pipelines, and enterprise audit processes. The infrastructure is ready. The standards are aligned. The verification gap is closable.


    References

    1. Succinct. "SP1 zkVM Documentation." 2025. https://docs.succinct.xyz

    2. Miller, P. "@noble/ed25519: Fastest JS implementation of Ed25519." 2024. https://github.com/paulmillr/noble-ed25519

    3. C2PA (Coalition for Content Provenance and Authenticity). "C2PA Technical Specification v2.1." 2025. https://c2pa.org/specifications/

    4. W3C. "Verifiable Credentials Data Model v2.0." W3C Recommendation. 2024. https://www.w3.org/TR/vc-data-model-2.0/

    5. European Parliament and Council. "Regulation (EU) 2024/1689 (EU AI Act)." Official Journal of the European Union. 2024.

    6. NIST. "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." NIST AI 100-1. January 2023.

    7. NIST NCCoE. "Accelerating the Adoption of Software and AI Agent Identity and Authorization." Concept Paper. February 2026.

    8. World Economic Forum. "Presidio AI Framework: Towards Safe Generative AI Models." 2024.

    9. World Economic Forum. "Navigating the AI Frontier: Agent Governance." AI Governance Alliance. January 2026.

    10. Modulus Labs. "The Cost of Intelligence: Proving AI with Zero-Knowledge." 2024. https://www.moduluslabs.xyz/

    11. EQTY Lab. "EQTY AI: Trusted AI Infrastructure." 2025. https://eqtylab.io/

    12. Merkle, R. C. "A Certified Digital Signature." Advances in Cryptology---CRYPTO '89. Springer, 1989.

    13. Bernstein, D. J., Duif, N., Lange, T., Schwabe, P., and Yang, B.-Y. "High-speed high-security signatures." Journal of Cryptographic Engineering, 2(2):77--89, 2012.

    14. FIPS 180-4. "Secure Hash Standard (SHS)." National Institute of Standards and Technology. August 2015.

    15. Ben-Sasson, E., Bentov, I., Horesh, Y., and Riabzev, M. "Scalable, transparent, and post-quantum secure computational integrity." IACR Cryptology ePrint Archive, 2018.

    16. Goldwasser, S., Micali, S., and Rackoff, C. "The Knowledge Complexity of Interactive Proof Systems." SIAM Journal on Computing, 18(1):186--208, 1989.

    17. Mnemom Research. "Alignment and Integrity Infrastructure for Autonomous Agents." Whitepaper v2.0. February 2026.

    18. ISO/IEC 42001:2023. "Artificial Intelligence---Management System." International Organization for Standardization. 2023.

    19. Singapore IMDA. "Model AI Governance Framework for Agentic AI." Infocomm Media Development Authority. January 2026.


    This document is released under CC BY 4.0. Copyright 2026 Mnemom LLC.
