Guide · updated 2026-05-18

Inference logging vs. observability. What auditors need.

LLM observability tools — LangSmith, Datadog LLM Observability, Arize, Helicone, Langfuse — answer engineering questions. AI audit trails answer regulator questions. The two stacks overlap, but only one of them survives a five-year subpoena.

TL;DR

Short version.

LLM observability platforms are debugging tools for the team that owns the model. They capture prompt traces, latency, token cost, retrieval steps, and evaluation scores so engineers can iterate on prompts, agents, and RAG pipelines. They are excellent at this.

AI audit-trail systems are evidence stores for parties the operator does not control — regulators, courts, customers, counterparties. They capture a tamper-evident record of what the system did, signed in a way a third party can verify without trusting the operator. They are not interchangeable with observability.

The right answer for most production AI stacks in 2026 is to run both. Observability for the engineers; cryptographic logging for the auditors. Same traces, two surfaces. This guide explains why, what each layer actually does, and how to bolt them together without duplicating instrumentation.

Definition

What LLM observability does.

LLM observability is the operations-side telemetry surface for an AI application. The canonical vendors at the time of writing are LangSmith (LangChain's commercial product), Datadog LLM Observability, Arize Phoenix, Helicone, and Langfuse. They differ in framing and integrations, but the feature surface is broadly consistent:

  • Prompt and span tracing. Capture the inputs, outputs, intermediate tool calls, retrieval steps, and timing of every LLM invocation. View as a flame graph or tree.
  • Cost and latency metrics. Per-call token counts, model-by-model spend, p50/p95/p99 latency. Slice by user, environment, feature flag.
  • Evaluation harnesses. Score outputs against rubrics (LLM-as-judge, exact match, semantic similarity). Compare versions of a prompt or chain.
  • Dataset curation. Promote failing examples into regression suites. Replay traces with edits to debug.
  • Alerting. Page on cost spikes, error rates, quality regressions.
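As a rough illustration of the latency metrics listed above, the percentile figures can be computed from raw per-call timings with nothing but the Python standard library. The sample values are made up, and real platforms typically compute these over streaming windows rather than a fixed list.

```python
import statistics

# Hypothetical per-call latencies in milliseconds (illustrative values only).
latencies_ms = [120, 135, 150, 160, 180, 210, 240, 300, 450, 900]

# quantiles(..., n=100) returns the 99 cut points between percentiles,
# so index 49 is p50, index 94 is p95, and index 98 is p99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```

The long tail (one 900 ms call) barely moves p50 but dominates p99, which is why these dashboards show all three.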

These are debugging and operations capabilities. The audience is the team that owns the model. The retention horizon is days to weeks for hot data, sometimes longer for evaluation datasets. The trust boundary is internal: engineers trust the platform because they pay for it.

None of these systems are designed to produce evidence for an outside party who does not trust the operator. That is not a criticism; it is not what they are for.

Audit grade

What an auditor actually needs.

An AI auditor — internal, external, regulatory — is asking a very different question than an engineer. The engineer asks “why did this trace fail and how do I fix it?” The auditor asks “can you prove this trace happened, in this order, with this model, and that nobody has altered the record since?”

Five properties separate the two. An audit-grade log has all of them; an observability log has, at most, one or two.

  • Tamper-evidence. Any insertion, deletion, or reordering of an event must be cryptographically detectable. A row in Postgres with an UPDATE permission is not tamper-evident. A rolling-hash event chain anchored to a public ledger is.
  • Third-party verifiability. An outside party must be able to verify the record without trusting the operator or the vendor. “Our SOC 2 says we don't modify logs” is not verifiability; it is assertion.
  • Retention discipline. EU AI Act Article 12 requires high-risk systems to support automatic event logging over the “lifetime of the system.” That horizon routinely outlives a SaaS contract. Audit storage must survive vendor changes, company shutdowns, and migrations.
  • Model-state binding. The log must bind each decision to the exact model, tokenizer, system prompt, and training-data state that produced it. Otherwise “the model changed since then” is an unfalsifiable defense.
  • Selective disclosure. The auditor may need to see one decision in isolation without exposing the other 100,000 traces. Merkle inclusion proofs do this; SQL queries against a SaaS dashboard do not.
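To make the tamper-evidence property concrete, here is a minimal sketch of a rolling-hash event chain. It assumes SHA-256 over a canonical JSON encoding and an all-zero genesis value; the function names and event fields are hypothetical, not any vendor's format. Only the final head hash would need to be anchored externally.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous link together with a canonical encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(events: list[dict]) -> list[str]:
    """Return the rolling hash after each event, in order."""
    hashes, prev = [], GENESIS
    for event in events:
        prev = chain_hash(prev, event)
        hashes.append(prev)
    return hashes


def verify_chain(events: list[dict], expected_head: str) -> bool:
    """Recompute the chain and compare its head to the anchored value."""
    hashes = build_chain(events)
    head = hashes[-1] if hashes else GENESIS
    return head == expected_head


events = [
    {"seq": 0, "type": "span_start", "trace_id": "t-1"},
    {"seq": 1, "type": "tool_call", "name": "search"},
    {"seq": 2, "type": "completion", "tokens": 128},
]
head = build_chain(events)[-1]  # the value that would be anchored externally

print(verify_chain(events, head))        # unmodified log
print(verify_chain(events[:-1], head))   # last event deleted
```

Because each hash covers the one before it, editing, dropping, or reordering any event changes the head, and the mismatch against the anchored value is what an outside verifier checks.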

See the audit-trail pillar for the long-form treatment of why these five properties are the floor, not a wishlist.
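The selective-disclosure property can be sketched with a small Merkle tree. This is an illustrative construction under stated assumptions (SHA-256, a one-byte prefix to separate leaves from interior nodes, and an odd node paired with a copy of itself); it is not a description of any specific product's bundle format, and all names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(event: bytes) -> bytes:
    return _h(b"\x00" + event)  # prefix separates leaves from interior nodes


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """All tree levels, leaves first. An odd node is paired with itself."""
    levels = [[leaf_hash(x) for x in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root, each flagged if the sibling sits on the left."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(event: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Rebuild the path to the root using only the event and its sibling hashes."""
    acc = leaf_hash(event)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


events = [f"event-{i}".encode() for i in range(5)]
levels = build_levels(events)
root = levels[-1][0]                # the value that would be anchored
proof = inclusion_proof(levels, 2)  # disclose event 2 only

print(verify_inclusion(events[2], proof, root))
```

The auditor receives one event, a handful of sibling hashes, and the anchored root. The other events stay private, since the proof exposes only their hashes.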

Comparison

Where they overlap, where they diverge.

Both stacks instrument the same thing — the inference call, its inputs, outputs, and intermediate steps. The divergence is what happens to those events after capture. Side-by-side:

 
| | LLM observability (LangSmith, Datadog, Arize, Helicone, Langfuse) | AI audit trail (Cryptographic anchoring · Orynq) |
|---|---|---|
| Primary audience | Engineers, ML ops, prompt authors | Auditors, regulators, courts, customers |
| Trust model | Internal: operator trusts vendor | External: third party verifies independently |
| Tamper-evidence | No (mutable DB rows) | Yes (rolling-hash + on-chain anchor) |
| Retention | Days to months; vendor-controlled | Lifetime of the system; outlives vendors |
| Model-state pinning | Optional tag fields | Manifest hash pinned per trace |
| Selective disclosure | Reveal full trace or none | Merkle inclusion proof per event |
| Latency tooling | First-class: flame graphs, p99 dashboards | Out of scope |
| Evaluation harness | First-class: eval datasets, scoring | Out of scope |
| Cost | $$$ per-seat SaaS | ~0.2–0.3 ADA per anchor (self-hosted) |

These are not competitors. The Venn-diagram overlap is the captured event stream; everything past that diverges by audience.

Pattern

The dual-stack pattern.

Most teams arrive at the same architecture once they have both an internal ML ops practice and an external compliance obligation. The pattern: one capture layer, two sinks.

  1. Capture once. Instrument the agent or model server with a single trace primitive that emits structured events: span start, tool call, retrieval result, completion, span end. Most teams already have this via OpenTelemetry or an agent framework's built-in tracing.
  2. Sink to observability. Forward the event stream to LangSmith / Langfuse / Datadog / Arize for engineering surfaces. Tune dashboards, alerts, evaluation datasets there. Treat this as ephemeral hot storage with a 30–180 day window.
  3. Sink to anchoring. Pass the same events into orynq tracing. Pin a model manifest hash at trace start. At trace close, finalize the Merkle bundle and anchor the root to Cardano under metadata label 2222.
  4. Cross-reference by trace ID. Use a single trace ID across both surfaces so a single decision can be looked up by engineers in observability and verified by auditors against the on-chain anchor, with no duplicated instrumentation and no diverging IDs.
  5. Persist the bundle. Anchoring proves the bundle existed; you still need to store the bundle so it is retrievable years later. Object storage with WORM (Object Lock), IPFS pinning, or Arweave permanent storage are the three common patterns.
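The capture-once, two-sink shape can be sketched in a few lines. Everything here is a hypothetical stand-in: `ObservabilitySink` and `AnchoringSink` are in-memory placeholders rather than real exporters or the Orynq SDK, and the manifest fields are illustrative. The point is only that both sinks receive the same event objects under one trace ID, with the manifest hash pinned on the first event.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field
from typing import Protocol


class Sink(Protocol):
    def handle(self, event: dict) -> None: ...


@dataclass
class ObservabilitySink:
    """Placeholder for an exporter to an observability backend."""
    received: list = field(default_factory=list)

    def handle(self, event: dict) -> None:
        self.received.append(event)


@dataclass
class AnchoringSink:
    """Placeholder evidence sink: keeps a rolling hash per trace ID."""
    heads: dict = field(default_factory=dict)

    def handle(self, event: dict) -> None:
        prev = self.heads.get(event["trace_id"], "0" * 64)
        payload = json.dumps(event, sort_keys=True)
        self.heads[event["trace_id"]] = hashlib.sha256((prev + payload).encode()).hexdigest()


def manifest_hash(manifest: dict) -> str:
    """Digest of the model manifest, stable under key reordering."""
    return hashlib.sha256(json.dumps(manifest, sort_keys=True).encode()).hexdigest()


class Tracer:
    """Single capture layer that fans every event out to all sinks."""

    def __init__(self, sinks: list[Sink], manifest: dict) -> None:
        self.sinks = list(sinks)
        self.manifest_digest = manifest_hash(manifest)

    def start_trace(self) -> str:
        trace_id = uuid.uuid4().hex  # one ID shared by both surfaces
        self.emit(trace_id, "span_start", {"manifest_sha256": self.manifest_digest})
        return trace_id

    def emit(self, trace_id: str, kind: str, data: dict) -> None:
        event = {"trace_id": trace_id, "type": kind, "data": data}
        for sink in self.sinks:
            sink.handle(event)


obs, anchor = ObservabilitySink(), AnchoringSink()
tracer = Tracer([obs, anchor], manifest={"model": "example-model", "system_prompt": "You are helpful."})
trace_id = tracer.start_trace()
tracer.emit(trace_id, "completion", {"tokens": 42})
tracer.emit(trace_id, "span_end", {})

print(len(obs.received), anchor.heads[trace_id][:16])
```

In a real deployment the anchoring sink would finalize and publish its digest at trace close, and the observability sink would be whichever exporter the team already runs. The instrumentation calls in the agent stay the same either way.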

The audit-trail pillar has the full cryptographic pattern write-up. The proof-of-inference pillar covers the anchoring deep dive. The Orynq SDK reference page shows the dual-sink integration in code.

FAQ

Common questions.

  • Can't I just store LangSmith traces forever and call it an audit trail? You can store them, but they remain mutable rows in a vendor database. An auditor who does not trust you (or who is required to verify independently) has no cryptographic way to confirm they have not been edited. That is the gap anchoring closes.
  • Does Datadog LLM Observability satisfy the EU AI Act? Article 12 requires “automatic recording of events” and traceability over the lifetime of the system. Datadog will record events; whether that satisfies the lifetime-of-system and traceability requirements is a question for your DPO and legal counsel. Most readings of the article treat a SaaS dashboard as one part of the answer, not the whole answer, especially where the system outlives a SaaS contract. See the Article 12 plain-English guide.
  • What about Helicone or Langfuse self-hosted? They're open source. Self-hosting removes the vendor-trust-boundary problem but does not add tamper-evidence. The operator can still edit the underlying database. Anchoring is orthogonal: run Langfuse for engineers, anchor the same events for auditors.
  • If I anchor, do I still need observability? Almost always yes. Anchoring is forensic — optimized for “produce a verifiable record of trace X.” It is not where you live during a Monday-morning incident review. Keep the observability stack for engineers; add anchoring for auditors.
  • Won't this double my instrumentation cost? No. The capture layer is the same. The two sinks differ in what they do with captured events; instrumentation lives once, in your agent. The economics are: SaaS observability for your engineers (high $, hot data, short retention) plus ~0.2–0.3 ADA per anchored trace (low $, cold evidence, lifetime retention).
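For budgeting, the anchoring side of that estimate is simple arithmetic. The sketch below uses the 0.2–0.3 ADA range quoted in this guide; the daily volume is a hypothetical input, and actual fees depend on network conditions and on how many traces share one anchor.

```python
from decimal import Decimal

ADA_PER_ANCHOR_LOW = Decimal("0.2")   # lower bound quoted in this guide
ADA_PER_ANCHOR_HIGH = Decimal("0.3")  # upper bound quoted in this guide


def anchoring_cost_ada(anchors: int) -> tuple[Decimal, Decimal]:
    """Return the (low, high) cost in ADA for a given number of anchors."""
    return anchors * ADA_PER_ANCHOR_LOW, anchors * ADA_PER_ANCHOR_HIGH


low, high = anchoring_cost_ada(anchors=1_000)  # hypothetical daily volume
print(f"{low} to {high} ADA per day")
```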
Ship it

Observability is for engineers. Anchoring is for auditors.

Add the audit layer in an afternoon. Keep the observability stack your team already loves.