Can't I just store LangSmith or Langfuse traces forever and call it an audit trail?

You can store them, but they remain mutable rows in a vendor database. An auditor who does not trust the operator has no cryptographic way to confirm the records have not been edited. Cryptographic anchoring closes that gap by recording a tamper-evident hash chain on a public ledger the operator does not control.

Does Datadog LLM Observability satisfy EU AI Act Article 12?

Article 12 requires automatic recording of events and traceability for the lifetime of the high-risk AI system. A SaaS observability dashboard is one part of the answer but is typically not sufficient on its own — the system frequently outlives a SaaS contract, and the article is widely read as requiring tamper-evident records the operator cannot silently mutate. Consult your DPO and legal counsel.

If I anchor traces on-chain, do I still need LLM observability?

Almost always yes. Cryptographic anchoring is forensic — optimized for producing a verifiable record of one specific trace to a third party. It is not where engineers live during a Monday-morning incident review. Keep observability for engineers; add anchoring for auditors. The dual-stack pattern shares one capture layer across both.

What are the five properties auditors need that LLM observability tools don't provide?

Tamper-evidence (cryptographic detection of any change), third-party verifiability (no operator or vendor trust required), retention discipline matching the lifetime of the system, model-state binding via a pinned manifest hash, and selective disclosure via Merkle inclusion proofs so a single decision can be revealed without exposing the rest of the trace history.

How much does the audit-trail layer cost on top of LLM observability?

Approximately 0.2 to 0.3 ADA per anchored trace on Cardano L1 (roughly 0.10 to 0.20 USD at 2026 prices) when self-hosted via the Orynq SDK. The capture layer is shared with the observability stack, so instrumentation is not duplicated.

FLUX POINT STUDIOS

Guide · updated 2026-05-18

Inference logging vs. observability. What auditors need.

LLM observability tools — LangSmith, Datadog LLM Observability, Arize, Helicone, Langfuse — answer engineering questions. AI audit trails answer regulator questions. The two stacks overlap, but only one of them survives a five-year subpoena.


TL;DR

Short version.

LLM observability platforms are debugging tools for the team that owns the model. They capture prompt traces, latency, token cost, retrieval steps, and evaluation scores so engineers can iterate on prompts, agents, and RAG pipelines. They are excellent at this.

AI audit-trail systems are evidence stores for parties the operator does not control — regulators, courts, customers, counterparties. They capture a tamper-evident record of what the system did, signed in a way a third party can verify without trusting the operator. They are not interchangeable with observability.

The right answer for most production AI stacks in 2026 is to run both. Observability for the engineers; cryptographic logging for the auditors. Same traces, two surfaces. This guide explains why, what each layer actually does, and how to bolt them together without duplicating instrumentation.

Definition

What LLM observability does.

LLM observability is the operations-side telemetry surface for an AI application. The canonical vendors at the time of writing are LangSmith (LangChain's commercial product), Datadog LLM Observability, Arize Phoenix, Helicone, and Langfuse. They differ in framing and integrations, but the feature surface is broadly consistent:

Prompt and span tracing. Capture the inputs, outputs, intermediate tool calls, retrieval steps, and timing of every LLM invocation. View as a flame graph or tree.
Cost and latency metrics. Per-call token counts, model-by-model spend, p50/p95/p99 latency. Slice by user, environment, feature flag.
Evaluation harnesses. Score outputs against rubrics (LLM-as-judge, exact match, semantic similarity). Compare versions of a prompt or chain.
Dataset curation. Promote failing examples into regression suites. Replay traces with edits to debug.
Alerting. Page on cost spikes, error rates, quality regressions.

These are debugging and operations capabilities. The audience is the team that owns the model. The retention horizon is days to weeks for hot data, sometimes longer for evaluation datasets. The trust boundary is internal: engineers trust the platform because they pay for it.

None of these systems is designed to produce evidence for an outside party who does not trust the operator. That is not a criticism; it is not what they are for.

Audit grade

What an auditor actually needs.

An AI auditor — internal, external, regulatory — is asking a very different question than an engineer. The engineer asks “why did this trace fail and how do I fix it?” The auditor asks “can you prove this trace happened, in this order, with this model, and that nobody has altered the record since?”

Five properties separate the two. An audit-grade log has all of them; an observability log has, at most, one or two.

Tamper-evidence. Any insertion, deletion, or reordering of an event must be cryptographically detectable. A row in Postgres with an UPDATE permission is not tamper-evident. A rolling-hash event chain anchored to a public ledger is.
Third-party verifiability. An outside party must be able to verify the record without trusting the operator or the vendor. “Our SOC 2 says we don't modify logs” is not verifiability; it is assertion.
Retention discipline. EU AI Act Article 12 requires high-risk systems to support automatic event logging over the “lifetime of the system.” That horizon routinely outlives a SaaS contract. Audit storage must survive vendor changes, company shutdowns, and migrations.
Model-state binding. The log must bind each decision to the exact model, tokenizer, system prompt, and training-data state that produced it. Otherwise “the model changed since then” is an unfalsifiable defense.
Selective disclosure. The auditor may need to see one decision in isolation without exposing the other 100,000 traces. Merkle inclusion proofs do this; SQL queries against a SaaS dashboard do not.
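
The tamper-evidence property above can be illustrated with a rolling-hash event chain. This is a minimal sketch, not any vendor's format: the event fields, the all-zero genesis value, and the helper names `event_digest` and `chain_head` are assumptions made for illustration. Each link hashes the previous link together with a canonical encoding of the event, so the final head changes if any event is edited, removed, or reordered.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain (illustrative)


def event_digest(prev_hash: str, event: dict) -> str:
    """Hash the previous link together with a canonical encoding of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def chain_head(events: list[dict]) -> str:
    """Fold the events into a single head hash, in order."""
    head = GENESIS
    for event in events:
        head = event_digest(head, event)
    return head


# Hypothetical events for one trace.
events = [
    {"seq": 0, "type": "span_start", "trace_id": "t-1"},
    {"seq": 1, "type": "tool_call", "name": "search"},
    {"seq": 2, "type": "completion", "tokens": 42},
]
anchored_head = chain_head(events)  # the value that would be published

# Three kinds of tampering, each of which changes the head.
edited = [events[0], {**events[1], "name": "delete_user"}, events[2]]
reordered = [events[1], events[0], events[2]]
truncated = events[:2]

for label, candidate in [("edited", edited), ("reordered", reordered), ("truncated", truncated)]:
    print(label, "detected:", chain_head(candidate) != anchored_head)
```

The chain only detects tampering relative to a head the operator cannot rewrite, which is why the head is anchored somewhere outside the operator's control.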

See the audit-trail pillar for the long-form treatment of why these five properties are the floor, not a wishlist.
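
Selective disclosure is easier to see with a small Merkle tree. The sketch below is a simplified illustration, assuming SHA-256, domain-separated leaf and node hashes, and duplication of the last node on odd-sized levels; a production scheme may choose differently on all three. The function names are hypothetical. The point is that an auditor holding only the root, one event, and a short list of sibling hashes can confirm that event belongs to the bundle without seeing any other event.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(event: bytes) -> bytes:
    return h(b"\x00" + event)  # prefix separates leaves from inner nodes


def node_hash(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)


def build_levels(events: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, leaves first. Assumes at least one event."""
    levels = [[leaf_hash(e) for e in events]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # simplification: duplicate the odd node
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(event: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    acc = leaf_hash(event)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


# Hypothetical bundle of five serialized events.
bundle = [f"event-{i}".encode() for i in range(5)]
levels = build_levels(bundle)
root = levels[-1][0]

proof = prove(levels, 2)
print("genuine event verifies:", verify(bundle[2], proof, root))
print("altered event verifies:", verify(b"event-2-altered", proof, root))
```

The proof grows with the logarithm of the bundle size, so revealing one decision out of 100,000 traces takes roughly seventeen sibling hashes rather than the full history.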

Comparison

Where they overlap, where they diverge.

Both stacks instrument the same thing — the inference call, its inputs, outputs, and intermediate steps. The divergence is what happens to those events after capture. Side-by-side:

Property	LLM observability (LangSmith, Datadog, Arize, Helicone, Langfuse)	AI audit trail (cryptographic anchoring, Orynq)
Primary audience	Engineers, ML ops, prompt authors	Auditors, regulators, courts, customers
Trust model	Internal — operator trusts vendor	External — third party verifies independently
Tamper-evidence	No (mutable DB rows)	Yes (rolling-hash + on-chain anchor)
Retention	Days to months; vendor-controlled	Lifetime of the system; outlives vendors
Model-state pinning	Optional tag fields	Manifest hash pinned per trace
Selective disclosure	Reveal full trace or none	Merkle inclusion proof per event
Latency tooling	First-class — flame graphs, p99 dashboards	Out of scope
Evaluation harness	First-class — eval datasets, scoring	Out of scope
Cost	$$$ per-seat SaaS	~0.2–0.3 ADA per anchor (self-hosted)

These are not competitors. The Venn-diagram overlap is the captured event stream; everything past that diverges by audience.

Pattern

The dual-stack pattern.

Most teams arrive at the same architecture once they have both an internal ML ops practice and an external compliance obligation. The pattern: one capture layer, two sinks.

01. Capture once. Instrument the agent or model server with a single trace primitive that emits structured events: span start, tool call, retrieval result, completion, span end. Most teams already have this via OpenTelemetry or an agent framework's built-in tracing.
02. Sink to observability. Forward the event stream to LangSmith / Langfuse / Datadog / Arize for engineering surfaces. Tune dashboards, alerts, evaluation datasets there. Treat this as ephemeral hot storage with a 30–180 day window.
03. Sink to anchoring. Pass the same events into orynq tracing. Pin a model manifest hash at trace start. At trace close, finalize the Merkle bundle and anchor the root to Cardano under metadata label 2222.
04. Cross-reference by trace ID. Use a single trace ID across both surfaces so a single decision can be looked up by engineers in observability and verified by auditors against the on-chain anchor, with no duplicated instrumentation and no diverging IDs.
05. Persist the bundle. Anchoring proves the bundle existed; you still need to store the bundle so it is retrievable years later. Object storage with WORM (Object Lock), IPFS pinning, or Arweave permanent storage are the three common patterns.
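
The steps above can be sketched as one capture layer fanning out to two sinks. This is an in-memory illustration under stated assumptions, not the Orynq SDK's API: `Trace`, `AnchorSink`, the event fields, and the manifest contents are all hypothetical, and a single SHA-256 over the ordered events stands in for the Merkle root that would actually be anchored.

```python
import hashlib
import json
import uuid
from typing import Callable

Event = dict
Sink = Callable[[Event], None]


def canonical(obj) -> bytes:
    """Stable byte encoding so the same content always hashes the same way."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


class Trace:
    """Hypothetical capture layer: every event goes to every registered sink."""

    def __init__(self, manifest: dict, sinks: list[Sink]):
        self.trace_id = uuid.uuid4().hex  # one ID shared by both surfaces
        self.manifest_hash = hashlib.sha256(canonical(manifest)).hexdigest()
        self.sinks = sinks
        self.seq = 0
        # Pin the model manifest hash at trace start.
        self.emit("span_start", manifest_hash=self.manifest_hash)

    def emit(self, kind: str, **fields) -> None:
        event = {"trace_id": self.trace_id, "seq": self.seq, "type": kind, **fields}
        self.seq += 1
        for sink in self.sinks:
            sink(event)


class AnchorSink:
    """Hypothetical audit sink: buffers events, yields a digest at trace close."""

    def __init__(self):
        self.events: list[Event] = []

    def __call__(self, event: Event) -> None:
        self.events.append(event)

    def finalize(self) -> str:
        # Stand-in for a Merkle root; this is the value that would be anchored.
        return hashlib.sha256(canonical(self.events)).hexdigest()


observability_buffer: list[Event] = []  # stand-in for an observability exporter
anchor = AnchorSink()

trace = Trace(
    manifest={
        "model": "example-model",
        "tokenizer": "example-tokenizer",
        "system_prompt_sha256": hashlib.sha256(b"You are a helpful assistant.").hexdigest(),
    },
    sinks=[observability_buffer.append, anchor],
)
trace.emit("tool_call", name="search")
trace.emit("completion", output_tokens=42)
trace.emit("span_end")

bundle_digest = anchor.finalize()
print("trace", trace.trace_id, "digest", bundle_digest)
```

Because both sinks receive identical events carrying the same trace ID, an engineer and an auditor can start from the same identifier and reach the same record through different surfaces.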

The audit-trail pillar has the full cryptographic pattern write-up. The proof-of-inference pillar covers the anchoring deep dive. The Orynq SDK reference page shows the dual-sink integration in code.


FAQ

Common questions.

Can't I just store LangSmith traces forever and call it an audit trail?
You can store them, but they remain mutable rows in a vendor database. An auditor who does not trust you (or who is required to verify independently) has no cryptographic way to confirm they have not been edited. That is the gap anchoring closes.

Does Datadog LLM Observability satisfy the EU AI Act?
Article 12 requires “automatic recording of events” and traceability over the lifetime of the system. Datadog will record events; whether it satisfies the lifetime-of-system retention and the traceability requirement is a question for your DPO and legal counsel. Most readings of the article treat a SaaS dashboard as one part of the answer, not the whole answer, especially where the system outlives a SaaS contract. See the Article 12 plain-English guide.

What about Helicone or Langfuse self-hosted? They're open source.
Self-hosting removes the vendor-trust-boundary problem but does not add tamper-evidence. The operator can still edit the underlying database. Anchoring is orthogonal: run Langfuse for engineers, anchor the same events for auditors.

If I anchor, do I still need observability?
Almost always yes. Anchoring is forensic: it is optimized for “produce a verifiable record of trace X.” It is not where you live during a Monday-morning incident review. Keep the observability stack for engineers; add anchoring for auditors.

Won't this double my instrumentation cost?
No. The capture layer is the same. The two sinks differ in what they do with captured events; instrumentation lives once, in your agent. The economics are: SaaS observability for your engineers (high $, hot data, short retention) plus ~0.2–0.3 ADA per anchored trace (low $, cold evidence, lifetime retention).
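
For budgeting, the per-anchor figure can be turned into a monthly range. This is a rough sketch: the 0.2 to 0.3 ADA range comes from the text above, while the monthly trace volume is a made-up example and the function name is hypothetical. It deliberately stays in ADA, since any fiat conversion depends on a price you would need to supply.

```python
from decimal import Decimal

# Range quoted in this guide for one anchored trace.
ADA_PER_ANCHOR_LOW = Decimal("0.2")
ADA_PER_ANCHOR_HIGH = Decimal("0.3")


def monthly_anchor_cost_ada(traces_per_month: int) -> tuple[Decimal, Decimal]:
    """Return the (low, high) anchoring cost in ADA for a month of traces."""
    return (
        traces_per_month * ADA_PER_ANCHOR_LOW,
        traces_per_month * ADA_PER_ANCHOR_HIGH,
    )


# Hypothetical volume: 10,000 anchored traces per month.
low, high = monthly_anchor_cost_ada(10_000)
print(f"{low} to {high} ADA per month")
```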

Ship it

Observability is for engineers. Anchoring is for auditors.

Add the audit layer in an afternoon. Keep the observability stack your team already loves.

