Why do MCP servers for logs break?

Because most pattern IDs are produced by query-time re-clustering and drift between calls. An agent following a reference across tools dereferences a name that no longer exists, so every tool degenerates into a separate grep.

What made the investigate tool confidently wrong?

A Prometheus baseline guard passed near-zero divisors (a single sample at 0.0012 events per second), inflating relative changes to +9000%. The fix floors the baseline side at ten times the noise floor while leaving the current side open.

Why MCP servers for logs break

An environment-wide audit had log10x_investigate flagging patterns as "+100%" spikes; a per-service rerun on the same data found nothing crossed the noise floor. Two answers from the same tool, same data.

The bug was in the topk rate-change query, where we divide the current rate by the baseline rate. In PromQL, a comparison such as baseline > noise_floor is a filter, not a true/false test (unless you add the bool modifier): a sample that clears the floor passes through with its actual value, and that value becomes the divisor. A baseline of a single sample at 0.0012 events per second clears the floor but is still a tiny divisor, so the relative change inflates to +9000% even though the absolute rate is trivial. Those phantom movers floated to the top of the ranker, quoted with full confidence.
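
To make the failure concrete, here is a minimal sketch of the filter semantics in TypeScript. It is a model of the behavior, not the server's actual query: the function names are hypothetical, and the current rate of 0.1092 events per second is an illustrative value picked so the arithmetic lands near the +9000% quoted above.

```typescript
// Hypothetical names throughout; this models PromQL filter semantics only.
const NOISE_FLOOR = 0.001; // events per second

/** PromQL-style comparison without `bool`: a passing sample keeps its value, a failing one is dropped. */
function filterAbove(value: number, floor: number): number | undefined {
  return value > floor ? value : undefined;
}

/** Relative change in percent, or undefined when the baseline was filtered out. */
function relativeChangePct(current: number, baseline: number): number | undefined {
  const divisor = filterAbove(baseline, NOISE_FLOOR);
  if (divisor === undefined) return undefined;
  return ((current - divisor) / divisor) * 100;
}

// 0.0012 events/s clears the 0.001 floor, so it is used as the divisor as-is.
// The current rate is an assumed, illustrative value.
const phantomPct = relativeChangePct(0.1092, 0.0012);

// A baseline under the floor is dropped entirely, so no ratio is produced.
const droppedPct = relativeChangePct(1, 0.0005);
```

The guard does its job for baselines under the floor, but anything barely over it becomes a divisor that multiplies a trivial absolute change into a headline number.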

The fix comes later in the post. The diagnosis matters more, because the failure it exposes is not ours alone.

The server in question is at github.com/log-10x/log10x-mcp, Apache 2.0, installable with npx -y log10x-mcp. The Dev CLI tier tools, log10x_resolve_batch and log10x_dependency_check, work with no Log10x backend; the investigation tools need the paid Reporter and Receiver pipeline behind them.

Why they break

Datadog Log Patterns and Splunk Pattern Explorer re-cluster on every query, so the cluster ID drifts from one hour to the next. That churn is invisible to someone reading a dashboard. To an agent following a reference across logs, metrics, and traces, those IDs are useless: the agent dereferences a name that no longer exists, and the server collapses into dozens of separate ways to grep.

So here's the opinion I'll defend. Frequency-based pattern clustering is fine for a dashboard and broken as the foundation an agent stands on. A dashboard reader wants the loudest thing; an agent following a causal chain wants a vocabulary that holds still. Two different products, shipped as one for a decade.

An engineer pasted a log line into Claude and asked "is this new." The answer was correct because every tool in our MCP server rides on one identifier I'll call the pattern hash.

The pattern hash is the same string for every line from the same logging statement in your code, unchanged across queries, restarts, and deploys. cart_cartstore_ValkeyCartStore is one such name, derived deterministically from the source-code logging statement, not from runtime clustering.

The tools compose because the names compose. log10x_top_patterns returns a dollar-ranked list of pattern hashes. Feed one to log10x_investigate for a causal chain of other pattern hashes. Feed one of those to log10x_retriever_query for the raw events out of your own S3 archive. No tool may invent a name, rename a pattern, or re-cluster the universe between calls.

What `log10x_investigate` does

You hand investigate an anchor and it works out what moved around it. The anchor is a string: a raw log line, a pattern hash, a service name, or the token `environment` for "investigate everything at once."

It classifies the anchor's time-series shape as acute, drift, or flat. Flat terminates with an honest empty result; acute branches into the correlation engine.

The correlation engine queries Prometheus for every other pattern in scope, shifting the comparison window by a few offsets. If a candidate's curve lines up with the anchor's shifted back sixty seconds, that candidate moved first. The offsets are in src/lib/correlate.ts:

```typescript
const LAG_OFFSETS_SECONDS = [30, 60, 120, 180, 300];
```
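
A rough sketch of the lag scan, under loud assumptions: the post does not say how alignment is scored, so Pearson correlation stands in here, and the series are in-memory arrays sampled at a fixed step, not Prometheus queries. `pearson` and `bestLeadSeconds` are hypothetical names.

```typescript
const LAG_OFFSETS_SECONDS = [30, 60, 120, 180, 300];

/** Pearson correlation of two series; returns 0 when either side is constant or too short. */
function pearson(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length);
  if (n < 2) return 0;
  let sumA = 0;
  let sumB = 0;
  for (let i = 0; i < n; i++) { sumA += a[i]; sumB += b[i]; }
  const meanA = sumA / n;
  const meanB = sumB / n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db; varA += da * da; varB += db * db;
  }
  return varA === 0 || varB === 0 ? 0 : cov / Math.sqrt(varA * varB);
}

/** Compares candidate[t] with anchor[t + lag] for each offset and keeps the best-scoring lag. */
function bestLeadSeconds(
  anchor: number[],
  candidate: number[],
  stepSeconds: number,
): { lagSeconds: number; score: number } | undefined {
  let best: { lagSeconds: number; score: number } | undefined;
  for (const lagSeconds of LAG_OFFSETS_SECONDS) {
    const shift = Math.round(lagSeconds / stepSeconds);
    if (shift >= anchor.length) continue; // nothing left to compare at this lag
    const score = pearson(candidate.slice(0, candidate.length - shift), anchor.slice(shift));
    if (best === undefined || score > best.score) best = { lagSeconds, score };
  }
  return best;
}
```

With a 30-second step, a candidate that spikes two samples before the anchor scores highest at the 60-second offset, which is the "moved first" signal described above.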

Patterns that peaked before the anchor get linked into a chain ordered by lead time. Each link carries a confidence sub-score, all visible to the model:

```typescript
// src/lib/correlate.ts
export interface ChainLink {
  mover: CoMover;
  /** Stat sub-score (0-1) — magnitude above noise floor. */
  stat: number;
  /** Lag tightness (0-1) — sharpness of the peak across offsets. */
  lag: number;
  /** Chain coherence (0-1) — how well this link fits the chain vs star pattern. */
  chain: number;
  /** Final per-link confidence = stat * lag * chain. */
  confidence: number;
}
```

Chain coherence rewards links that form a line of cause and effect and penalizes a hub where everything points at one node by coincidence. investigate emits this decomposition rather than letting the model grade its own confidence. When the model did grade itself, it gave the prose you'd expect: high on everything, no relationship to the data.
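
The product form is worth a small worked example, because it means one weak sub-score drags the whole link down. This is a sketch under assumptions: `SubScores` and `linkConfidence` are hypothetical names, and the clamping to the 0-1 range is added here for safety, not taken from the source.

```typescript
// Hypothetical helper mirroring "confidence = stat * lag * chain".
interface SubScores {
  stat: number;
  lag: number;
  chain: number;
}

/** Clamp to the documented 0-1 range (an assumption; the source may not clamp). */
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

/** Multiplies the three sub-scores into a single per-link confidence. */
function linkConfidence({ stat, lag, chain }: SubScores): number {
  return clamp01(stat) * clamp01(lag) * clamp01(chain);
}

// Strong magnitude and a tight lag, but the link sits in a hub rather than a chain.
const hubLink = linkConfidence({ stat: 0.9, lag: 0.8, chain: 0.5 }); // about 0.36
```

Because the sub-scores multiply, a link cannot buy confidence with magnitude alone; a coincidental hub stays low however loud it is.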

One tool earns its place by refusing

log10x_discover_join finds the field that lines up a log pattern with its APM service. It scores how much the label values on each side overlap, using Jaccard similarity, and treats two labels as the same dimension when that overlap clears 0.7, relaxing to 0.4 for pairs whose names already alias the same dimension. Below the floor, it returns a structured no_join_available refusal instead of a temporal-only ranking.

The refusal is the feature. A correlation with no structural evidence is a coincidence, and calling a coincidence a finding is the failure mode the join exists to prevent.
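
The overlap test can be sketched in a few lines. The thresholds 0.7 and 0.4 and the no_join_available status come from the description above; everything else (the function names, the result shape, and treating "clears" as greater-than-or-equal) is an assumption for illustration.

```typescript
const JOIN_THRESHOLD = 0.7;
const ALIASED_NAME_THRESHOLD = 0.4;

/** Jaccard similarity: size of the intersection over size of the union. */
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  a.forEach((value) => {
    if (b.has(value)) intersection++;
  });
  return intersection / (a.size + b.size - intersection);
}

// Hypothetical result shape; only the status string is taken from the post.
type JoinResult =
  | { status: "join"; similarity: number }
  | { status: "no_join_available"; similarity: number };

/** Accepts the join only when label-value overlap clears the applicable floor. */
function scoreJoin(logValues: Set<string>, apmValues: Set<string>, namesAlias: boolean): JoinResult {
  const similarity = jaccard(logValues, apmValues);
  const floor = namesAlias ? ALIASED_NAME_THRESHOLD : JOIN_THRESHOLD;
  return similarity >= floor
    ? { status: "join", similarity }
    : { status: "no_join_available", similarity };
}

// Three of four distinct values are shared: 3 / 4 = 0.75, which clears 0.7.
const strong = scoreJoin(new Set(["cart", "checkout", "payment", "shipping"]), new Set(["cart", "checkout", "payment"]), false);
```

An overlap of 0.5 would pass only for a pair whose names already alias the same dimension, and anything under 0.4 is refused either way.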

The fix is asymmetric on purpose

Back to the phantom spike from the opening. The fix guards the baseline side with a floor of ten times the per-series noise floor. The base floor is 0.001 events per second, so the guard is 0.01: thirty-six events an hour, or you don't qualify as a divisor.

The argument is in what we did not guard. The current side stays open, because a spike from a quiet baseline up to a moderate rate is a real incident. A crashlooping pod has a low absolute rate but is exactly what investigate should flag: the accounting service's Kerberos dlopen failure, 22 restarts averaging 8 events an hour. The floor is asymmetric on purpose, killing the near-zero-baseline amplification while keeping the low-volume incident.
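
As a sketch of the asymmetry, assuming the guard is a plain threshold check on the baseline and that a disqualified baseline simply yields no ratio (the source does not say how disqualified series are reported): `guardedChangePct` is a hypothetical name, while the 0.001 floor and the tenfold multiplier are the figures given above.

```typescript
const NOISE_FLOOR = 0.001; // events per second
const BASELINE_GUARD = 10 * NOISE_FLOOR; // 0.01 events/s, about 36 events an hour

/**
 * Relative change in percent. Only the baseline is tested against the guard;
 * the current rate is never floored, so low-volume movement stays visible.
 */
function guardedChangePct(current: number, baseline: number): number | undefined {
  if (baseline < BASELINE_GUARD) return undefined; // too small to be a trustworthy divisor
  return ((current - baseline) / baseline) * 100;
}

// The phantom spike from the opening no longer produces a ratio.
const phantomAfterFix = guardedChangePct(0.1092, 0.0012);

// A quiet but qualifying baseline still yields a real +100% when the rate doubles.
const doubled = guardedChangePct(0.04, 0.02);

// Guard expressed per hour, for comparison with the figure in the text.
const guardEventsPerHour = Math.round(BASELINE_GUARD * 3600);
```

A symmetric version would add the same check on `current`, which is exactly the condition left out here.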

The server ships a NUMBERS DISCIPLINE block the model reads every session: don't recompute a percentage the tool already emits, don't invent peak values when the tool returns window averages, treat honest empty returns as a feature. Every line is there because we watched a model do that wrong thing once and fixed it in the prompt.

Tools tier with infrastructure, not with a paywall

A real decision sat at the front of building this: ship every tool always and let it fail without the infrastructure, or tier them so you only see tools your environment can answer? We picked the second: a useful log assistant shouldn't be conditional on a vendor relationship, so the tools that need no Log10x deployment keep working with none.

The Dev CLI tier needs no pipeline infrastructure. Paste lines from a Slack incident into log10x_resolve_batch for per-pattern frequency, severity, and variable concentration. About to delete a log.info() call? log10x_dependency_check returns a script that scans your SIEM, dashboards, and alerts for anything still referencing it.

The paid tiers add tools only when the infrastructure can answer them. Reporter, the read-only fluent-bit DaemonSet, emits per-pattern cost and volume time series. Receiver, the sidecar, filters, samples, and compacts events in flight and sees the ones dropped before the SIEM gets them. Together they add top_patterns, pattern_trend, event_lookup, savings, and investigate itself. Retriever adds log10x_retriever_query, forensic retrieval of the events the Receiver held back, read from your own S3 bucket through pre-computed Bloom filters.
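
One way to picture the tiering is as a function from deployed components to visible tools. This is an illustrative sketch, not the server's registration code: the `Deployment` shape and `visibleTools` are hypothetical, the tool names are written as they appear in this post, and treating Retriever as independent of the other two components is an assumption.

```typescript
// Hypothetical description of what is deployed in an environment.
interface Deployment {
  reporter: boolean;
  receiver: boolean;
  retriever: boolean;
}

const DEV_CLI_TOOLS = ["log10x_resolve_batch", "log10x_dependency_check"];
const PIPELINE_TOOLS = ["top_patterns", "pattern_trend", "event_lookup", "savings", "investigate"];
const RETRIEVER_TOOLS = ["log10x_retriever_query"];

/** Lists only the tools the given infrastructure can actually answer. */
function visibleTools(d: Deployment): string[] {
  const tools = [...DEV_CLI_TOOLS]; // always available, no backend needed
  if (d.reporter && d.receiver) tools.push(...PIPELINE_TOOLS);
  if (d.retriever) tools.push(...RETRIEVER_TOOLS);
  return tools;
}

// With nothing deployed, the two Dev CLI tools are still there.
const bare = visibleTools({ reporter: false, receiver: false, retriever: false });
```

The point of the shape is that a tool the environment cannot answer is absent from the list instead of present and failing.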

A deterministic foundation under tools that refuse to fabricate: that is what "AI-queryable observability" has to mean. Anything short of it is a chatbot in front of a SIEM.

The server is at github.com/log-10x/log10x-mcp, Apache 2.0, installable with npx -y log10x-mcp. The correlation engine is in src/lib/correlate.ts, the orchestration in src/tools/investigate.ts, the instructions block in src/index.ts. Wire it against your own pipeline, follow a chain from top_patterns through investigate into retriever_query, and tell me whether any link it returns is one you'd disagree with.

Related: why Drain pattern IDs drift, what else lives on those patterns, and where log structure actually lives. For the same data inside a SIEM, see the Elasticsearch, Splunk, and ClickHouse posts.

Read more

The Drain pattern ID is a join column you can't trust

We shipped a Rust UDF for ClickHouse, then deleted it: plain SQL was 100x faster

Cutting Elasticsearch log storage without rewriting a single query

Cutting Splunk log storage without rewriting a single query
