The Drain pattern ID is a join column you can't trust
Every alert and dashboard keyed on a Drain pattern ID stops matching the moment that ID moves, and it can move silently every time a pod restarts. We measured how badly it drifts. The answer turns on one variable: whether an instance sees the whole stream or only a slice. Reordering the same stream barely moves the clusters (0.93 template overlap), while splitting it across instances collapses them (0.18).
Two things can move. The internal number a template gets depends on arrival order, so the numbers drift. And which templates exist depends on volume: a line type only clusters once it appears often enough, so a thin slice never grows the rare ones.
The dominant algorithm is Drain (He et al., ICWS 2017), a fixed-depth prefix tree that clusters lines by structural similarity. Drain3 is the popular open-source variant. Elastic described an earlier categorize_text preview as a modified Drain, though its current algorithm is its own; Datadog and Splunk don't publish theirs, but their output behaves like Drain's. None publishes a cross-instance stability contract, because the ID a line gets depends on what the instance saw first.
Everything measured here is public: the corpus is a 215 MB release asset, the exact Drain3 configuration is spelled out below, and a short driver reproduces every number. The one piece that is not public is the stable-hash architecture the post ends on, which is the commercial engine.
What we ran
The corpus is otel-sample-200mb.log, published as a release asset on the Log10x config repo: 215 MB, 197,430 lines of OpenTelemetry-demo output (an 8 MB gzip is attached alongside).
The parser is Drain3, configured as one production vendor's Helm chart does: depth=6, similarity threshold 0.6, max 20 children per node, max 2000 clusters, parameter token <*>. A typed-token preprocessor strips UUIDs, ISO timestamps, IPs, hex strings, and numbers, and each line is capped at 1024 chars.
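As a rough illustration of what a typed-token preprocessor of this kind does, here is a minimal sketch. The regexes, the token names (<UUID>, <TS>, and so on), the function name preprocess, and the choice to truncate before masking are all assumptions for illustration, not the vendor chart's actual rules; only the token classes and the 1024-char cap come from the setup described above.

```python
import re

MAX_CHARS = 1024  # cap taken from the configuration described above

# Illustrative patterns only (assumptions). Order matters: the specific
# patterns run before the generic number rule so they are not half-masked.
MASKS = [
    ("UUID", re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b")),
    ("TS", re.compile(
        r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
        r"(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("HEX", re.compile(r"\b0x[0-9a-fA-F]+\b|\b[0-9a-fA-F]{16,}\b")),
    ("NUM", re.compile(r"(?<![\w.])[-+]?\d+(?:\.\d+)?(?![\w.])")),
]


def preprocess(line: str) -> str:
    """Truncate the line, then replace each typed value with a typed token."""
    line = line[:MAX_CHARS]
    for name, pattern in MASKS:
        line = pattern.sub(f"<{name}>", line)
    return line


if __name__ == "__main__":
    print(preprocess("2024-05-01T12:00:00Z user 42 from 10.0.0.7"))
```

The point of masking before clustering is that two lines differing only in a typed value reach the parser as the same string, so the similarity threshold is spent on structure rather than on IDs and timestamps.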
Each cluster has two identities: an arbitrary internal number and a template string like user <*> logged in. We compared the strings, not the numbers. The integer ID you alert on moves even when the string holds, so every overlap number below is a ceiling on ID stability.
Two experiments:
- Shuffle. Feed Drain the same 197K lines in 30 random orders. Does the cluster set move?
- Split. Slice the corpus into 4 disjoint quarters, run Drain on each independently, 30 times. This models one instance seeing only part of the traffic, normal once you run more than one node.
Jaccard overlap below is the fraction of patterns two runs agree on: 1.0 identical, 0 none.
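The two experiments and the overlap metric can be sketched as follows. Here mine_templates is a hypothetical stand-in for one full parser run (for example, a fresh Drain3 instance fed the given lines) that returns the set of final template strings. The seeds, the function names, and the shuffle-then-stride way of cutting quarters are likewise assumptions, not the driver's actual code.

```python
import random
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: one parser run over some lines -> set of templates.
Miner = Callable[[Sequence[str]], set]


def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(sets: Sequence[set]) -> float:
    """Average Jaccard over every unordered pair of runs (needs >= 2 runs)."""
    return mean(jaccard(a, b) for a, b in combinations(sets, 2))


def shuffle_trial(lines: Sequence[str], mine_templates: Miner,
                  runs: int = 30, seed: int = 0) -> float:
    """Same lines, different arrival order on every run."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        order = list(lines)
        rng.shuffle(order)
        results.append(mine_templates(order))
    return mean_pairwise_jaccard(results)


def split_trial(lines: Sequence[str], mine_templates: Miner,
                parts: int = 4, seed: int = 0) -> float:
    """Disjoint random slices, each mined by an independent instance."""
    order = list(lines)
    random.Random(seed).shuffle(order)
    slices = [order[i::parts] for i in range(parts)]
    return mean_pairwise_jaccard([mine_templates(s) for s in slices])
```

Repeating split_trial with 30 different seeds would give the 30 split trials; each call returns the intra-trial pairwise overlap for one four-way split.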
Drain is stable under shuffles
Across 30 shuffles of the full 197K-line corpus:
- Baseline cluster count (unshuffled): 1,450
- Cluster count across 30 shuffles: 1,464 ± 2.7 (range 1,457 to 1,469)
- Pairwise template-set Jaccard (n=30 runs): 0.9295 ± 0.0067 (min 0.91)
- Per-event identity across all 30 reorderings (10K sample): 99.6%
This was the surprise. We expected reordering to scramble the cluster set; instead the headline templates stay put and most events keep their template.
The 99.6% needs a caveat: it never tests the long tail an SRE alerts on. The top 10 templates account for 71% of events, and 1,206 of the 1,450 baseline templates never appear in the 10K sample, so the rare errors that alerts target aren't exercised.
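One plausible reading of the per-event identity figure is the fraction of sampled events that land on the same template string in every run. The sketch below implements that reading. The definition, the function name, and the event-key-to-template mapping are assumptions, since the exact metric the driver computes is not spelled out here.

```python
from typing import Mapping, Sequence


def per_event_identity(runs: Sequence[Mapping[str, str]]) -> float:
    """Fraction of events assigned the same template string in every run.

    runs[i] maps an event key (for example, the line number in the
    unshuffled file) to the template string run i assigned to that event.
    Every run is assumed to cover the same set of event keys.
    """
    events = runs[0].keys()
    stable = sum(1 for e in events if len({run[e] for run in runs}) == 1)
    return stable / len(events)
```

Because the score is averaged over events rather than templates, a handful of high-volume templates can carry it, which is why the caveat above matters.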
We never hit the max_clusters cap of 2000; the observed maximum was 1,469. A real fleet emits far more than 1,450 template types, so production will hit the cap. Then every new pattern evicts the least-recently-seen one, and a recurring pattern is reborn under a fresh ID. We never triggered that eviction, so the drift measured here is a floor on what production sees.
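The eviction-and-rebirth effect is easy to see in a toy model. The class below is a deliberately simplified stand-in (an LRU map from template string to a sequential ID), not Drain3's implementation; the class name, method name, and cap of 2 are made up for the illustration.

```python
from collections import OrderedDict
from itertools import count


class ToyClusterCache:
    """Toy LRU store: template string -> sequential ID. Not Drain3's code."""

    def __init__(self, max_clusters: int):
        self.max_clusters = max_clusters
        self._ids = OrderedDict()
        self._next_id = count(1)

    def observe(self, template: str) -> int:
        if template in self._ids:
            self._ids.move_to_end(template)   # mark as most recently seen
            return self._ids[template]
        if len(self._ids) >= self.max_clusters:
            self._ids.popitem(last=False)     # evict the least recently seen
        cluster_id = next(self._next_id)      # IDs are never reused
        self._ids[template] = cluster_id
        return cluster_id


cache = ToyClusterCache(max_clusters=2)
first = cache.observe("disk <*> full")        # gets ID 1
cache.observe("user <*> logged in")           # gets ID 2
cache.observe("payment <*> declined")         # gets ID 3, evicts "disk <*> full"
reborn = cache.observe("disk <*> full")       # same template, now ID 4
print(first, reborn)
```

The template string is identical both times; only the handle changed, which is exactly what an alert keyed on the integer cannot see.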
Drain collapses under content splits
Across 30 four-way disjoint splits:
- Intra-trial pairwise Jaccard: 0.1807 ± 0.0027 (range 0.176 to 0.188)
- Clusters per quarter: 601 ± 16
Pairwise Jaccard drops to 0.18. Each quarter finds about 600 of the 1,450 templates the full dataset holds; the other 850 never cleared the clustering threshold in a thin slice. This is the second failure mode: not the same templates renumbered, but a smaller set.
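To translate 0.18 into a count of shared templates, the Jaccard definition can be inverted. This is a back-of-envelope sketch that assumes both quarters hold exactly the mean of 601 clusters; the function name is ours.

```python
def shared_templates(jaccard: float, size_a: float, size_b: float) -> float:
    """Solve J = x / (size_a + size_b - x) for the intersection size x."""
    return jaccard * (size_a + size_b) / (1.0 + jaccard)


# Mean intra-trial Jaccard and mean clusters per quarter from the results above.
shared = shared_templates(0.1807, 601, 601)
print(round(shared))
```

Under that assumption two quarters agree on roughly 184 templates, a bit under a third of what each one found on its own.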
Our slicing is random; real per-node ingestion is not, since node A sees checkout and node B sees auth. Slices split along service lines share even less content than random ones, so production is likely worse.
A Drain pattern ID is a per-instance handle, not a join column
An alert keyed on patternId survives only in the shuffle case. Which case does production match?
- New pod starts the Drain instance from empty: split.
- Deploy shifts traffic between replicas: split.
- Scale-out routes streams to additional nodes: split.
- Window slides forward: split-like, slowly, as old data ages out and patterns re-learn under new IDs.
- Same long-lived instance keeps ingesting the same workload mix: shuffle, and the alert holds.
A Drain-derived patternId is a per-instance, per-window handle, fine as an in-session grouping key. It is not safe as the join column for an alert, a dashboard, an audit chain, or anything meant to outlive the instance that made it.
What the major vendors actually ship
The vendors' query-time surfaces regenerate the ID on every search.
Datadog Log Patterns runs at query time over a 10,000-log sample. The docs don't name the algorithm; the Logs API exposes no stable ID.
Splunk ships three query-time surfaces: the cluster SPL command (a sequential cluster_label per run), the Patterns tab (resamples on each open), and findtypes (top 10 by default, adjustable, analyzing at most 5,000 events). Its one stable handle is eventtype, a human-authored saved-search rule, not an extraction.
Elastic ships categorize_text, ES|QL CATEGORIZE, and Discover Pattern Analysis, all query-time. Elastic replaced the categorize_text algorithm in 8.3.0.
Elastic also ships the one genuine counterexample: an ML Categorization anomaly-detection job with a persisted, restart-surviving ID, the only such feature across the three. The model is checkpointed every three to four hours, so within one running job category_id survives restarts. The limits are the point: category_id is a per-job sequential integer, not portable to a second job, a rebuild, or another cluster. per_partition_categorization determines categories per partition while the ID stays unique only at the job level, a snapshot revert rolls the categorizer back with the rest of the model, and model_memory_limit stops new categories once the job hits the cap.
What actually gives a stable pattern ID?
No clustering parser in the LogPAI benchmark line ships a stability-across-runs guarantee. They number patterns in first-seen order, so two nodes growing the same template still hand out disagreeing integers.
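First-seen numbering is enough on its own to make two nodes disagree, even when they learn exactly the same templates. A minimal sketch, with a hypothetical helper name and made-up templates:

```python
def number_first_seen(templates):
    """Assign 1, 2, 3, ... to templates in the order they first appear."""
    ids = {}
    for t in templates:
        ids.setdefault(t, len(ids) + 1)
    return ids


# Two nodes see the same two templates, in a different order.
node_a = number_first_seen(
    ["user <*> logged in", "disk <*> full", "user <*> logged in"])
node_b = number_first_seen(
    ["disk <*> full", "user <*> logged in"])

print(node_a["disk <*> full"], node_b["disk <*> full"])
```

Both nodes end with the same template set, yet "disk <*> full" is ID 2 on one and ID 1 on the other.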
The fix that gives an instance-portable identity shows up in vendor implementations and industrial folklore more than in the academic survey line. Extract a normalized template string deterministically from each line, then use a content hash of that string as the ID. The normalization can come from any extractor; the hash is what travels. The same line produces the same hash on any node, after any restart, in any environment.
This is the architecture 10x uses for its pattern hash. A hash carries no state, so two machines that never communicate compute the same ID from the same line. That is what "by construction" means: no coordination is required, whereas a checkpointed integer resets at every boundary Elastic lists. The measurements above are what you get without that property.
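The general idea of a content-hash ID can be sketched in a few lines. This is not the 10x implementation: the normalizer below (mask digit runs, collapse whitespace), the choice of SHA-256, and the truncation to 16 hex characters are all assumptions made for the illustration.

```python
import hashlib
import re


def normalize(line: str) -> str:
    """Hypothetical extractor: mask digit runs and collapse whitespace."""
    masked = re.sub(r"\d+", "<*>", line)
    return " ".join(masked.split())


def pattern_id(line: str) -> str:
    """Stateless ID: a hash of the normalized template string."""
    template = normalize(line)
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


# Two lines of the same shape, as two unrelated nodes might see them.
on_node_a = pattern_id("user 42 logged in")
on_node_b = pattern_id("user   7 logged in")
print(on_node_a == on_node_b)
```

The ID is only as stable as the normalizer: every node has to run the same deterministic extraction rules, since any change to them changes the template string and therefore the hash.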
Where to look
A short Drain3 driver reproduces this. Drain3, corpus, and the vendor docs: Datadog Log Patterns, Splunk cluster, Elastic categorize_text, Elastic ML Categorization, Elastic's Drain write-up (2022), 8.3.0 release notes.
For the architecture that avoids this by construction: Why every log line gets a stable identity.