How does a log line get a stable identity?

The engine tokenizes the line, labels each token against a source-derived vocabulary, drops the variable values, and computes a 64-bit FarmHash over the symbols and delimiters. The same log format produces the same hash on every node, restart, and environment.

How much smaller are template-encoded log events?

On the published 215 MB OTel-demo corpus, 133,080 events distill into 2,813 templates and 54.8 MB of encoded events, a 74.5% byte reduction, lossless.

Why every log line gets a stable identity, the way V8 does for objects

Most pattern detection in mature observability platforms runs at query time. Splunk's cluster command, Datadog's Log Patterns view, and Elastic's categorize_text aggregation all work this way. Feed any of them the same log line through two different searches and you can land in two different clusters with two different IDs. That's fine for a dashboard, but a churning ID is useless for tracking a pattern across queries.

Runtimes solved this for objects. V8 gives every object a hidden class, a stable descriptor of its shape, computed once and shared by every same-shaped object for the program's life. We built 10x to do the same for log events. Every event gets a template, the hidden class for a log line.

That's the thesis. A stable ID matters when something downstream has to stay attached to the same pattern tomorrow, like an alert on one log shape or a per-pattern cost budget.

On the 215 MB OTel-demo corpus, 133,080 events collapse to 2,813 templates, a 74.5% byte reduction, and the same line hashes to the same template on every node. The corpus is public; the encoder that produces the hash is the commercial engine, so the determinism evidence a reader can check today is the measured contrast in the Drain post, not an encode-your-own-line run.

What a symbol is, and why structure isn't in the line

A log event's shape isn't declared anywhere. Two events from different log.info(...) calls are different shapes even when they look alike; two with the same skeleton and different runtime values are the same shape.

To tell those apart, the engine needs a symbol: a fixed string it knows came from source code, like a class name or a format string, not from runtime data. That split between structure and value is the whole trick.

The vocabulary of symbols comes from a library. Log10x ships a default library covering 150+ open-source frameworks, built by scanning their public source. An optional compile pass scans your repos and container images and adds your own code. (Where the symbol library actually comes from is its own post.)
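To make "built by scanning source" concrete, here is a minimal sketch of how a vocabulary could be collected from one Java file. It is an assumption-heavy illustration, not how the Log10x library is actually built: the regexes, the scan_source name, and the choice to keep only class names and the words inside string literals are all hypothetical.

```python
import re

# Hypothetical extraction rules: class names and words inside string literals.
CLASS_RE = re.compile(r"\bclass\s+([A-Za-z_]\w*)")
STRING_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')
WORD_RE = re.compile(r"[A-Za-z_]\w*")

def scan_source(source):
    """Return the set of strings that appear literally in the source text."""
    symbols = set(CLASS_RE.findall(source))
    for literal in STRING_RE.findall(source):
        symbols.update(WORD_RE.findall(literal))
    return symbols

JAVA_SNIPPET = '''
public class BillingReconciler {
    void run(int pid) {
        log.info("starting worker pid={}", pid);
    }
}
'''

vocabulary = scan_source(JAVA_SNIPPET)
print(sorted(vocabulary))
```

The point of the sketch is the asymmetry: "starting", "worker", and "pid" land in the vocabulary because they sit in the format string, while whatever number pid holds at runtime never can.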

How a log line becomes a stable hash

At runtime the engine tokenizes the line and labels each token as a known symbol or a variable value. It drops the variable values, then computes a 64-bit FarmHash over the symbols and the delimiters between them: the brackets, quotes, and punctuation that belong to the log format.

The hash input is only source-derived symbols, never runtime values. So the same log format produces the same template hash on every node, every restart, every environment. That hash is the key into the template cache.
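The steps above can be sketched in a few lines. This is a toy under stated assumptions, not the engine: the SYMBOLS set is a hand-written stand-in for the symbol library, the tokenizer is a single regex, and BLAKE2b truncated to 64 bits stands in for FarmHash because Python's standard library has no FarmHash.

```python
import hashlib
import re

# Hypothetical vocabulary standing in for the symbol library.
SYMBOLS = {"info", "starting", "worker", "pid", "ppid"}

# A token is either a run of word characters or a single delimiter character.
TOKEN_RE = re.compile(r"(\w+)|([^\w\s])")

def template_of(line):
    """Keep symbols and delimiters, replace every other token with '$'."""
    kept = []
    for word, delimiter in TOKEN_RE.findall(line):
        if delimiter:
            kept.append(delimiter)   # bracket, quote, punctuation
        elif word in SYMBOLS:
            kept.append(word)        # known source-derived symbol
        else:
            kept.append("$")         # runtime value: dropped from the hash input
    return " ".join(kept)

def template_hash(line):
    """64-bit digest of the template. BLAKE2b is a stand-in for FarmHash."""
    digest = hashlib.blake2b(template_of(line).encode("utf-8"), digest_size=8)
    return digest.hexdigest()

a = "2025-10-02 00:17:22 +0000 [info]: #0 starting fluentd worker pid=18 ppid=6 worker=0"
b = "2025-11-15 09:01:07 +0000 [info]: #3 starting fluentd worker pid=412 ppid=9 worker=3"
print(template_of(a))
print(template_hash(a) == template_hash(b))
```

Because no runtime value reaches the hash input, the two lines differ in every variable field and still share one hash. The toy splits timestamps into many slots where the real template keeps one typed slot; that difference does not affect the determinism argument.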

Figure: an encoded fluentd event broken down. The leading blob is the stable template hash, and each comma-separated value fills one $ slot in the template (timestamp, timezone, worker name, pid, ppid, worker number).

An example, on OTel-demo data

Here's a fluentd startup line from the OTel-demo sample that ships with 10x, abridged to its log message (the JSON envelope and Kubernetes metadata get templated too):

2025-10-02 00:17:22 +0000 [info]: #0 starting fluentd worker pid=18 ppid=6 worker=0

The first time this shape appears, the engine extracts the template ($ marks each variable slot):

$(yyyy-MM-dd HH:mm:ss) +$ [info]: #0 starting $ worker pid=$ ppid=$ worker=$

Every later event becomes its template hash plus the values that filled the slots:

~R}>PZj;Jdp,1759364242000,0000,fluentd,18,6,0,...

The leading blob is the template hash. The rest is the timestamp, timezone, worker name, pid, ppid, and worker number, then the Kubernetes and container metadata. Across the full 215 MB OTel-demo log file, 133,080 events distill into 2,813 unique templates and 54.8 MB of encoded events: a 74.5% byte reduction.
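Going the other way is slot filling. The sketch below rebuilds the abridged line from the template and the first six values. It is not the shipped Java or JavaScript decoder: the fill function, the plain comma split, and the single hard-coded timestamp pattern are assumptions made for illustration, and the trailing metadata fields are left out.

```python
import re
from datetime import datetime, timezone

# A slot is '$', optionally followed by a parenthesized timestamp pattern.
SLOT_RE = re.compile(r"\$(?:\(([^)]*)\))?")

# Only the one pattern from the example is mapped (assumption).
STRFTIME = {"yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S"}

def fill(template, values):
    """Replace each slot, left to right, with the next encoded value."""
    remaining = iter(values)

    def substitute(match):
        value = next(remaining)
        pattern = match.group(1)
        if pattern is None:
            return value
        # Typed slot: the value is treated as epoch milliseconds in UTC.
        moment = datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
        return moment.strftime(STRFTIME[pattern])

    return SLOT_RE.sub(substitute, template)

TEMPLATE = "$(yyyy-MM-dd HH:mm:ss) +$ [info]: #0 starting $ worker pid=$ ppid=$ worker=$"
ENCODED = "~R}>PZj;Jdp,1759364242000,0000,fluentd,18,6,0"  # metadata fields omitted

template_hash, *fields = ENCODED.split(",")
print(fill(TEMPLATE, fields[:6]))
```

The timestamp is a useful sanity check: 1759364242000 milliseconds is exactly 2025-10-02 00:17:22 UTC, so the rebuilt line matches the original character for character. The quoted reduction is just as easy to check by hand: 1 - 54.8 / 215 is about 0.745.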

The reduction is the number for a slide. The determinism makes everything downstream possible. ("Log events have class knowledge" covers what opens up once every event has a stable identity.)

A design choice that runs against instinct

The instinct is to weight toward the symbols that show up most often. I'd argue the opposite, and built the engine to argue it: the more places a symbol appears, the less useful it is for picking the right origin.

Take Logger. In any real Java application it appears in hundreds of files, so it narrows the candidates from "everything" to "everything." Take BillingReconciler: it appears in maybe three places, so seeing it narrows the search dramatically.

So we capped it. The engine's default maxSymbolUnitsPerToken is 128: the most source locations it will consider for any one symbol. We tried raising it later for accuracy. It doesn't help: the extra slots fill with exactly the high-frequency noise the cap was keeping out. I'd have a hard time making a stronger claim about anything else in the engine.
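A toy version of that argument, to show why the rare symbol does the work. Only the default of 128 comes from the engine. The rank_origins function, the inverse-frequency weight, the "first 128 locations" reading of the cap, and the file paths are all hypothetical; the engine's real scoring isn't described here.

```python
from collections import defaultdict

MAX_SYMBOL_UNITS_PER_TOKEN = 128  # the engine default quoted above

def rank_origins(line_symbols, index, cap=MAX_SYMBOL_UNITS_PER_TOKEN):
    """Score candidate source locations for a line.

    index maps a symbol to the list of source locations where it appears.
    Each symbol votes for at most `cap` locations, and a vote weighs less
    the more places the symbol appears (assumed inverse-frequency weight).
    """
    scores = defaultdict(float)
    for symbol in line_symbols:
        locations = index.get(symbol, [])
        for location in locations[:cap]:
            scores[location] += 1.0 / len(locations)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical index: Logger is everywhere, BillingReconciler is in three files.
index = {
    "Logger": ["src/billing/BillingReconciler.java"]
              + [f"src/File{i}.java" for i in range(299)],
    "BillingReconciler": [
        "src/billing/BillingReconciler.java",
        "src/billing/ReconcileJob.java",
        "src/billing/ReconcileTest.java",
    ],
}

ranked = rank_origins(["Logger", "BillingReconciler"], index)
print(ranked[0][0])
```

In this toy, raising the cap from 128 to 1000 adds 172 more Logger-only candidates to the list and leaves the top-ranked origin unchanged, which is the shape of the result described above.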

Where to look

The full 215 MB OTel-demo corpus is published as a release asset on the open-source config repo, with an 8 MB gzip alongside; the templates and encoded events are what the engine generates when you run it on that corpus (the repo's data/ directories are runtime placeholders). The full runtime API reference is at doc.log10x.com. For code that already holds encoded events, standalone Java and JavaScript decoders rebuild the original lines. The claim to hold against all of this: the same line hashes to the same template every time, or the architecture is wrong.

Read more

The Drain pattern ID is a join column you can't trust

Why MCP servers for logs break

We shipped a Rust UDF for ClickHouse, then deleted it: plain SQL was 100x faster

Cutting Elasticsearch log storage without rewriting a single query