Where log structure actually lives

A log line is a template plus some data, and the template does not have to be guessed at ingest. It sits in the code that emitted the line: your repos, open-source libraries, the compiled binary, the Helm chart, the jar in Artifactory.

We built a compiler that extracts the template from all of those and links them into one library, the way a C toolchain links .o files into a .so. The runtime then consults that library instead of inferring structure from the values it saw.

Here is the line that makes the case:

2025-09-29 09:01:23.456 INFO o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8080 (http) with context path ''

Most of this repeats across every Spring Boot restart. The class name TomcatWebServer, the format string Tomcat started on port(s): X (http) with context path Y, and the INFO severity are structural. The timestamp, the 8080, and the context path are data.

A vocabulary is the set of fixed strings the code can print: every class name, format-string literal, and severity label. The runtime checks each token against that set instead of guessing from how often it repeats.
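To make the membership check concrete, here is a minimal sketch. The VOCABULARY set, the classify function, and the whitespace tokenization are all illustrative assumptions, not the engine's actual data structures or tokenizer.

```python
# Hypothetical vocabulary: fixed strings a scanner might extract from code.
VOCABULARY = {
    "INFO", "TomcatWebServer", "Tomcat", "started", "on",
    "port(s):", "(http)", "with", "context", "path",
}


def classify(line: str, vocabulary: set[str]) -> list[tuple[str, str]]:
    """Tag each whitespace-separated token as 'structural' or 'data'."""
    return [
        (token, "structural" if token in vocabulary else "data")
        for token in line.split()
    ]


message = "Tomcat started on port(s): 8080 (http) with context path ''"
for token, kind in classify(message, VOCABULARY):
    print(f"{kind:10} {token}")
```

Nothing here counts occurrences. A token is structural because the code can print it, not because it repeated.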

Frequency guessing breaks on obvious cases. The same 8080 is a TCP port in one service and a row count in another:

Server listening on port 8080
Flushed 8080 rows to the segment store

A frequency guesser sees the same value in both lines, decides the two 8080s are the same field, and merges two unrelated shapes. The vocabulary has to come from a stable place: the code.

Figure: log lines tinted token by token. Structural tokens from your source (class names, format strings, INFO) are blue; runtime data (timestamps, the 8080s) is amber. The same value 8080, a port in one shape and a row count in another, never collides.

The compiler has the same shape as a C build.

  • Pull. Fetch source files, binary artifacts, and Helm chart contents from their stores. Cache by content hash so unchanged inputs are never reprocessed.
  • Scan. Run each input through the right scanner for its format. Output: per-file .10x.json symbol units, the .o of this build.
  • Link. Merge the symbol units into one library, the way ld combines .o files into a .so.
  • Push. Commit the library to a target GitHub repo. The runtime pulls from there at startup.
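The stages above can be sketched in a few lines. This is a toy under stated assumptions: inputs are already in memory (standing in for pull), scan is a stand-in that treats each non-empty line as one symbol rather than a real per-format scanner, the .10x.json unit format is not reproduced, and the push step is left out. Every name is hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Cache key: a digest of the input's bytes, independent of its name."""
    return hashlib.sha256(data).hexdigest()


def scan(data: bytes) -> list[str]:
    """Stand-in scanner: treat each non-empty line as one symbol."""
    lines = data.decode("utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def compile_library(
    inputs: dict[str, bytes], cache: dict[str, list[str]]
) -> tuple[list[str], int]:
    """Scan inputs whose content is not cached, then link all units."""
    scanned = 0
    units = []
    for name in sorted(inputs):
        key = content_hash(inputs[name])
        if key not in cache:  # unchanged content is never rescanned
            cache[key] = scan(inputs[name])
            scanned += 1
        units.append(cache[key])
    # Link: merge every unit into one sorted, de-duplicated library.
    library = sorted({symbol for unit in units for symbol in unit})
    return library, scanned


inputs = {
    "Web.java": b"INFO\nTomcat started on port(s):\n",
    "main.go": b"INFO\nServer listening on port\n",
}
cache: dict[str, list[str]] = {}
library, scanned = compile_library(inputs, cache)
print(scanned, library)
_, rescanned = compile_library(inputs, cache)
print(rescanned)  # second run hits the cache for every input
```

Keying the cache on content rather than file name means a renamed but unchanged file costs nothing on the next build.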

At runtime the vocabulary lets the engine blank the data from each line, leaving one template per shape. The hash is taken over that template, so identical shapes collapse to one for dedup and cost tracking.
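A rough sketch of that blanking step, using the two 8080 lines from earlier. The "$" placeholder, the whitespace tokenization, and the choice of SHA-256 are assumptions made for illustration; the post does not say which hash or tokenizer the engine uses.

```python
import hashlib

# Hypothetical vocabulary covering the two example lines.
VOCABULARY = {
    "Server", "listening", "on", "port",
    "Flushed", "rows", "to", "the", "segment", "store",
}


def template_of(line: str, vocabulary: set[str], placeholder: str = "$") -> str:
    """Keep tokens found in the vocabulary; blank everything else."""
    return " ".join(
        token if token in vocabulary else placeholder
        for token in line.split()
    )


def shape_hash(line: str, vocabulary: set[str]) -> str:
    """Hash the template, not the raw line, so values do not affect it."""
    template = template_of(line, vocabulary)
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


port_a = "Server listening on port 8080"
port_b = "Server listening on port 9090"
rows = "Flushed 8080 rows to the segment store"

print(template_of(port_a, VOCABULARY))  # Server listening on port $
print(template_of(rows, VOCABULARY))    # Flushed $ rows to the segment store
print(shape_hash(port_a, VOCABULARY) == shape_hash(port_b, VOCABULARY))  # True
print(shape_hash(port_a, VOCABULARY) == shape_hash(rows, VOCABULARY))    # False
```

The two port lines collapse to one shape even though their values differ, while the 8080 shared by the port line and the row-count line never pulls those two shapes together.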

Relink is deterministic: the same inputs always produce the same library byte-for-byte. Each shape's hash stays identical across deploys, so dashboards and savings numbers do not reset every release (more on the runtime story).
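One common way to get byte-for-byte determinism is to sort everything and serialize canonically, with no timestamps or insertion-order dependence. The sketch below assumes a made-up unit shape (a dict from source name to symbol list) and JSON output; it illustrates the property, not the actual library format.

```python
import json


def link(units: list[dict[str, list[str]]]) -> bytes:
    """Merge symbol units into one canonical byte string.

    Symbols are de-duplicated and sorted per source, keys are sorted,
    and separators are fixed, so input order cannot change the output.
    """
    merged: dict[str, set[str]] = {}
    for unit in units:
        for source, symbols in unit.items():
            merged.setdefault(source, set()).update(symbols)
    canonical = {source: sorted(symbols) for source, symbols in merged.items()}
    return json.dumps(
        canonical, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")


unit_a = {"Web.java": ["Tomcat", "INFO"]}
unit_b = {"main.go": ["Server"], "Web.java": ["INFO"]}

print(link([unit_a, unit_b]) == link([unit_b, unit_a]))  # True
print(link([unit_a, unit_b]).decode("utf-8"))
```

Because the bytes are stable, anything derived from them, including each shape's hash, is stable across rebuilds with the same inputs.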

Polyglot on two axes

A stack has variety on two axes: many source systems, many formats in each.

Most lines in a microservice come from Spring or the Kafka client, not your code. So the pull module carries an adapter per source: github, docker, helm, artifactory, plus a go.mod resolver that expands Go dependencies into GitHub pulls.
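As an illustration of what a go.mod resolver has to do, here is a simplified sketch that reads require directives and keeps the modules hosted on github.com. It is not the module's actual resolver: it ignores replace and exclude directives, vanity import paths, and major-version suffix handling, and the go.mod content and versions are made up for the example.

```python
def github_pulls(go_mod: str) -> list[str]:
    """Return sorted 'owner/repo@version' strings for github.com requires."""
    pulls = set()
    in_block = False
    for raw in go_mod.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.replace(" ", "") == "require(":
            in_block = True
            continue
        if in_block and line == ")":
            in_block = False
            continue
        if in_block:
            fields = line.split()
        elif line.startswith("require "):
            fields = line.split()[1:]
        else:
            continue  # module, go, replace, and other directives
        if len(fields) < 2:
            continue
        path, version = fields[0], fields[1]
        parts = path.split("/")
        if parts[0] == "github.com" and len(parts) >= 3:
            pulls.add(f"{parts[1]}/{parts[2]}@{version}")
    return sorted(pulls)


# Illustrative go.mod; module names and versions are examples only.
GO_MOD = """\
module example.com/svc

go 1.22

require (
    github.com/segmentio/kafka-go v0.4.47
    go.etcd.io/etcd/client/v3 v3.5.12
    github.com/stretchr/testify v1.9.0 // indirect
)

require github.com/spf13/cobra v1.8.0
"""

for pull in github_pulls(GO_MOD):
    print(pull)
```

The go.etcd.io line is skipped here because its path does not name a GitHub repo directly; a real resolver would need a way to map such paths to their hosting repos.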

Inside any one source the content is heterogeneous. Source code spans seven-plus languages (Java, Go, Python, C#, C++, JavaScript, Scala), each routed to a scanner that knows its grammar: a dedicated parser for Java, Python, and Scala, and one shared ANTLR walker for the rest. Text configs carry structural strings as keys and labels. And then there are compiled binaries.

The executable scanner walks a binary and extracts printable strings, the same idea as the Unix strings command. Most log lines from a Go service live in the binary, not the source, so that scanner is not an edge case. It is also the extension point for any source that emits candidate strings.
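The core idea fits in a few lines: find runs of printable ASCII above a minimum length. This is a sketch in the spirit of strings, not the executable scanner itself; the function name, the ASCII-only range, and the sample bytes are assumptions, and the default of 4 mirrors the default minimum length of GNU strings.

```python
import re


def printable_strings(data: bytes, min_length: int = 4) -> list[str]:
    """Return runs of printable ASCII (0x20-0x7e) at least min_length long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Made-up bytes standing in for a slice of a compiled binary.
blob = (
    b"\x00\x01Server listening on port %d\x00"
    b"\xff\x7fok\x00"
    b"Flushed %d rows to the segment store\x00"
)

for candidate in printable_strings(blob):
    print(candidate)
```

The output is a list of candidate strings, format-string literals included; runs shorter than the minimum, such as the two-byte "ok" above, are dropped as noise.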

Why we ship a precompiled default library

It comes down to overlap. Spring Boot is the same Spring Boot at customer A and customer B, so making every customer recompile the public projects everyone shares is a tax on the common case.

So we ran the pipeline against the projects we see most (Spring Boot, Kafka, Logback, etcd, containerd, ingress-nginx) and bundled the output. The compile step then runs only for application-specific code the default does not cover. My own take: for most stacks I would compile your proprietary code alone and let the default carry the rest.

Most tooling instead pushes the vocabulary problem onto you. Splunk sourcetype field extractions, Datadog log processing pipelines, Elasticsearch ingest grok rules, Cribl and Vector and Logstash transforms: all hand-written. Extracting the vocabulary from where it lives is the third option, and it works only because the compiler reads every source and format in the environment.

The compile pipeline modules are open source at github.com/log-10x/modules/tree/main/pipelines/compile; clone it and run it against your own repos and images. The ANTLR scanner covers language support; hidden log classes covers what attaches to templates.