Stop extracting log fields on every event

Open any Kubernetes control-plane log and you'll see lines like E1023 14:30:45.123456 controller.go:123 connection refused. That E is the severity. There's no word ERROR in the line. klog glues severity to the front of the timestamp as a single character. Grep every event for ERROR and you miss all of them. The error-rate dashboard reads zero while the controller fails.

Here's my claim: the information you need about an event usually isn't in its values. It's in the shape the event came from. A 3-digit number could be an HTTP status code, or it could be max.request.size = 200 in a Kafka config dump. The value 200 can't tell you which. The shape can. So attach the work to the shape, not the value.

JavaScript engines already solved a structurally identical problem. Object shape isn't declared in source, so the engine can't know ahead of time where a property lives, and resolving it by name on every access is too slow. V8 records each object's shape as a hidden class (PyPy's maps and CPython 3.11's specialized attribute access work along similar lines). On first sight of that class at an access site, it finds where the property lives and caches the offset there. Discover once, attach once, every instance inherits.
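To make the inline-cache idea concrete, here is a toy model in Python. It is a sketch of the general technique, not of V8's actual implementation; Shape, Obj, and AccessSite are hypothetical names invented for this illustration.

```python
class Shape:
    """Toy hidden class: maps property names to slot offsets."""
    def __init__(self, names):
        self.offsets = {name: i for i, name in enumerate(names)}


class Obj:
    """Toy object: a shape plus a flat list of slot values."""
    def __init__(self, shape, values):
        self.shape = shape
        self.slots = list(values)


class AccessSite:
    """One property access site with a single-entry cache."""
    def __init__(self, name):
        self.name = name
        self.cached_shape = None
        self.cached_offset = None
        self.lookups = 0  # how many times the slow path ran

    def get(self, obj):
        if obj.shape is not self.cached_shape:
            # Slow path: resolve the name to an offset, once per shape.
            self.lookups += 1
            self.cached_offset = obj.shape.offsets[self.name]
            self.cached_shape = obj.shape
        # Fast path: same shape as last time, so the cached offset is valid.
        return obj.slots[self.cached_offset]


point = Shape(["x", "y"])
site = AccessSite("y")
values = [site.get(Obj(point, (i, i * 2))) for i in range(1000)]
```

A thousand objects pass through the site, but the name lookup runs only for the first one. The rest reuse the offset because they share the shape.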

A template is a log line with its variable parts blanked out

We apply the same architecture to log events. A template is the log line with its changing parts blanked out. E$TIME controller.go:$N connection refused is one template, and every line of that shape matches it. The fixed words that survive are its structural tokens, and each blanked position is a slot. Knowledge attached to the template is inherited by every matching event.
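A minimal sketch of blanking out the variable parts, assuming a deliberately crude rule: any run of digits (with embedded colons or dots) is a slot. Real template extraction is more careful than this; the templatize function and the SLOT pattern are illustrative assumptions, not the project's code.

```python
import re

# Toy rule (an assumption): a slot is a run that starts with a digit and
# continues through digits, colons, and dots, e.g. 1023 or 14:30:45.123456.
SLOT = re.compile(r"\d[\d:.]*")


def templatize(line):
    """Return (template, slot_values) for one log line."""
    slots = SLOT.findall(line)
    template = SLOT.sub("$", line)
    return template, slots


first = templatize("E1023 14:30:45.123456 controller.go:123 connection refused")
second = templatize("E1024 09:01:02.000001 controller.go:123 connection refused")
```

Both lines reduce to the same template string, while their slot values differ. Anything keyed on the template is therefore shared by both events.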

The klog template reads the E/W/I/F prefix once and tags itself ERROR. Every matching line inherits that severity.
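Sketched in Python, under the same toy assumption that digit runs are the only variable parts: the classifier runs once per template, and every later line of that shape gets the cached answer. The names classify, severity, and severity_by_template are hypothetical.

```python
import re

SLOT = re.compile(r"\d[\d:.]*")  # toy slot rule (an assumption)
KLOG_SEVERITY = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}

classify_calls = 0
severity_by_template = {}


def classify(template):
    """Runs once per template: read the leading klog severity letter."""
    global classify_calls
    classify_calls += 1
    match = re.match(r"([IWEF])\$ ", template)
    return KLOG_SEVERITY[match.group(1)] if match else None


def severity(line):
    """Per-event path: one dictionary lookup after the first sighting."""
    template = SLOT.sub("$", line)
    if template not in severity_by_template:
        severity_by_template[template] = classify(template)
    return severity_by_template[template]


lines = [
    "E1023 14:30:45.123456 controller.go:123 connection refused",
    "E1023 14:30:46.000321 controller.go:123 connection refused",
    "E1024 02:11:09.456789 controller.go:123 connection refused",
]
grep_hits = [line for line in lines if "ERROR" in line]
severities = [severity(line) for line in lines]
```

Here grep_hits is empty, which is the dashboard that reads zero. All three lines still come back as ERROR, and classify ran a single time.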

HTTP codes are where the slot has to be found, not just read. Take a template whose structural tokens include status, GET, HTTP/, Completed. On first sight of the shape, it works out which slot holds the status code, guarded by a rule that a real HTTP marker sits within ±5 tokens of the candidate. Every event after that is a slot read and a range check, not a regex sweeping the line.
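A rough sketch of that bind-once, read-many flow. The marker list and the ±5 window come from the paragraph above; everything else is an assumption for illustration, including the whitespace tokenizer, the NUMBERISH slot rule, the sample lines, and the function names. It is not the project's implementation.

```python
import re

NUMBERISH = re.compile(r"\d+(?:\.\d+)?[a-z]*")  # toy slot rule: 200, 35ms, 1.5s
HTTP_MARKERS = ("status", "GET", "HTTP/", "Completed")
WINDOW = 5  # a marker must sit within this many tokens of the candidate


def template_of(tokens):
    """Blank the number-like tokens; the rest are structural tokens."""
    return tuple("$" if NUMBERISH.fullmatch(t) else t for t in tokens)


def bind_http_slot(template, sample):
    """Runs once per template: index of the status-code slot, or None."""
    for i, tok in enumerate(template):
        if tok != "$" or not re.fullmatch(r"[1-5]\d\d", sample[i]):
            continue
        lo, hi = max(0, i - WINDOW), i + WINDOW + 1
        neighbours = [t for t in template[lo:hi] if t != "$"]
        if any(t.startswith(m) for t in neighbours for m in HTTP_MARKERS):
            return i
    return None


bindings = {}


def http_code(line):
    """Per-event path: a slot read and a range check."""
    tokens = line.split()
    template = template_of(tokens)
    if template not in bindings:
        bindings[template] = bind_http_slot(template, tokens)
    slot = bindings[template]
    if slot is None:
        return None
    value = tokens[slot]
    if value.isdigit() and 100 <= int(value) <= 599:
        return int(value)
    return None


spring_ok = http_code("Completed 200 OK in 35ms")
spring_err = http_code("Completed 503 OK in 120ms")
kafka = http_code("max.request.size = 200")
```

The two Completed lines share a template, so the second one reuses the binding found on the first. The max.request.size line has no marker near its number, so its template is bound to None and its 200 is never read as a status code. One simplification to note: this sketch binds from the first sample only, so a template whose first event carries an unusual value in that position would stay unbound.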

The Kafka template has no HTTP words among its structural tokens, so it never gets an HTTP binding. Its max.request.size = 200 is safe. Not a heuristic that usually works. A shape that can't ask the question.

The same value 200 in two shapes. A Spring line with the HTTP marker "Completed" beside the number binds http_code = 200; a Kafka "max.request.size = 200" line has no HTTP marker near it, so 200 stays an ordinary value.

The standard stack does this work per event, when it should be done once per shape. Splunk sourcetype extractions, Datadog log pipelines, Elasticsearch grok rules: each wants a regex per format, kept current as the code changes. The deeper reason is that they have no stable address to pin the answer to. A grouping that drifts between queries forces the work back onto the raw value, which can't separate the Kafka 200 from an HTTP 200.

The open question I find more interesting: how much per-event work should move to the shape. Cost attribution has a per-shape answer: bill the template, not the line. So does retention: decide once whether a shape is worth keeping.
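As one possible reading of "bill the template", here is a small sketch that totals bytes per shape and then makes a keep-or-drop decision once per shape. The digit-run slot rule, the sample lines, and the 100-byte budget are all assumptions made up for the example.

```python
import re
from collections import Counter

SLOT = re.compile(r"\d[\d:.]*")  # toy slot rule (an assumption)


def bytes_by_template(lines):
    """Total UTF-8 bytes per template, so cost is charged to the shape."""
    totals = Counter()
    for line in lines:
        totals[SLOT.sub("$", line)] += len(line.encode("utf-8"))
    return totals


lines = [
    "E1023 14:30:45.123456 controller.go:123 connection refused",
    "E1023 14:30:46.000321 controller.go:123 connection refused",
    "I1023 14:30:47.000001 cache.go:88 synced",
]
totals = bytes_by_template(lines)

# Retention decided once per shape, against a hypothetical byte budget.
BUDGET = 100
keep = {template: total <= BUDGET for template, total in totals.items()}
```

The bill has two entries rather than three, and the retention decision is a lookup on the template rather than a rule evaluated against each line.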

The runtime classifier modules are open source at github.com/log-10x/modules/tree/main/pipelines/run/modules/initialize; the HTTP config and its strict-marker adjacency check are at pipelines/run/initialize/httpCode/config.yaml.


Related: how a line gets assigned to a stable template and where the structural vocabulary comes from. A future post covers how templates also shrink the Bloom filters that route queries across S3-stored event archives.