Stop extracting log fields on every event
Open any Kubernetes control-plane log and you'll see lines like E1023 14:30:45.123456 controller.go:123 connection refused. That E is the severity. The word ERROR never appears in the line. klog writes severity as a single character fused to the front of the date, so E1023 means severity E on October 23. Grep every event for ERROR and you miss all of them. The error-rate dashboard reads zero while the controller fails.
Here's my claim: the information you need about an event usually isn't in its values. It's in the shape the event came from. A 3-digit number could be an HTTP status code, or it could be max.request.size = 200 in a Kafka config dump. The value 200 can't tell you which. The shape can. So attach the work to the shape, not the value.
JavaScript engines already solved a structurally identical problem. An object's shape isn't declared in source, so the engine can't know ahead of time where a property lives, and resolving it by name on every access is slow. V8 records each object's shape as a hidden class (PyPy's maps work the same way, and CPython 3.11's specializing interpreter caches attribute lookups in a related way). On first sight of a class at an access site, the engine finds where the property lives and caches the offset there. Discover once, attach once, every instance inherits.
A template is a log line with its variable parts blanked out
We apply the same architecture to log events. A template is the log line with its changing parts blanked out. E$TIME controller.go:$N connection refused is one template, and every line of that shape matches it. The fixed words that survive are its structural tokens, and each blanked position is a slot. Knowledge attached to the template is inherited by every matching event.
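To make the vocabulary concrete, here is a minimal sketch of blanking a line into a template. It is not the actual templating algorithm: the two regex rules and the names to_template and _MASKS are assumptions made up for illustration, and a real templater would draw on a much richer structural vocabulary.

```python
import re

# Hypothetical masking rules, applied in order. They are illustrative only.
_MASKS = [
    # klog-style date and time that follow the one-letter severity prefix
    (re.compile(r"(?<=[IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d{6}"), "$TIME"),
    # any remaining standalone run of digits
    (re.compile(r"\b\d+\b"), "$N"),
]


def to_template(line):
    """Return (template, slots): the line with variable parts blanked,
    plus the blanked values in the order the rules removed them."""
    slots = []

    def blank(placeholder):
        def replace(match):
            slots.append(match.group(0))
            return placeholder
        return replace

    for pattern, placeholder in _MASKS:
        line = pattern.sub(blank(placeholder), line)
    return line, slots


template, slots = to_template(
    "E1023 14:30:45.123456 controller.go:123 connection refused"
)
print(template)
print(slots)
```

Two lines that differ only in their timestamps and numbers come out as the same template string, which is what makes the template usable as a key for everything attached later.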
The klog template reads the E/W/I/F prefix once and tags itself ERROR. Every matching line inherits that severity.
HTTP codes are where the slot has to be found, not just read. Take a template whose structural tokens include status, GET, HTTP/, Completed. On first sight of the shape, the classifier works out which slot holds the status code, guarded by a rule that a real HTTP marker sits within ±5 tokens of the candidate. Every event after that is a slot read and a range check, not a regex sweeping the line.
The Kafka template has no HTTP words among its structural tokens, so it never gets an HTTP binding. Its max.request.size = 200 is safe. Not a heuristic that usually works. A shape that can't ask the question.
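The binding step can be sketched like this. It is my illustration, not the code in the linked modules. The marker list and the ±5 window come from the text above. The whitespace tokenization, the $N and $PATH slot names, the 100 to 599 range check, and the closest-marker-wins rule for picking among several candidates are all assumptions.

```python
HTTP_MARKERS = ("status", "GET", "HTTP/", "Completed")  # markers named in the post
WINDOW = 5  # a marker must sit within this many tokens of the candidate slot


def bind_http_slot(template):
    """Run once per template. Return the index (among the template's slots)
    of the slot that holds the status code, or None if nothing qualifies."""
    tokens = template.split()
    marker_positions = [
        i for i, tok in enumerate(tokens) if tok.startswith(HTTP_MARKERS)
    ]
    if not marker_positions:
        return None  # no HTTP vocabulary, so the shape never gets a binding
    best = None  # (distance to nearest marker, slot index)
    slot_index = -1
    for i, tok in enumerate(tokens):
        if not tok.startswith("$"):
            continue
        slot_index += 1
        if tok != "$N":
            continue  # only numeric slots are candidates
        distance = min(abs(i - p) for p in marker_positions)
        if distance <= WINDOW and (best is None or distance < best[0]):
            best = (distance, slot_index)
    return None if best is None else best[1]


_bindings = {}  # hypothetical cache: template -> slot index or None


def status_code(template, slots):
    """Per event: one cached lookup, one slot read, one range check."""
    if template not in _bindings:
        _bindings[template] = bind_http_slot(template)
    index = _bindings[template]
    if index is None:
        return None
    value = slots[index]
    if value.isdigit() and 100 <= int(value) <= 599:
        return int(value)
    return None


http_template = "GET $PATH HTTP/1.1 Completed status $N in $N ms"
print(status_code(http_template, ["/api/users", "200", "12"]))
print(status_code("max.request.size = $N", ["200"]))
```

The Kafka-style template returns None before its value is ever looked at, because bind_http_slot finds no marker tokens to anchor a binding.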
The standard stack does this extraction on every event, when it should be done once per shape. Splunk sourcetype extractions, Datadog log pipelines, Elasticsearch grok patterns: each wants a regex per format, kept current as the code changes. The deeper reason is that they have no stable address to pin the answer to. A grouping that drifts between queries forces the work back onto the raw value, which can't separate the Kafka 200 from an HTTP 200.
The open question I find more interesting: how much per-event work should move to the shape. Cost attribution has a per-shape answer: bill the template, not the line. So does retention: decide once whether a shape is worth keeping.
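As a rough sketch of what "bill the template" and "decide once" could look like, assuming events arrive as (template, raw line) pairs: the function names, the RETAIN table, and its entries are hypothetical, and a real policy would come from configuration rather than a literal dictionary.

```python
from collections import Counter


def bytes_by_template(events):
    """Attribute ingest volume to templates instead of individual lines.
    `events` is an iterable of (template, raw_line) pairs."""
    totals = Counter()
    for template, raw_line in events:
        totals[template] += len(raw_line.encode("utf-8"))
    return totals


# Hypothetical retention policy, decided once per template.
RETAIN = {
    "E$TIME controller.go:$N connection refused": True,
    "I$TIME healthz.go:$N ok": False,
}


def should_keep(template, default=True):
    """Per event this is a single lookup; unknown shapes fall back to default."""
    return RETAIN.get(template, default)


events = [
    ("I$TIME healthz.go:$N ok", "I1023 14:30:45.000001 healthz.go:42 ok"),
    ("I$TIME healthz.go:$N ok", "I1023 14:30:46.000001 healthz.go:42 ok"),
    ("E$TIME controller.go:$N connection refused",
     "E1023 14:30:45.123456 controller.go:123 connection refused"),
]
for template, size in bytes_by_template(events).most_common():
    print(size, should_keep(template), template)
```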
The runtime classifier modules are open source at github.com/log-10x/modules/tree/main/pipelines/run/modules/initialize; the HTTP config and its strict-marker adjacency check are at pipelines/run/initialize/httpCode/config.yaml.
Related: how a line gets assigned to a stable template and where the structural vocabulary comes from. A future post covers how templates also shrink the Bloom filters that route queries across S3-stored event archives.