Cutting Elasticsearch log storage without rewriting a single query


A Kibana search for "accounting" returns zero hits. The word is right there in your logs, the query is correct, and you get nothing. The fix is L1ES, a native Elasticsearch plugin that teaches the cluster to read encoded log events in place. Existing queries work unchanged, against a modeled 50 to 80% reduction in stored volume, with nothing lost.

We encode each log event the way databases have deduplicated repeated strings for decades. Recognize the format string behind every log.info(...) call as a template, store it once, and store each event as the template's hash plus its residue: the tokens that change from line to line. Those changing tokens are the template's variables.

The math is the point. A million events cost one template plus a million small residues instead of a million full lines.
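To make the shape of that encoding concrete, here is a minimal sketch. It is not the L1ES format: the class and method names are made up for illustration, String.hashCode() stands in for the real template hash, hash collisions are ignored, and a space-separated "$" marks each variable slot.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: class names, String.hashCode() as the hash, and the
// space-separated "$" placeholder layout are assumptions, not the L1ES format.
final class TemplateStoreSketch {
    /** An encoded event: the template's hash plus the values that change per line. */
    record Encoded(int templateHash, List<String> residue) {}

    // Each template's text is stored once, keyed by its hash.
    private final Map<Integer, String> templates = new HashMap<>();

    Encoded encode(String template, List<String> variables) {
        int hash = template.hashCode();        // stand-in for a stable template hash
        templates.putIfAbsent(hash, template); // stored once, however many events share it
        return new Encoded(hash, List.copyOf(variables));
    }

    String decode(Encoded event) {
        List<String> out = new ArrayList<>();
        int next = 0;
        for (String token : templates.get(event.templateHash()).split(" ")) {
            // "$" marks a variable slot; fill it with the next residue value.
            out.add(token.equals("$") ? event.residue().get(next++) : token);
        }
        return String.join(" ", out);
    }

    int templateCount() {
        return templates.size();
    }
}
```

Encoding two events from the same template leaves one stored template and two short residue lists, and decoding either one reproduces the original line.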

The catch is that the encoding splits each event across two indices, template in one and residue in the other, and Elasticsearch has no idea they are the same document.

There were two ways out. Give the storage savings back by expanding every event before it lands, or teach Elasticsearch to see through the encoding without changing how anyone queries. We took the second. L1ES ships under Apache 2.0 at github.com/log-10x/elasticsearch-plugin; the encoder that writes the two indices, the Receiver, is the commercial half of the pair. The rest of this post is the decisions behind the plugin.

One thing I am taking on faith: the template hash is stable across queries, restarts, and customer environments. We get that by deriving the vocabulary from the source environment at compile time rather than inferring templates from runtime data the way Drain does; inferred template identities drift. That extraction is a separate story, linked below. This post is the query side only.

Three places to intercept; we picked the lowest

Transparent decoding can live in one of three layers. A proxy is too shallow and expanding at ingestion defeats the purpose, so we load into the Elasticsearch JVM.

A query proxy in front of the cluster. It receives the JSON query, rewrites it, decodes the response. Easy to build, but a proxy sees the REST API, not the Lucene query plan. It can rewrite JSON, but it can't inject a TwoPhaseIterator or read SortedDocValues columns from a segment.

Expand every event during ingestion. Index the full event, store the encoded form alongside, and search works normally. This defeats the point: the events were encoded to cut the storage bill.

A native plugin in the Elasticsearch JVM, intercepting at the Lucene layer. It reaches the scorer and leaves the ingestion path untouched. The engineering cost is higher because you are inside someone else's JVM, but it is the only option of the three that keeps both the storage savings and the existing queries.

How the rewrite swaps query nodes

Before any query reaches Lucene, L1esQueryRewriteFilter (an ActionFilter on every search action) hands the query tree to L1esQueryRewriter. The rewriter walks the tree and swaps each text-matching node for an L1ES-aware version: MatchQueryBuilder becomes L1esMatchQueryBuilder. Compound nodes like BoolQueryBuilder recurse into their children; term, range, and wildcard pass through untouched. (OpenSearch 2.19.0 registers this as a search pipeline, but the rewriting is shared.)
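The walk itself is an ordinary recursive tree rewrite. A minimal sketch, using small hypothetical record types in place of the real Elasticsearch builders (MatchNode, L1esMatchNode, TermNode, and BoolNode are stand-ins invented for this example, not plugin or Elasticsearch classes):

```java
import java.util.List;

// Hypothetical stand-ins for the Elasticsearch query builders.
interface QueryNode {}
record MatchNode(String field, String text) implements QueryNode {}
record L1esMatchNode(String field, String text) implements QueryNode {}
record TermNode(String field, String value) implements QueryNode {}
record BoolNode(List<QueryNode> must, List<QueryNode> should) implements QueryNode {}

final class RewriteSketch {
    static QueryNode rewrite(QueryNode node) {
        if (node instanceof MatchNode m) {
            // Text-matching node: swap for the encoding-aware version.
            return new L1esMatchNode(m.field(), m.text());
        }
        if (node instanceof BoolNode b) {
            // Compound node: rebuild it with rewritten children.
            return new BoolNode(rewriteAll(b.must()), rewriteAll(b.should()));
        }
        // Everything else (term, range, wildcard) passes through untouched.
        return node;
    }

    private static List<QueryNode> rewriteAll(List<QueryNode> children) {
        return children.stream().map(RewriteSketch::rewrite).toList();
    }
}
```

A bool query holding a match and a term comes back with only the match swapped; the term node is returned as the same value it went in as.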

Every rewritten node sorts the candidate templates into three states, and one rule generates all three. A query token can be satisfied two ways: by a word in the template's fixed text (its static tokens), or by one of the template's variable values. So a template is ruled out only when it has no variables AND its static tokens are missing a query token.

MUST_MATCH: every query token is already in the static tokens. CANNOT_MATCH: no variables and at least one query token missing. MIGHT_MATCH: a token is missing from the static tokens but the template has variables that could hold it. That last state makes the partition exhaustive, and it is the only one that needs the per-document check.

Two yes/no tests decide a template's state: are all query tokens in its static text, and does it have variables. All tokens present means MUST_MATCH either way; otherwise CANNOT_MATCH when there are no variables, or MIGHT_MATCH, the only state that checks a document, when a variable could hold the missing token.
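Those two tests fit in a few lines. A minimal sketch, assuming AND semantics over the query tokens; the enum and method names are illustrative and not taken from the plugin source:

```java
import java.util.Set;

enum MatchState { MUST_MATCH, CANNOT_MATCH, MIGHT_MATCH }

final class ClassifySketch {
    /**
     * Classifies one template against the query tokens.
     * staticTokens: words in the template's fixed text.
     * hasVariables: whether the template has any variable slots.
     */
    static MatchState classify(Set<String> staticTokens, boolean hasVariables, Set<String> queryTokens) {
        // Test 1: every query token is already in the fixed text.
        if (staticTokens.containsAll(queryTokens)) {
            return MatchState.MUST_MATCH;
        }
        // Test 2: a token is missing; only a variable could still hold it.
        return hasVariables ? MatchState.MIGHT_MATCH : MatchState.CANNOT_MATCH;
    }
}
```

Only the MIGHT_MATCH branch hands work on to the per-document check; the other two are decided from the template alone.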

Why l1x_tid is optional

L1esMatchQuery checks one thing per query: does the data index carry a keyword field, l1x_tid, holding each document's template hash?

If yes, the fast path fires. The plugin holds an in-memory snapshot of the templates (token index, bigram index, per-template static-token set, has-variables flag), rebuilt every five minutes, so every lookup is a hash-map read with no Elasticsearch round-trip.

If no, the slow path searches the internal l1es_dml template index directly. (DML, the Data Matching Library, maps encoded events back to their templates.) For an AND query it runs an OR-search and an AND-search; the OR set minus the AND set is exactly the partial candidates that need checking.
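The set arithmetic can be sketched without a cluster. The example below simulates the two searches over an in-memory map of template id to static tokens; the map, the ids, and the method name are assumptions made for illustration, and the real slow path issues actual searches against l1es_dml instead.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SlowPathSketch {
    /** Templates matching at least one query token (OR) minus those matching all of them (AND). */
    static Set<String> partialCandidates(Map<String, Set<String>> staticTokensByTemplate,
                                         Set<String> queryTokens) {
        Set<String> orHits = new TreeSet<>();
        Set<String> andHits = new TreeSet<>();
        for (var entry : staticTokensByTemplate.entrySet()) {
            Set<String> tokens = entry.getValue();
            if (!Collections.disjoint(tokens, queryTokens)) {
                orHits.add(entry.getKey());   // simulated OR-search hit
            }
            if (tokens.containsAll(queryTokens)) {
                andHits.add(entry.getKey());  // simulated AND-search hit
            }
        }
        orHits.removeAll(andHits);            // what remains matched only partially
        return orHits;
    }
}
```

For the query tokens "logged" and "in", a template containing both lands in the AND set and needs no check, a template containing only "logged" is a partial candidate, and a template containing neither never appears.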

We could have made l1x_tid mandatory and deleted the slow path. Every ingestion pipeline we control adds the field anyway. I chose not to. A plugin that refuses to work until you change your index mapping sits in a backlog for six months waiting on a reindex. Any existing index with encoded data can install the plugin and search today on the slow path; add l1x_tid when latency matters, and the fast path kicks in on the next search with no restart.

Making the per-document check cheap

For a phrase like "logged in", we want to know whether those two words sit next to each other inside a template without touching a document. That is the job of the bigram index: it keys pairs of consecutive static tokens, so intersecting the phrase's bigrams against the candidates classifies templates as MUST_MATCH. When a phrase might cross a variable boundary, PhrasePredicate walks the token list with $ placeholders as wildcards.
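Both pieces can be sketched over plain token lists. This is an illustration under stated assumptions, not the plugin's PhrasePredicate: templates are pre-tokenized lists, "$" marks a variable slot, and each variable is assumed to hold exactly one token, which a real variable need not.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PhraseSketch {
    /** Maps "tokenA tokenB" to the ids of templates where those two static tokens are adjacent. */
    static Map<String, Set<Integer>> buildBigramIndex(Map<Integer, List<String>> templates) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (var entry : templates.entrySet()) {
            List<String> t = entry.getValue();
            for (int i = 0; i + 1 < t.size(); i++) {
                if (t.get(i).equals("$") || t.get(i + 1).equals("$")) {
                    continue; // only pairs of static tokens are indexed
                }
                index.computeIfAbsent(t.get(i) + " " + t.get(i + 1), k -> new HashSet<>())
                     .add(entry.getKey());
            }
        }
        return index;
    }

    /** True if the phrase can line up somewhere in the template, with "$" matching any one token. */
    static boolean mightContainPhrase(List<String> templateTokens, List<String> phrase) {
        for (int start = 0; start + phrase.size() <= templateTokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                String token = templateTokens.get(start + i);
                ok = token.equals("$") || token.equals(phrase.get(i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

A template whose fixed text contains "logged in" is found through the index alone. A template like "logged $ in" is absent from the index under that key, but the wildcard walk still reports that the phrase might occur there, so its documents go on to the per-document check.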

For the MIGHT_MATCH documents that survive, L1esScorerWrapper wraps the inner Lucene scorer with a TwoPhaseIterator: a cheap approximation that over-collects with a TermInSetQuery, then an exact check.

Lucene already assigns every distinct template hash a small integer id, an ordinal, per index segment. The naive check looks the hash up through SortedDocValues.lookupOrd(), calls utf8ToString(), then HashMap.get() against the classification table: three string operations per document, repeated for the same string thousands of times in one traversal.

We cache at the ordinal instead, in a byte[] sized to the segment's ordinal count: 0 unclassified, 1 MUST_MATCH, 2 CANNOT_MATCH, 3 MIGHT_MATCH. The first document with a given ordinal does the full lookup and writes one byte; every later one is a bounds-checked array read.
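A minimal sketch of that cache, kept free of Lucene so it runs on its own. The IntFunction stands in for SortedDocValues.lookupOrd() plus utf8ToString(), the class and field names are invented for the example, and treating a hash missing from the classification table as MIGHT_MATCH is an assumption of this sketch, not something the plugin is known to do.

```java
import java.util.Map;
import java.util.function.IntFunction;

final class OrdinalCacheSketch {
    static final byte UNCLASSIFIED = 0, MUST_MATCH = 1, CANNOT_MATCH = 2, MIGHT_MATCH = 3;

    private final byte[] stateByOrdinal;            // one byte per ordinal in the segment
    private final IntFunction<String> lookupOrd;    // stand-in for lookupOrd + utf8ToString
    private final Map<String, Byte> classification; // template hash -> state
    int slowLookups = 0;                            // counts full lookups, for demonstration

    OrdinalCacheSketch(int ordinalCount, IntFunction<String> lookupOrd,
                       Map<String, Byte> classification) {
        this.stateByOrdinal = new byte[ordinalCount]; // zero-filled, so all UNCLASSIFIED
        this.lookupOrd = lookupOrd;
        this.classification = classification;
    }

    byte stateFor(int ordinal) {
        byte cached = stateByOrdinal[ordinal];
        if (cached != UNCLASSIFIED) {
            return cached;                           // every later document: one array read
        }
        slowLookups++;
        String hash = lookupOrd.apply(ordinal);      // first document with this ordinal pays
        // Assumption: an unknown hash is kept for the per-document check.
        byte state = classification.getOrDefault(hash, MIGHT_MATCH);
        stateByOrdinal[ordinal] = state;
        return state;
    }
}
```

Asking for the same ordinal a thousand times performs the string lookup once; the counter only exists here to make that visible.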

Over-matching is acceptable for log search; silent drops are not. Any exception during the check returns true and keeps the document.
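The two phases and the fail-open rule can be shown together. This sketch filters a list of strings rather than wrapping a Lucene scorer, so it only mirrors the control flow; the class name and the use of Predicate for each phase are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class TwoPhaseSketch {
    /** Cheap approximation first, exact check second; a failing exact check keeps the document. */
    static List<String> filter(List<String> docs,
                               Predicate<String> approximation,
                               Predicate<String> exactCheck) {
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            if (!approximation.test(doc)) {
                continue;                   // phase one: cheap, allowed to over-collect
            }
            boolean keep;
            try {
                keep = exactCheck.test(doc); // phase two: exact, runs only on candidates
            } catch (Exception e) {
                keep = true;                 // fail open: over-match rather than drop silently
            }
            if (keep) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

With a working exact check, only true matches survive. With an exact check that throws, every candidate that passed the approximation is kept, which is the over-match the design accepts.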

What L1ES doesn't handle

Single-token OR queries match at the template level and skip the two-phase scorer. Searching for "error" returns every event from any template whose fixed text contains "error", even when that particular event is about something else. These are over-matches, not wrong content. Multi-token AND and phrase queries stay precise; the per-document check verifies decoded content.

Query types that bypass the rewriter (term, wildcard, regexp) see encoded text directly. KQL generates match and match_phrase, which the rewriter handles, so most Kibana usage hits the rewritten path. After scoring, L1esFetchSubPhase rebuilds the original line into _source so Kibana Discover shows the real log. All of this is toggleable in config/l1es.yml. To decode the format outside Elasticsearch, there are standalone Java and JavaScript decoders.

L1ES targets ES 8.17.0, OpenSearch 2.19.0, Java 17+, Lucene 9.12.0. The source is Apache 2.0 at github.com/log-10x/elasticsearch-plugin; start with `L1esMatchQuery.java`. The call I keep coming back to is the optional-l1x_tid choice and the complexity it bought. Worth it, I think. Read the source and tell me I'm wrong.


Related: why the template hash stays stable, what else attaches to a template, where the symbol library comes from, the Splunk-side equivalent, and the same idea on ClickHouse in plain SQL.