We built one polyglot log scanner, then wrote three native ones anyway
Production source is polyglot; five-plus languages is normal. To extract a symbol vocabulary from it, you write one scanner per language or one polyglot scanner. We built the polyglot one, then wrote three native scanners anyway. This post is why.
The vocabulary is what those scanners produce: identifiers and literal strings pulled from code (class names, severity levels, the text inside log.info(...) calls), each tagged with where it lives. That tag is what tells two identically-shaped lines apart; a separate post on where the vocabulary comes from makes that case in full. If line A came from a method whose source contains the literal connection reset and line B did not, the tag attributes each line to the code path that could have emitted it.
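As a rough illustration of how a location tag can separate two lines of the same shape, the sketch below models vocabulary entries as small records and attributes a log line to every origin whose literal appears in it. The Symbol type, its field names, the origin strings, and the attribute function are all hypothetical; this is not the engine's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    text: str     # identifier or literal pulled from source
    context: str  # hypothetical label, e.g. "method_invoke"
    origin: str   # hypothetical location tag, e.g. "Class.method"


def attribute(line: str, vocabulary: list[Symbol]) -> set[str]:
    """Return the origins whose literal text appears in the log line."""
    return {s.origin for s in vocabulary if s.text in line}


# Hypothetical vocabulary: two literals, each tagged with the method it lives in.
vocabulary = [
    Symbol("connection reset", "method_invoke", "HttpClient.retry"),
    Symbol("cache miss", "method_invoke", "CacheLayer.get"),
]

line_a = "2024-01-01 WARN connection reset by peer"
line_b = "2024-01-01 WARN upstream timed out"

print(attribute(line_a, vocabulary))
print(attribute(line_b, vocabulary))
```

Both lines share a timestamp, a level, and a free-text tail, so shape alone cannot separate them. Only the first contains a literal the vocabulary knows, so only it gets an origin.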
One tree walker, one YAML per language
ANTLR is a parser generator: feed it a grammar and it turns source files into a parse tree. We walk that tree and grab the text at node types we care about. The shipped configs live at config/pipelines/compile/scanners/antlr/, one per language. The Java config, abridged:
# config/pipelines/compile/scanners/antlr/java.yaml
antlr:
  lang: java
  parserClass: com.log10x.antlr.generated.java.Java9Parser
  rootRule: compilationUnit
  rule:
    - name: normalClassDeclaration
      context: class
      recursive: false
      capture: allSymbols
    - name: methodInvocation
      context: method_invoke
      recursive: true
      capture: literalsOnly

Each rule pairs a grammar rule name with a context and a capture mode. Context is the symbol's label (CLASS, ENUM, METHOD_INVOKE). Capture mode picks the tokens: literalsOnly takes quoted strings, allSymbols also takes identifiers. The recursive flag sets the depth: false stops at the matched node and grabs the class name, true descends and takes every literal inside the invocation.
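To make the rule semantics concrete, here is a minimal sketch of a rule-driven walker over a toy tree. It is not the engine's walker and uses no ANTLR. The Node and Rule classes, the leaf kinds (KEYWORD, IDENT, STRING), and the sample tree are assumptions made up for illustration. Only the rule fields (name, context, recursive, capture) come from the config above.

```python
from dataclasses import dataclass, field

TOKEN_KINDS = {"KEYWORD", "IDENT", "STRING"}  # hypothetical leaf kinds
CAPTURE_KINDS = {
    "literalsOnly": {"STRING"},
    "allSymbols": {"STRING", "IDENT"},
}


@dataclass
class Node:
    rule: str                     # grammar rule name, or a token kind for leaves
    text: str = ""                # token text (leaves only)
    children: list["Node"] = field(default_factory=list)


@dataclass
class Rule:
    name: str
    context: str
    recursive: bool
    capture: str


def leaves(node, recursive):
    """Yield leaf tokens: direct children only, or the whole subtree."""
    for child in node.children:
        if child.rule in TOKEN_KINDS:
            yield child
        elif recursive:
            yield from leaves(child, True)


def walk(node, rules, out):
    """Depth-first walk; capture tokens wherever a rule matches the node."""
    rule = rules.get(node.rule)
    if rule:
        for leaf in leaves(node, rule.recursive):
            if leaf.rule in CAPTURE_KINDS[rule.capture]:
                out.append((rule.context, leaf.text.strip('"')))
    for child in node.children:
        walk(child, rules, out)
    return out


rules = {r.name: r for r in [
    Rule("normalClassDeclaration", "class", False, "allSymbols"),
    Rule("methodInvocation", "method_invoke", True, "literalsOnly"),
]}

# Toy tree for: class HttpClient { ... log.warn("connection reset") ... }
tree = Node("compilationUnit", children=[
    Node("normalClassDeclaration", children=[
        Node("KEYWORD", "class"),
        Node("IDENT", "HttpClient"),
        Node("classBody", children=[
            Node("methodInvocation", children=[
                Node("IDENT", "log"),
                Node("IDENT", "warn"),
                Node("arguments", children=[Node("STRING", '"connection reset"')]),
            ]),
        ]),
    ]),
])

symbols = walk(tree, rules, [])
print(symbols)
```

The non-recursive class rule sees only the declaration's direct tokens, so it yields the class name and nothing from the body. The recursive invocation rule descends into arguments but keeps only the quoted string, because literalsOnly filters out log and warn.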
The Go config is identical in shape: a different parser class and rule names over the same taxonomy and walker. Adding a language means a YAML file and a grammar, no Java code.
Rules can also span multiple AST nodes. Python enums are the clean example: you want member names only from classes extending Enum. A tag/ifTag pair in python.yaml gates the rule on a matching parent.
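The tag/ifTag syntax itself is not reproduced here. As a hedged illustration of the same gating idea, the sketch below uses Python's standard ast module to collect member names only from classes whose bases include Enum. The helper names and the sample source are made up for the example, and matching on the base name alone is a simplification (it would miss an aliased import).

```python
import ast

SOURCE = '''
from enum import Enum

class Severity(Enum):
    INFO = 1
    ERROR = 2

class Config:
    RETRIES = 3
'''


def extends_enum(cls: ast.ClassDef) -> bool:
    """True if a base is spelled Enum or something.Enum (name match only)."""
    for base in cls.bases:
        if isinstance(base, ast.Name) and base.id == "Enum":
            return True
        if isinstance(base, ast.Attribute) and base.attr == "Enum":
            return True
    return False


def enum_members(source: str) -> list[str]:
    """Collect assignment targets, gated on the parent class extending Enum."""
    members = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and extends_enum(node):
            for stmt in node.body:
                if isinstance(stmt, ast.Assign):
                    members += [t.id for t in stmt.targets if isinstance(t, ast.Name)]
    return members


print(enum_members(SOURCE))
```

Config.RETRIES has the same syntactic shape as an enum member, which is why the parent check is needed. Without the gate, every class-level constant would be labeled as an enum value.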
Why three languages got a dedicated scanner
Alongside the generic ANTLR scanner, three languages get a dedicated scanner that calls their native parser: Java has javaParser (the Java compiler API), Python has pythonAST (the built-in ast module), Scala has scalameta. When both exist for a language, the dedicated scanner wins unless it is disabled.
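The precedence rule is small enough to write down. This is a sketch of the selection logic only. The scanner names come from the paragraph above, but the pick_scanner function and the idea of passing disabled scanners as a set are assumptions, not the engine's configuration surface.

```python
# Dedicated scanners named in this post, keyed by language.
DEDICATED = {"java": "javaParser", "python": "pythonAST", "scala": "scalameta"}


def pick_scanner(lang: str, disabled: frozenset[str] = frozenset()) -> str:
    """Prefer the dedicated scanner; otherwise fall back to the generic one."""
    dedicated = DEDICATED.get(lang)
    if dedicated and dedicated not in disabled:
        return dedicated
    return "antlr"


print(pick_scanner("java"))
print(pick_scanner("java", frozenset({"javaParser"})))
print(pick_scanner("go"))
```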
We built the ANTLR scanner first, for Java. It worked until we ran it against a production-scale codebase. ANTLR does real work per file: lexer pass, parser pass, tree walk, scope tracking, tag resolution. That compounds at scale, and compile times grew from minutes to hours.
So we wrote javaParser. The Java compiler API is faster because the JDK has spent twenty years optimizing it for the workload. It also resolves types and follows imports, so it knows logger is an SLF4J instance, not a string variable.
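The javaParser code is not shown in this post. As a loose analogue in Python, whose standard library exposes the interpreter's own parser as ast, the sketch below pulls string literals out of logging calls. The function name, the LOG_METHODS set, and the default receiver name log are assumptions for illustration. It matches the receiver by name only, which is exactly the guess that type resolution removes in the Java case.

```python
import ast

LOG_METHODS = {"debug", "info", "warning", "error", "critical"}


def log_literals(source: str, receiver: str = "log") -> list[str]:
    """Return string literals passed positionally to <receiver>.<level>(...)."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in LOG_METHODS
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == receiver):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    found.append(arg.value)
    return found


# The source is only parsed, never executed, so undefined names are harmless.
SOURCE = '''
log.info("connection reset")
message.info("not a logger")
log.error("retry %d failed", attempt)
'''

print(log_literals(SOURCE))
```

The message.info call is skipped only because its receiver is spelled differently. A variable named log that is not a logger would slip through, so name matching is a heuristic and resolved types are not.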
I'd make the same call again: if a language has a usable compiler API, use it. Performance first, semantic precision second. ANTLR carries Go, C++, C#, and JavaScript from one iterateTree() routine plus a folder of YAML, and stays a fallback when a native build fails.
TypeScript is the open one. Instead of hand-writing its YAML or building a tsc scanner, we're handing the schema and grammar to a frontier model to see if it writes the config itself. That is a later post.
The scanner configs live in the open-source config repository, alongside the C++ and Scala grammars; the other languages' ANTLR parsers ship pre-generated in the engine. Docs are at doc.log10x.com/compile/scanner/antlr. Related: where the symbol library actually comes from, why every log line gets a stable identity, and stop extracting log fields per event.