is grep enough?

When a coding agent explores a large, unfamiliar codebase, are basic text tools (grep/read) enough, or does it need something fast and light (structural, tree-sitter) or something more authoritative and accurate (semantic, LSP)? This page measures the three side by side and lets you check the answer yourself.

Arms: baseline (text), grove (structural), lsp (semantic). Color = arm identity, never quality. n=1 per cell.

What this shows

The same exploration task, given to an agent with three rungs of navigation power — plain text search (baseline), fast-light structural (grove), and authoritative semantic (lsp) — across a task-complexity ladder. The question is where on that ladder the extra power stops paying for itself. Everything below links to the raw run that produced it.

Filter

Coverage & integrity

Run status per cell — three segments are baseline · grove · lsp. Filled = harvested, empty = pending, hatched = blocked/DNF. Click a cell to inspect; click a repo or rung to filter. This is the honest state of completeness, shown before any chart.

Metrics

Five metric families, each a row of small charts — one per rung, the three arms side by side within each. Small-multiples, not one busy chart: every panel keeps the same honest y-axis from zero, with one dot per repo and a tick at the per-rung median. Filter above to focus a rung or repo.

Legend: baseline, grove, lsp. Dot = one repo · tick = per-rung median · n=1 per cell.
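As a rough sketch of how the median tick in each panel could be derived: group one metric's values by rung and arm, then take the median across repos. The record keys (`rung`, `arm`, `repo`, `value`) are hypothetical and are not taken from the feed's actual schema.

```python
from collections import defaultdict
from statistics import median


def per_rung_medians(cells):
    """Return {(rung, arm): median value across repos} for one metric.

    `cells` is an iterable of dicts with hypothetical keys
    "rung", "arm", "repo" and "value" (one dict per harvested cell).
    """
    grouped = defaultdict(list)
    for cell in cells:
        grouped[(cell["rung"], cell["arm"])].append(cell["value"])
    # One dot per repo goes on the chart; the tick is the median of those dots.
    return {key: median(values) for key, values in grouped.items()}
```

With n=1 per cell, each group holds at most one value per repo, so the tick summarizes repos, not repeated runs.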

Free compare (secondary)

Pick any two harvested cells — e.g. grove on L2 vs L3 redis — to inspect how an arm scales with task complexity, or how two arms differ on the same task. Aligned metrics on top, the two transcripts in parallel below.

Methodology & provenance

How the numbers above were produced, and what protects their fairness. Everything here is a standing claim you can check against the evidence linked throughout the page.

The genesis wall

Prompts and their reference answer keys are generated offline, before any arm runs. A running arm sees only the bare prompt — never the reference key, the rationale, or the pinned source under experiment/repos/. Judging reads the keys; running never does. The keys are judge-only and appear on this page only as post-hoc key revisions in a cell's detail, never as the answer itself.

Blind judging

Each cell's three answers are scrubbed to A / B / C with the arm→letter mapping withheld. They are graded against the reference key on grounding (do the cited file:line anchors resolve in pinned source?) and completeness (does the answer cover the key's required spine?), and un-blinded only to record the score. Where the key itself was wrong, it is corrected in place and the correction is shown with its cite: proof that the grader corrected itself, not the arms.
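The blinding step can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual judging code: the function names are hypothetical, and the only things taken from the description above are the three arms, the A / B / C labels, and the withheld mapping.

```python
import random


def blind(answers, rng):
    """Shuffle arms onto letters A/B/C.

    `answers` maps arm name -> answer text. Returns (blinded, mapping):
    `blinded` maps letter -> answer text and is all the judge sees;
    `mapping` maps letter -> arm and is withheld until scores are recorded.
    """
    arms = list(answers)
    rng.shuffle(arms)
    mapping = dict(zip(["A", "B", "C"], arms))
    blinded = {letter: answers[arm] for letter, arm in mapping.items()}
    return blinded, mapping


def unblind(scores, mapping):
    """Translate per-letter scores back to per-arm scores."""
    return {mapping[letter]: score for letter, score in scores.items()}
```

A caller would pass `random.Random()` (or a seeded instance for repeatability), hand only `blinded` to the judge, and call `unblind` once the scores are final.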

Cite-link verification

Every file:line in a transcript links to the GitHub blob at that repo's pinned SHA. The build doesn't just link them — it re-resolves them against the pinned source: a cite is confirmed when its file is located (exact path, or a unique basename match in the tree) and the line is within the file. The result across all harvested cells:
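The confirmation rule above can be sketched as a small function. This is a minimal illustration, not the build's actual code: it assumes the pinned tree is available as a mapping from repo-relative path to line count, and the names `resolve_cite` and `tree` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# A cite has the form "path/to/file.ext:123".
CITE_RE = re.compile(r"^(?P<path>[^:\s]+):(?P<line>\d+)$")


def resolve_cite(cite, tree):
    """Return the resolved path if the cite is confirmed, else None.

    `tree` maps each repo-relative path at the pinned SHA to its line count.
    """
    match = CITE_RE.match(cite)
    if match is None:
        return None
    path, line = match["path"], int(match["line"])
    if path not in tree:
        # Fall back to a basename match, accepted only when it is unique.
        name = PurePosixPath(path).name
        candidates = [p for p in tree if PurePosixPath(p).name == name]
        if len(candidates) != 1:
            return None  # file missing, or basename is ambiguous
        path = candidates[0]
    # The cited line must fall within the file.
    return path if 1 <= line <= tree[path] else None
```

Rejecting ambiguous basenames keeps a cite from being confirmed against the wrong file when two directories contain a file of the same name.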

Pricing

Every dollar figure on this page is the billed total_cost_usd reported by the run itself, not a recomputed list-price estimate. The table below is the public list price for the model used, shown so a reader can sanity-check the billed figures against the token split. n=1 per cell; cost is a direction, not a benchmark.
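A reader doing that sanity check might compute a list-price estimate from the token split and compare it with the billed figure. The sketch below assumes prices are quoted in USD per million tokens and that the token split is keyed by token class; the key names and any prices passed in are placeholders, not values from this page.

```python
def list_price_estimate(tokens, usd_per_mtok):
    """Estimate cost from a token split and per-million-token list prices.

    Both arguments are dicts keyed by token class (hypothetical names such
    as "input" and "output"); prices are USD per million tokens.
    """
    return sum(count * usd_per_mtok[kind] / 1_000_000
               for kind, count in tokens.items())


def relative_gap(billed_usd, estimate_usd):
    """Signed gap between billed and estimated cost, as a fraction of the estimate."""
    return (billed_usd - estimate_usd) / estimate_usd
```

A large gap does not by itself mean the billed total_cost_usd is wrong; it may only mean the token split omits a class (for example cached tokens) that is priced differently.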

Data sources

The feed is a pure function of committed evidence, synthesized by site/build.mjs. The cell ledger is written only through the validated statectl CLI — never hand-edited. These are the exact files behind the current view:

Reproduce this page

The feed is deterministic: re-running build.mjs against the same evidence reproduces site/data/ byte-for-byte, apart from the stamped SHA and timestamp. Rebuild and serve locally, then diff against what's published:
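One way to do that comparison while ignoring the stamp is sketched below. It assumes the files under site/data/ are JSON and that the stamp lives in fields named `sha` and `timestamp`; both are assumptions for illustration, so adjust the key names to whatever the feed actually uses. Note that it compares parsed content, which is looser than a byte-for-byte diff.

```python
import json

# Hypothetical names for the stamped fields that legitimately differ per build.
STAMP_KEYS = {"sha", "timestamp"}


def strip_stamps(obj):
    """Recursively drop stamp fields from parsed JSON."""
    if isinstance(obj, dict):
        return {k: strip_stamps(v) for k, v in obj.items() if k not in STAMP_KEYS}
    if isinstance(obj, list):
        return [strip_stamps(v) for v in obj]
    return obj


def feeds_match(local_text, published_text):
    """True when two feed files agree on everything except the stamp."""
    return strip_stamps(json.loads(local_text)) == strip_stamps(json.loads(published_text))
```

Reading a locally rebuilt file and the published copy as text and passing both to `feeds_match` gives a yes/no answer per file.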


    

How to trust this