is grep enough?
When a coding agent explores a large, unfamiliar codebase, are basic text tools (grep/read) enough, or does it need something fast and light (structural, tree-sitter) or something more authoritative and accurate (semantic, LSP)? This page measures the three side by side and lets you check the answer yourself.
What this shows
The same exploration task is given to an agent under three arms of increasing navigation power: plain text search (baseline), fast, light structural navigation (grove), and authoritative semantic navigation (lsp). Each arm runs across a ladder of task complexity, one rung per level. The question is where on that ladder the extra power stops paying for itself. Everything below links to the raw run that produced it.
Filter
Coverage & integrity
Run status per cell. Each cell's three segments are baseline · grove · lsp. Filled = harvested, empty = pending, hatched = blocked/DNF (did not finish). Click a cell to inspect it; click a repo or rung to filter. This is the honest state of completeness, shown before any chart.
Metrics
Five metric families, each shown as a row of small charts: one chart per rung, with the three arms side by side within it. These are small multiples, not one busy chart: every panel keeps the same honest y-axis from zero, with one dot per repo and a tick at the per-rung median. Use the filter above to focus on a rung or repo.
Cell detail
Free compare (secondary)
Pick any two harvested cells (e.g. grove on redis at L2 vs L3) to inspect how an arm scales with task complexity, or how two arms differ on the same task. Aligned metrics appear on top, with the two transcripts in parallel below.
Methodology & provenance
How the numbers above were produced, and what protects their fairness. Everything here is a standing claim you can check against the evidence linked throughout the page.
The genesis wall
Prompts and their reference answer keys are generated offline, before any arm runs. A running arm sees only the bare prompt — never the reference key, the rationale, or the pinned source under experiment/repos/. Judging reads the keys; running never does. The keys are judge-only and appear on this page only as post-hoc key revisions in a cell's detail, never as the answer itself.
Blind judging
Each cell's three answers are scrubbed to A / B / C with the arm→letter mapping withheld. They are graded against the reference key on grounding (do the cited file:line anchors resolve in pinned source?) and completeness (does the answer cover the key's required spine?), and are un-blinded only to record the score. Where the key itself was wrong, it is corrected in place and the correction is shown with its cite: proof that the grader corrected itself rather than the arms.
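The letter assignment described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual judging code: the function names `blind` and `unblind`, the `seed` parameter, and the dictionary shapes are all assumptions, and a real scrub would presumably also remove arm-revealing text from the answers themselves.

```python
import random

LETTERS = ["A", "B", "C"]

def blind(answers, seed):
    """Hide which arm wrote which answer behind the letters A/B/C.

    `answers` maps arm name -> answer text (hypothetical shape).
    Returns (blinded, mapping); only `blinded` would be shown to the judge.
    """
    rng = random.Random(seed)
    arms = sorted(answers)           # start from a stable order
    rng.shuffle(arms)                # then permute it
    mapping = dict(zip(LETTERS, arms))   # letter -> arm, withheld while judging
    blinded = {letter: answers[arm] for letter, arm in mapping.items()}
    return blinded, mapping

def unblind(scores, mapping):
    """Translate letter-keyed scores back to arm names after grading."""
    return {mapping[letter]: score for letter, score in scores.items()}
```

The judge would see only `blinded`; `mapping` would be consulted once, at the end, to record each score against its arm.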
Cite-link verification
Every file:line in a transcript links to the GitHub blob at that repo's pinned SHA. The build doesn't just link them — it re-resolves them against the pinned source: a cite is confirmed when its file is located (exact path, or a unique basename match in the tree) and the line is within the file. The result across all harvested cells:
—
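As a rough sketch of the confirmation rule described above (exact path, or a unique basename match, plus a line-range check): the function name `confirm_cite`, the `tree` dictionary of path to line count, and the assumption that line numbers are 1-indexed are all illustrative choices, not the build's real interface.

```python
from pathlib import PurePosixPath

def confirm_cite(cite, tree):
    """Return True when a 'file:line' cite resolves in the pinned tree.

    `tree` maps repo-relative path -> number of lines (hypothetical shape).
    """
    path, _, line_text = cite.rpartition(":")
    if not path or not line_text.isdigit():
        return False
    line = int(line_text)

    if path in tree:
        resolved = path                      # exact path match
    else:
        base = PurePosixPath(path).name
        matches = [p for p in tree if PurePosixPath(p).name == base]
        if len(matches) != 1:                # missing or ambiguous basename
            return False
        resolved = matches[0]

    # Assumes 1-indexed line numbers.
    return 1 <= line <= tree[resolved]
```

Requiring the basename match to be unique means an ambiguous cite is left unconfirmed rather than guessed at.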
Pricing
Every dollar figure on this page is the billed total_cost_usd reported by the run itself — not a recomputed list-price estimate. The table below is the public list price for the model used (—), shown so a reader can sanity-check the billed figures against the token split. n=1 per cell; cost is a direction, not a benchmark.
Data sources
The feed is a pure function of committed evidence, synthesized by site/build.mjs. The cell ledger is written only through the validated statectl CLI — never hand-edited. These are the exact files behind the current view:
Reproduce this page
The feed is deterministic — re-running build.mjs against the same evidence reproduces site/data/ byte-for-byte (only the stamped SHA/timestamp differ). Rebuild and serve locally, then diff against what's published:
How to trust this
- Every number is recomputed from a run's own stream-json; click any transcript to read the full reasoning trail and the raw evidence path.
- The engagement line per arm proves it used its capability (bash/grove/lsp > 0), a fairness gate, not an assertion.
- Answer quality was blind-judged (A/B/C, mapping withheld); where the reference key was wrong it is corrected in-place (key revisions), shown in the cell detail.
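One way to picture the engagement gate from the list above is the sketch below. It assumes each run's tool calls have already been reduced to a flat list of tool-family names; that shape, the `CAPABILITY` table, and both function names are hypothetical and not taken from the run format.

```python
from collections import Counter

# Hypothetical: the tool family each arm is expected to exercise.
CAPABILITY = {"baseline": "bash", "grove": "grove", "lsp": "lsp"}

def engagement_line(tool_calls):
    """Count calls per tool family, reported in bash/grove/lsp order."""
    counts = Counter(tool_calls)
    return {tool: counts[tool] for tool in ("bash", "grove", "lsp")}

def passes_gate(arm, tool_calls):
    """An arm passes only if it called its own capability at least once."""
    return engagement_line(tool_calls)[CAPABILITY[arm]] > 0
```

Under this reading, an lsp arm that answered using only bash would fail the gate, so its result could not be counted as evidence about LSP.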