JUNE 15, 2026

11 MIN READ

FLORIAN VALEYE

Architecture, Altertable Agent, Performance

Know What a Query Reads Before It Runs

Row estimates only hint at what an agent will read. Bytes pin it down. Our explain returns them from catalog metadata, before the engine touches a file.

Share

Every query has a footprint. On a lakehouse, that footprint is bytes, the data the engine must read to answer it. Bytes aren't the whole bill, but they drive it: every byte scanned drags compute and network with it.

An agent will run any query you let it run, and it can't see what that query reads before it runs. The engine might prune the read to a sliver or scan the whole table, so a 200 GB query and a 40 MB one look identical going in.

Row estimates hint at what you'll read, but they don't pin it down. Pruning and projection can swing the read by orders of magnitude, and rows see neither. Before an agent runs anything, what it needs to hear is "this scans 12 GB," not "this returns about 10 million rows."

The fix is a contract, bytes before rows: every query tells you what it will read, in advance. Seen that early, the number becomes a choice. You tighten a filter, drop a column, or decide the question isn't worth its scan.

This read estimate is live today, both as a tool agents call and in the SQL editor.

The Altertable SQL editor showing the read estimate below a statement before it runs

EXPLAIN tells you shape, not size

EXPLAIN is the preview most SQL engines offer: prefix a statement with it and, in most of them, you get back the plan the engine would execute, without running anything. DuckDB is one of them.

The plan is operators and cardinalities. Operators tell you the shape of the work: scan, filter, hash join, project. Cardinalities tell you a row guess, which is almost always wrong on a lake (stale stats, skewed distributions, correlated columns).

Watch it do exactly that, on an events table in DuckLake (the open lakehouse format we build on), id running 1 to 3000:

EXPLAIN SELECT id FROM events WHERE id > 2000;

┌───────────────────────────┐
│       DUCKLAKE_SCAN       │
│    ────────────────────   │
│       Table: events       │
│      Projections: id      │
│      Filters: id>2000     │
│                           │
│         ~600 rows         │
└───────────────────────────┘

The plan sees the projection and the filter, then guesses ~600 rows where the true answer is 1,000, off by 40% on a one-column predicate over data the engine itself just wrote. And nothing in the box says how many bytes it will read.

What you read is the compressed bytes the engine must fetch: from object storage on a cold read, or from cache on a warm one. Caching changes what those bytes cost, not how many there are, and the count is the part you can know in advance.

Two queries with identical row estimates can differ by three orders of magnitude in bytes, depending on which columns they touch and which files survive pruning: one narrow column in one surviving file of many is a thousandth of the read.

The disconnect is measured, not theoretical. Recent work on LLM-written SQL finds correct queries varying severalfold in the bytes they read, wall-clock time a poor proxy for the read, and models that reason reading markedly fewer bytes.

For a human running an ad-hoc query, this gap is an annoyance: you notice the wait, kill the statement, rewrite. For an agent firing tool calls in a loop, it multiplies. Agents don't run one query, they run thousands, and a single badly shaped one against a wide table can out-read everything else they did that day.

At that scale, an agent that writes SQL has to know a query's impact before running it, and adapt.

It can't do that by feel, because an agent has no clock, only what its tools tell it. Guessing doesn't close the gap either: even after training to predict their own budget, agents land inside their own estimate less than half the time.

So we changed what the tools tell it: the bytes, not a guess at rows.

An EXPLAIN that answers in bytes

Our POST /explain, served by the Altertable Rust-based API and exposed to agents as the explain_sql MCP tool, takes a SQL statement and returns, before the engine reads a single row:

total_files and total_bytes per table (the worst case).
scanned_files_estimate and scanned_bytes_estimate after filter and column pruning (the predicted case).
Normalized filter strings, so the agent can reason about what's being applied, from range comparisons to transform-applied filters like year(ts) = 2024.
The parsed plan, optionally, for callers that want the tree.

The shape is the contract: worst case next to predicted case, per table and in total, plus the filter it pruned on. For SELECT id FROM events WHERE id > 2000, the events entry in the response reads:

1
2
3
4
5
6
7
8
{
  "table_name": "events",
  "filters": "id > 2000",
  "total_files": 3,
  "total_bytes": 1872102,
  "scanned_files_estimate": 1,
  "scanned_bytes_estimate": 4044
}

One file of three survives, and only the id column is read from it: 4 KB out of 1.9 MB, about a 460th of the worst case.

The mechanism is small. We run DuckDB's EXPLAIN, walk the plan for every DUCKLAKE_SCAN, and lift three things off each scan:

the table
the filters
the projected columns

Then we ask DuckLake's metadata, not the data, what those three things imply.

Predicting what a query reads is itself a query. It runs against the catalog, and this is where DuckLake's design does the heavy lifting: the catalog is a regular SQL database, Postgres in our case, so the metadata lives in indexed tables, not in manifest files on object storage.

File counts and sizes come from ducklake_data_file. Per-column bytes and the min/max that decide which files survive a predicate come from ducklake_file_column_stats, and partition pruning from ducklake_file_partition_value.

Every input is already there from previous writes. No Parquet footer opens, no object-storage range read fires.

And it stays cheap when an agent leans on it. The catalog is sized by file count, not row count: ducklake_file_column_stats holds one row per file per column, a few megabytes against the terabytes they describe. A point query against those indexed tables is fast enough that the SQL editor reruns it as you type, and a dozen estimates cost less than one accidental cold scan.

None of the parts are new

We didn't invent this. We were inspired by it. BigQuery's dry run has shown the bytes a query will process, from metadata, for years. Trino's EXPLAIN (TYPE IO) reports an estimated size per table before a query runs. Engines have pruned lakes from per-file statistics for as long as lakes have had statistics, and DuckLake's own specification spells out the exact file-pruning query.

What none of them hands back is the answer in the shape a caller needs. BigQuery's number lives inside an engine you don't control. Trino's outputSizeInBytes is the optimizer's logical estimate, the in-memory size it uses to choose a plan, and it degrades to nothing when a connector has no stats.

DuckLake documents the pruning as internal transparency and stops there. The closest anyone comes in the DuckDB world is reading file skips back out of EXPLAIN ANALYZE, which means running the query first and counting files, not bytes.

What we return is narrower, and we haven't found it packaged this way elsewhere: the compressed on-disk bytes of the projected columns in the files that survive pruning, computed as one query against open catalog tables, before the engine reads anything.

Filters drop files, projections drop columns

That pruning works is table stakes. The interesting property is that the two kinds multiply: a filter drops whole files whose min/max can't match, a projection drops columns inside every surviving file.

A worked example on a twenty-column events table, 3,000 rows across three checkpointed files holding id ranges 1–1000, 1001–2000, and 2001–3000:

Query	Files read (of 3)	Bytes read
`SELECT * FROM events`	3	1.9 MB (baseline)
`SELECT * FROM events WHERE id > 2000`	1	624 KB (a third)
`SELECT id FROM events`	3	12 KB (a 150th)
`SELECT id FROM events WHERE id > 2000`	1	4 KB (a 460th)

File counts are exact: id > 2000 provably leaves only the 2001–3000 file. The byte figures come from one DuckLake build and shift with compression and version, so read the ratios: a filter cuts the table to a third, projecting the narrow id column out of twenty wider ones cuts it to a 150th, and the two together reach a 460th.

Two edges make this foresight rather than guesswork. A predicate that reads nothing returns nothing: WHERE id = 5000 prunes every file, and the endpoint reports nothing to scan. A predicate that buys nothing is named as such: WHERE id > 0 matches every row, so the endpoint drops the filter from the estimate rather than pretending it pruned. The agent learns which filters change the read and which are theater.

Preview, compare, warn

Three behaviors open up the moment what a query will read is a structured field the agent can check.

Previewing the read. Before running the query, the agent calls explain_sql and decides. A small read runs straight through. A large one prompts a rewrite: tighter filter, fewer columns, or a different question.

Picking between equivalent rewrites. The same answer can be reached by several queries, and an LLM will happily draft three shapes of it. With an explain_sql per draft, the agent picks the one that reads the fewest bytes. Optimization moves from "trust the optimizer" to "compare what each one reads," and since the tool accepts up to ten statements per call, a batch of rewrites is one round trip.

Warning the human before the wait. If the estimate crosses a threshold, the agent surfaces it. "This is going to scan ~40 GB and may take a minute. Continue?" That's something an agent should ask.

None of this is specific to one host. The endpoint is HTTP, and the read estimate is a property of the lakehouse, not of who's asking.

What it still can't predict

Honest accounting:

Joins fan out unpredictably. The estimate covers the scan inputs, not the join output.
Aggregations collapse rows. They cut output, not scan volume.
Late materialization recovers bytes. The estimate stays conservative, not wrong.
Fallbacks assume the full read. Anything the stats can't follow collapses to total bytes, so uncertainty becomes a conservative estimate, not a false sense of precision.

The estimate is the scan, known in advance. Everything above it is the engine's job, and dynamic pruning keeps tightening the bound during execution.

One failure mode matters: an over-count is an upper bound, not a verdict. An agent that treats every worst case as a hard no will skip queries it could have run for almost nothing.

A loose estimate is a prompt to tighten the predicate or probe with a small LIMIT, not a reason to walk away.

Ask for permission, not forgiveness

There is an economic reason surfaces like this are rare. Some platforms bill by data scanned, and the rest bill by the compute time the scan occupies. Either way the naive, over-reading query is the profitable one, and a read estimate works against the meter.

Our incentive points the other way: this lakehouse is AI-native, agents are its heaviest users, and the estimate exists to make them faster. An agent that sees the read up front writes quicker, cheaper queries, and the savings land with the person paying.

What it changes is the posture, for the person and the agent alike. A person at the editor rewrites before the wait. An agent in a loop checks between drafts, thousands of times a day. Different scales, same default: whoever runs queries blind asks for forgiveness. Whoever checks the read first asks for permission.

Knowing what a question will read before you ask it isn't optimization: it's judgment.

Share

Florian Valeye

Staff Backend Engineer

Data platform expert with deep expertise in distributed systems and modern data infrastructure. Delta Lake committer and contributor, previously scaled data pipelines at Back Market.