One of the fastest ways to lose trust in a lakehouse is a query that looks simple but behaves catastrophically at scale.
```sql
SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 10;
```
On a traditional database, this feels trivial.
On a large Parquet-backed lake, it can mean scanning and sorting billions of rows just to return ten.
This is exactly the problem tackled by DuckLake PR #668, which introduces Top-N dynamic filter file pruning. It's a small-sounding change with outsized implications for interactive analytics on lakes.
Let's unpack what's going on.
Why ORDER BY + LIMIT Is Hard on a Lake
In a lakehouse, data is split across many Parquet files. Each file may contain:
- Millions of rows
- A bounded range of values per column (via min/max statistics)
- No global ordering guarantees
Without pruning, a Top-N query forces the engine to:
- Read all files
- Extract the relevant column
- Sort all rows
- Keep only the top N
That's correct... but deeply inefficient.
The real issue isn't sorting.
It's not knowing which files can be skipped before we even start.
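To make that cost concrete, here is what the brute-force version looks like when driven by hand with pyarrow; the `events/*.parquet` path and the `timestamp` column are illustrative, not anything DuckLake prescribes.

```python
# A deliberately naive Top-10: read every file in full, sort everything,
# keep ten rows. Paths and column names are illustrative.
import glob

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Full scan of every file, all rows pulled into memory.
tables = [pq.read_table(path) for path in glob.glob("events/*.parquet")]
events = pa.concat_tables(tables)

# A global sort of every row, just to keep ten of them.
order = pc.sort_indices(events, sort_keys=[("timestamp", "descending")])
top10 = events.take(order[:10])
print(top10)
```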
The Key Insight: File Statistics Are Already Telling Us the Answer
Parquet files store min/max statistics per column.
For a timestamp column, that means every file already knows:
- The earliest timestamp it contains
- The latest timestamp it contains
If you're asking for:
“the 10 most recent events”
then any file whose maximum timestamp is older than your current Top-N threshold is irrelevant.
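These ranges are directly inspectable. A rough sketch with pyarrow, assuming a flat schema and an illustrative `events/*.parquet` layout, reads each file's timestamp range from the footer alone:

```python
# Read each file's timestamp range straight from the Parquet footer,
# without touching any row data. Flat schema and paths are assumptions.
import glob

import pyarrow.parquet as pq

def timestamp_range(path: str, column: str = "timestamp"):
    pf = pq.ParquetFile(path)
    col_idx = pf.schema_arrow.get_field_index(column)
    meta = pf.metadata
    mins, maxs = [], []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            return None  # no statistics: this file can never be pruned safely
        mins.append(stats.min)
        maxs.append(stats.max)
    return min(mins), max(maxs)

for path in glob.glob("events/*.parquet"):
    print(path, timestamp_range(path))
```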
Before PR #668, DuckLake didn't fully exploit this fact for Top-N queries.
What PR #668 Introduces
PR #668 adds Top-N-aware file pruning based on column statistics.
At a high level, the execution becomes:
1. Detect a Top-N query: `ORDER BY column [ASC|DESC] LIMIT N`
2. Collect per-file min/max statistics (already present in DuckLake metadata)
3. Rank files by relevance:
   - For `ORDER BY timestamp DESC`, files with a higher `max(timestamp)` are more promising
   - For `ORDER BY timestamp ASC`, files with a lower `min(timestamp)` are more promising
4. Scan files incrementally:
   - Start with the most promising files
   - Maintain a dynamic Top-N threshold
   - Prune files whose value ranges can't beat the current worst candidate
This is where the “dynamic filter” part matters:
the pruning boundary tightens as the query runs.
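Here is a minimal sketch of that loop in Python. It is not DuckLake's implementation: `FileStats` and `scan_file` are hypothetical stand-ins, and a real engine would also push the tightening threshold into row-group and zone-map filtering.

```python
# A self-contained sketch of Top-N dynamic file pruning for
# ORDER BY ts DESC LIMIT n. FileStats and scan_file are hypothetical.
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class FileStats:
    name: str
    min_ts: str  # ISO-8601 strings keep comparisons simple in this sketch
    max_ts: str

def top_n_desc(files: List[FileStats],
               scan_file: Callable[[str], Iterable[str]],
               n: int) -> List[str]:
    # Rank files by how promising they are: highest max(ts) first.
    ordered = sorted(files, key=lambda f: f.max_ts, reverse=True)
    heap: List[str] = []  # min-heap holding the best n timestamps so far
    for f in ordered:
        # Dynamic filter: once n candidates are held, any file whose max
        # can't beat the current worst candidate is skipped entirely.
        if len(heap) == n and f.max_ts <= heap[0]:
            continue
        for ts in scan_file(f.name):
            if len(heap) < n:
                heapq.heappush(heap, ts)
            elif ts > heap[0]:  # beats the current cutoff
                heapq.heapreplace(heap, ts)
            # rows at or below heap[0] are simply discarded; the cutoff
            # only ever tightens as the scan proceeds
    return sorted(heap, reverse=True)
```

The heap is the "current worst candidate" from the list above; every comparison against `heap[0]` is the dynamic filter doing its job.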
A Concrete Example
Imagine three files:
| File | min(timestamp) | max(timestamp) |
|---|---|---|
| A | 2023-01-01 | 2023-01-31 |
| B | 2024-01-01 | 2024-01-31 |
| C | 2025-01-01 | 2025-01-10 |
For `ORDER BY timestamp DESC LIMIT 10`, DuckLake can now:
- Start with File C: it has the highest `max(timestamp)`, so it's most likely to contain winners
- Scan File C: suppose it yields only 8 rows, with the 8th-best at `2025-01-03`
- Move to File B: its `max(timestamp)` of `2024-01-31` could still contribute
- Scan File B partially: after 2 more rows, we have our Top-10; the new cutoff is `2024-01-29`
- Prune the remaining rows in File B: any row older than `2024-01-29` can't improve the result
- Skip File A entirely: its `max(timestamp)` of `2023-01-31` can't beat `2024-01-29`
The result: less I/O, less CPU, lower latency, lower costs.
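Plugging the example's statistics into that final skip test shows how little work is needed to rule File A out; the dates below come straight from the table above, and nothing is read from disk.

```python
# Reproduce the final pruning decision from the walkthrough: with a
# Top-10 cutoff of 2024-01-29, File A's footer statistics alone rule it out.
files = {
    "A": ("2023-01-01", "2023-01-31"),
    "B": ("2024-01-01", "2024-01-31"),
    "C": ("2025-01-01", "2025-01-10"),
}
cutoff = "2024-01-29"  # the 10th-best timestamp found so far

for name, (min_ts, max_ts) in files.items():
    decision = "skip" if max_ts <= cutoff else "scan"
    print(f"File {name}: max(timestamp)={max_ts} -> {decision}")
# Files B and C had to be (at least partially) scanned; File A is skipped.
```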
Why This Matters Architecturally
The interesting thing about dynamic pruning isn't the speedup: it's the shift from plan-time decisions to runtime adaptation.
Traditional query optimizers behave like compilers: before execution, a cost-based optimizer (CBO) enumerates and scores many candidate plans (join orders, access paths, join algorithms) using statistics, then commits to a chosen strategy. If the stats are wrong (cardinality estimation is hard), you can end up locked into the wrong shape of work. Dynamic pruning inverts this. The execution learns as it runs: each batch of rows tightens the bound, which prunes more files, which changes what gets scanned next.
This has non-obvious implications:
Cost models become less fragile. Static optimizers guess wrong constantly: stale stats, skewed distributions, correlated columns. Dynamic pruning sidesteps this by measuring rather than estimating. The first file you scan tells you exactly what threshold you're working with.
Data layout suddenly matters a lot. If your files are time-partitioned and your queries are time-ordered, pruning is almost free. If your data is randomly distributed, you get no benefit. This creates a virtuous cycle: teams that think about file organization get compounding returns.
Exploratory queries become cheap. The old mental model was "exploration = expensive scans, production = optimized pipelines." With aggressive pruning, ad-hoc ORDER BY ... LIMIT queries can be as cheap as point lookups. That changes what you're willing to ask.
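As a rough illustration of the layout point, assuming a pyarrow-based writer and an invented `event_date` partition key, time-partitioned files give each file a naturally tight timestamp range for the pruner to work with:

```python
# Time-partitioned layout: one directory of files per day, so every file's
# min/max(timestamp) spans at most 24 hours. Data and paths are invented.
from datetime import datetime, timedelta

import pyarrow as pa
import pyarrow.parquet as pq

timestamps = [datetime(2025, 1, 1) + timedelta(minutes=i) for i in range(10_000)]
events = pa.table({
    "timestamp": timestamps,
    "event_date": [ts.date() for ts in timestamps],  # coarse partition key
})

# Each partition's files cover a narrow range, which is what pruning feeds on.
pq.write_to_dataset(events, root_path="events_partitioned",
                    partition_cols=["event_date"])
```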
The Quiet Revolution
Remember the query we started with?
```sql
SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 10;
```
It still looks trivial.
But now, on a well-organized lake, it behaves trivially too.
That's the real win. Not a new keyword. Not a config flag. Just a query that finally does what you expected all along.
PR #668 is the kind of work that doesn't make headlines, but it's exactly what moves lakehouses from "warehouse alternative" to "warehouse replacement".
The gap is closing. One pruned file at a time.





