One of the fastest ways to lose trust in a lakehouse is a query that looks simple but behaves catastrophically at scale.
```sql
SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 10;
```
On a traditional database, this feels trivial.
On a large Parquet-backed lake, it can mean scanning and sorting billions of rows just to return ten.
This is exactly the problem tackled by DuckLake PR #668, which introduces Top-N dynamic filter file pruning. It's a small-sounding change with outsized implications for interactive analytics on lakes.
Let's unpack what's going on.
Why ORDER BY + LIMIT Is Hard on a Lake
In a lakehouse, data is split across many Parquet files. Each file may contain:
- Millions of rows
- A bounded range of values per column (via min/max statistics)
- No global ordering guarantees
Without pruning, a Top-N query forces the engine to:
- Read all files
- Extract the relevant column
- Sort all rows
- Keep only the top N
That's correct... but deeply inefficient.
The real issue isn't sorting.
It's not knowing which files can be skipped before we even start.
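To make that cost concrete, here is what the brute-force version looks like when driven by hand with pyarrow; the `events/*.parquet` path and the `timestamp` column are illustrative, not anything DuckLake prescribes.

```python
# A deliberately naive Top-10: read every file in full, sort everything,
# keep ten rows. Paths and column names are illustrative.
import glob

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Full scan of every file, all rows pulled into memory.
tables = [pq.read_table(path) for path in glob.glob("events/*.parquet")]
events = pa.concat_tables(tables)

# A global sort of every row, just to keep ten of them.
order = pc.sort_indices(events, sort_keys=[("timestamp", "descending")])
top10 = events.take(order[:10])
print(top10)
```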
The Key Insight: File Statistics Are Already Telling Us the Answer
Parquet files store min/max statistics per column.
For a timestamp column, that means every file already knows:
- The earliest timestamp it contains
- The latest timestamp it contains
If you're asking for:
“the 10 most recent events”
then any file whose maximum timestamp is older than your current Top-N threshold is irrelevant.
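These ranges are directly inspectable. A rough sketch with pyarrow, assuming a flat schema and an illustrative `events/*.parquet` layout, reads each file's timestamp range from the footer alone:

```python
# Read each file's timestamp range straight from the Parquet footer,
# without touching any row data. Flat schema and paths are assumptions.
import glob

import pyarrow.parquet as pq

def timestamp_range(path: str, column: str = "timestamp"):
    pf = pq.ParquetFile(path)
    col_idx = pf.schema_arrow.get_field_index(column)
    meta = pf.metadata
    mins, maxs = [], []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            return None  # no statistics: this file can never be pruned safely
        mins.append(stats.min)
        maxs.append(stats.max)
    return min(mins), max(maxs)

for path in glob.glob("events/*.parquet"):
    print(path, timestamp_range(path))
```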
Before PR #668, DuckLake didn't fully exploit this fact for Top-N queries.
What PR #668 Introduces
PR #668 adds Top-N-aware file pruning based on column statistics.
At a high level, the execution becomes:
1. Detect a Top-N query: `ORDER BY column [ASC|DESC] LIMIT N`
2. Collect per-file min/max statistics (already present in DuckLake metadata)
3. Rank files by relevance:
   - For `ORDER BY timestamp DESC`, files with a higher `max(timestamp)` are more promising
   - For `ORDER BY timestamp ASC`, files with a lower `min(timestamp)` are more promising
4. Scan files incrementally:
   - Start with the most promising files
   - Maintain a dynamic Top-N threshold
   - Prune files whose value ranges can't beat the current worst candidate
This is where the “dynamic filter” part matters:
the pruning boundary tightens as the query runs.
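Here is a minimal sketch of that loop in Python. It is not DuckLake's implementation: `FileStats` and `scan_file` are hypothetical stand-ins, and a real engine would also push the tightening threshold into row-group and zone-map filtering.

```python
# A self-contained sketch of Top-N dynamic file pruning for
# ORDER BY ts DESC LIMIT n. FileStats and scan_file are hypothetical.
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class FileStats:
    name: str
    min_ts: str  # ISO-8601 strings keep comparisons simple in this sketch
    max_ts: str

def top_n_desc(files: List[FileStats],
               scan_file: Callable[[str], Iterable[str]],
               n: int) -> List[str]:
    # Rank files by how promising they are: highest max(ts) first.
    ordered = sorted(files, key=lambda f: f.max_ts, reverse=True)
    heap: List[str] = []  # min-heap holding the best n timestamps so far
    for f in ordered:
        # Dynamic filter: once n candidates are held, any file whose max
        # can't beat the current worst candidate is skipped entirely.
        if len(heap) == n and f.max_ts <= heap[0]:
            continue
        for ts in scan_file(f.name):
            if len(heap) < n:
                heapq.heappush(heap, ts)
            elif ts > heap[0]:  # beats the current cutoff
                heapq.heapreplace(heap, ts)
            # rows at or below heap[0] are simply discarded; the cutoff
            # only ever tightens as the scan proceeds
    return sorted(heap, reverse=True)
```

The heap is the "current worst candidate" from the list above; every comparison against `heap[0]` is the dynamic filter doing its job.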
A Concrete Example
Imagine three files:
| File | min(timestamp) | max(timestamp) |
|---|---|---|
| A | 2023-01-01 | 2023-01-31 |
| B | 2024-01-01 | 2024-01-31 |
| C | 2025-01-01 | 2025-01-10 |
For `ORDER BY timestamp DESC LIMIT 10`, DuckLake can now:
- Start with File C: it has the highest `max(timestamp)`, so it's most likely to contain winners
- Scan File C: suppose it yields only 8 rows, with the 8th-best at `2025-01-03`
- Move to File B: its `max(timestamp)` of `2024-01-31` could still contribute
- Scan File B partially: after 2 more rows, we have our Top-10; the new cutoff is `2024-01-29`
- Prune the remaining rows in File B: any row older than `2024-01-29` can't improve the result
- Skip File A entirely: its `max(timestamp)` of `2023-01-31` can't beat `2024-01-29`
The result: less I/O, less CPU, lower latency, lower costs.
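Plugging the example's statistics into that final skip test shows how little work is needed to rule File A out; the dates below come straight from the table above, and nothing is read from disk.

```python
# Reproduce the final pruning decision from the walkthrough: with a
# Top-10 cutoff of 2024-01-29, File A's footer statistics alone rule it out.
files = {
    "A": ("2023-01-01", "2023-01-31"),
    "B": ("2024-01-01", "2024-01-31"),
    "C": ("2025-01-01", "2025-01-10"),
}
cutoff = "2024-01-29"  # the 10th-best timestamp found so far

for name, (min_ts, max_ts) in files.items():
    decision = "skip" if max_ts <= cutoff else "scan"
    print(f"File {name}: max(timestamp)={max_ts} -> {decision}")
# Files B and C had to be (at least partially) scanned; File A is skipped.
```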
Why This Matters Architecturally
The interesting thing about dynamic pruning isn't the speedup: it's the shift from plan-time decisions to runtime adaptation.
Traditional query optimizers behave like compilers: before execution, a cost-based optimizer (CBO) enumerates and scores many candidate plans (join orders, access paths, join algorithms) using statistics, then commits to a chosen strategy. If the stats are wrong (cardinality estimation is hard), you can end up locked into the wrong shape of work. Dynamic pruning inverts this. The execution learns as it runs: each batch of rows tightens the bound, which prunes more files, which changes what gets scanned next.
This has non-obvious implications:
Cost models become less fragile. Static optimizers guess wrong constantly: stale stats, skewed distributions, correlated columns. Dynamic pruning sidesteps this by measuring rather than estimating. The first file you scan tells you exactly what threshold you're working with.
Data layout suddenly matters a lot. If your files are time-partitioned and your queries are time-ordered, pruning is almost free. If your data is randomly distributed, you get no benefit. This creates a virtuous cycle: teams that think about file organization get compounding returns.
Exploratory queries become cheap. The old mental model was "exploration = expensive scans, production = optimized pipelines." With aggressive pruning, ad-hoc ORDER BY ... LIMIT queries can be as cheap as point lookups. That changes what you're willing to ask.
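As a rough illustration of the layout point, assuming a pyarrow-based writer and an invented `event_date` partition key, time-partitioned files give each file a naturally tight timestamp range for the pruner to work with:

```python
# Time-partitioned layout: one directory of files per day, so every file's
# min/max(timestamp) spans at most 24 hours. Data and paths are invented.
from datetime import datetime, timedelta

import pyarrow as pa
import pyarrow.parquet as pq

timestamps = [datetime(2025, 1, 1) + timedelta(minutes=i) for i in range(10_000)]
events = pa.table({
    "timestamp": timestamps,
    "event_date": [ts.date() for ts in timestamps],  # coarse partition key
})

# Each partition's files cover a narrow range, which is what pruning feeds on.
pq.write_to_dataset(events, root_path="events_partitioned",
                    partition_cols=["event_date"])
```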
The Quiet Revolution
Remember the query we started with?
```sql
SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 10;
```
It still looks trivial.
But now, on a well-organized lake, it behaves trivially too.
That's the real win. Not a new keyword. Not a config flag. Just a query that finally does what you expected all along.
PR #668 is the kind of work that doesn't make headlines, but it's exactly what moves lakehouses from "warehouse alternative" to "warehouse replacement".
The gap is closing. One pruned file at a time.





