FEBRUARY 3, 2026

SYLVAIN UTARD

Pruning Top-N Queries

A deep dive into DuckLake PR #668 and how Top-N dynamic filter pruning turns ORDER BY + LIMIT from full scans into metadata-driven execution.

One of the fastest ways to lose trust in a lakehouse is a query that looks simple but behaves catastrophically at scale.

SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 10;

On a traditional database, this feels trivial.
On a large Parquet-backed lake, it can mean scanning and sorting billions of rows just to return ten.

This is exactly the problem tackled by DuckLake PR #668, which introduces Top-N dynamic filter file pruning. It's a small-sounding change with outsized implications for interactive analytics on lakes.

Let's unpack what's going on.

Why ORDER BY + LIMIT Is Hard on a Lake

In a lakehouse, data is split across many Parquet files. Each file typically has:

  • Millions of rows
  • A bounded range of values per column, captured in min/max statistics
  • No ordering guarantees relative to other files

Without pruning, a Top-N query forces the engine to:

  1. Read all files
  2. Extract the relevant column
  3. Sort all rows
  4. Keep only the top N

That's correct... but deeply inefficient.
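To make "deeply inefficient" concrete, here's a minimal Python sketch of that un-pruned plan, reading the files directly with pyarrow. The file pattern and column name are illustrative, not anything DuckLake exposes:

import glob
import pyarrow.parquet as pq

def naive_top_n(pattern: str, n: int, column: str = "timestamp"):
    values = []
    for path in glob.glob(pattern):                    # 1. read all files
        table = pq.read_table(path, columns=[column])  # 2. extract the relevant column
        values.extend(table.column(column).to_pylist())
    return sorted(values, reverse=True)[:n]            # 3-4. sort all rows, keep top N

Calling naive_top_n("events/*.parquet", 10) touches every value of the column in every file, no matter how few files actually hold the most recent events.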

The real issue isn't sorting.
It's not knowing which files can be skipped before we even start.

The Key Insight: File Statistics Are Already Telling Us the Answer

Parquet files store min/max statistics per column.
For a timestamp column, that means every file already knows:

  • The earliest timestamp it contains
  • The latest timestamp it contains

If you're asking for:

“the 10 most recent events”

then any file whose maximum timestamp is older than your current Top-N threshold is irrelevant.
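Here's a minimal sketch of that observation, using pyarrow to read footer statistics. DuckLake already keeps these per-file statistics in its catalog metadata (as the next section notes), but the idea is the same: the value range of a column is knowable without reading a single data page.

import pyarrow.parquet as pq

def timestamp_range(path: str, column: str = "timestamp"):
    # Return (min, max) of `column` across all row groups of one file.
    metadata = pq.ParquetFile(path).metadata
    col_idx = metadata.schema.to_arrow_schema().get_field_index(column)
    mins, maxs = [], []
    for rg in range(metadata.num_row_groups):
        stats = metadata.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            return None  # no statistics: this file can never be safely skipped
        mins.append(stats.min)
        maxs.append(stats.max)
    return min(mins), max(maxs)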

Before PR #668, DuckLake didn't fully exploit this fact for Top-N queries.

What PR #668 Introduces

PR #668 adds Top-N-aware file pruning based on column statistics.

At a high level, the execution becomes:

  1. Detect a Top-N query

    • ORDER BY column [ASC|DESC] LIMIT N
  2. Collect per-file min/max statistics

    • Already present in DuckLake metadata
  3. Rank files by relevance

    • For ORDER BY timestamp DESC, files with higher max(timestamp) are more promising
  4. Scan files incrementally

    • Start with the most promising files
    • Maintain a dynamic Top-N threshold
    • Prune files whose value ranges can't beat the current worst candidate

This is where the “dynamic filter” part matters:
the pruning boundary tightens as the query runs.
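Here is a minimal Python sketch of that loop for a DESC ordering. FileStats, its in-memory rows, and top_n_desc are illustrative stand-ins for DuckLake's file metadata and scan operators, not its actual interfaces:

import heapq
from dataclasses import dataclass

@dataclass
class FileStats:
    name: str
    min_ts: str
    max_ts: str
    rows: list[str]   # the file's timestamp column, in no particular order

def top_n_desc(files: list[FileStats], n: int) -> list[str]:
    # Rank files by relevance: highest max(timestamp) first.
    ordered = sorted(files, key=lambda f: f.max_ts, reverse=True)
    heap: list[str] = []               # min-heap of the current best n values
    for f in ordered:
        # Dynamic filter: once n candidates are held, a file whose max value
        # can't beat the current worst candidate is skipped without being read.
        if len(heap) == n and f.max_ts <= heap[0]:
            print(f"pruned {f.name}: max {f.max_ts} <= threshold {heap[0]}")
            continue
        for ts in f.rows:              # "scan" the file
            if len(heap) < n:
                heapq.heappush(heap, ts)
            elif ts > heap[0]:         # the threshold tightens as rows arrive
                heapq.heapreplace(heap, ts)
    return sorted(heap, reverse=True)

The real implementation works on Parquet row groups and vectors rather than Python lists, but the per-file decision is the same: compare the file's value range against the current Top-N threshold, and skip it when it can't win.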

A Concrete Example

Imagine three files:

File   min(timestamp)   max(timestamp)
A      2023-01-01       2023-01-31
B      2024-01-01       2024-01-31
C      2025-01-01       2025-01-10

For:

ORDER BY timestamp DESC
LIMIT 10

DuckLake can now:

  1. Start with File C — it has the highest max(timestamp), so it's most likely to contain winners
  2. Scan File C — suppose it yields 8 matching rows, with the 8th-best at 2025-01-03
  3. Move to File B — its max(timestamp) of 2024-01-31 could still contribute
  4. Scan File B partially — after 2 more rows, we have our Top-10; the new cutoff is 2024-01-29
  5. Prune remaining rows in File B — any row older than 2024-01-29 can't improve our result
  6. Skip File A entirely — its max(timestamp) of 2023-01-31 can't beat 2024-01-29

The result: less I/O, less CPU, lower latency, lower costs.
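Feeding the three files into the top_n_desc sketch from the previous section shows the same behaviour. The rows are made up (consistent with the min/max ranges in the table, though the exact values differ slightly from the walkthrough above), and File A is pruned without ever being scanned:

files = [
    FileStats("A", "2023-01-01", "2023-01-31",
              [f"2023-01-{d:02d}" for d in range(1, 32)]),
    FileStats("B", "2024-01-01", "2024-01-31",
              [f"2024-01-{d:02d}" for d in range(1, 32)]),
    FileStats("C", "2025-01-01", "2025-01-10",
              [f"2025-01-{d:02d}" for d in (1, 2, 4, 5, 6, 8, 9, 10)]),
]
top10 = top_n_desc(files, 10)
# prints: pruned A: max 2023-01-31 <= threshold 2024-01-30
print(top10[0], "...", top10[-2:])
# 2025-01-10 ... ['2024-01-31', '2024-01-30']

File C's 8 rows and File B's 2 newest rows fill the Top-10; B's older rows are discarded by the tightening threshold, and A is never opened.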

Why This Matters Architecturally

The interesting thing about dynamic pruning isn't the speedup: it's the shift from plan-time decisions to runtime adaptation.

Traditional query optimizers behave like compilers: before execution, a cost-based optimizer (CBO) enumerates and scores many candidate plans (join orders, access paths, join algorithms) using statistics, then commits to a chosen strategy. If the stats are wrong (cardinality estimation is hard), you can end up locked into the wrong shape of work. Dynamic pruning inverts this. The execution learns as it runs: each batch of rows tightens the bound, which prunes more files, which changes what gets scanned next.

This has non-obvious implications:

Cost models become less fragile. Static optimizers guess wrong constantly: stale stats, skewed distributions, correlated columns. Dynamic pruning sidesteps this by measuring rather than estimating. The first file you scan tells you exactly what threshold you're working with.

Data layout suddenly matters a lot. If your files are time-partitioned and your queries are time-ordered, pruning is almost free. If your data is randomly distributed, you get no benefit. This creates a virtuous cycle: teams that think about file organization get compounding returns.

Exploratory queries become cheap. The old mental model was "exploration = expensive scans, production = optimized pipelines." With aggressive pruning, ad-hoc ORDER BY ... LIMIT queries can be as cheap as point lookups. That changes what you're willing to ask.

The Quiet Revolution

Remember the query we started with?

SELECT *
FROM events
ORDER BY timestamp DESC
LIMIT 10;

It still looks trivial.
But now, on a well-organized lake, it behaves trivially too.

That's the real win. Not a new keyword. Not a config flag. Just a query that finally does what you expected all along.

PR #668 is the kind of work that doesn't make headlines, but it's exactly what moves lakehouses from "warehouse alternative" to "warehouse replacement".

The gap is closing. One pruned file at a time.

