JANUARY 13, 2026

5 MIN READ

SYLVAIN UTARD

Lessons from Search

Real-time analytics faces the small-file problem search engines solved. DuckLake's tiered compaction brings those merge strategies to streaming analytics.

Modern analytics systems love to talk about real-time ingestion. In practice, many of them still behave like batch systems wearing a streaming trench coat.

If you ingest events continuously, you inevitably produce many small files. And once you do, every query, every scan, every merge starts paying an invisible tax: file metadata overhead, I/O amplification, cache churn, scheduler overhead. We need to stop forcing analytics through complex batch pipelines just to dodge that tax.

This problem is not new.

Search engines have been fighting it for decades.

DuckLake, DuckDB's recently introduced integrated data lake and catalog format, just gained a small but important capability: the ability to filter which files are eligible for compaction based on their size. This landed via new min_file_size and max_file_size parameters on ducklake_merge_adjacent_files, a change we contributed upstream to DuckLake.

This post explains why this matters, where the idea comes from, and how tiered compaction changes the economics of real-time analytics.

The Small-File Problem Is Structural

Any pipeline that turns streamed writes into object-storage files (Kafka/CDC sinks, event collectors, “streaming” lake ingestion, etc.) without integrated, size-aware compaction will eventually produce files that are:

  • too small to be efficient
  • too numerous to scan cheaply
  • unevenly distributed in time

You can delay this with buffering. You can hide it with batching.

But if you truly want low-latency ingestion, small files are not a bug: they're a consequence.

(Contrast: LSM-tree storage engines like RocksDB or Cassandra do integrate compaction into the write path; this post is about lake-style storage where ingestion pushes immutable files into passive storage.)

The real question isn't how to avoid small files.

It's how to absorb them without destroying read performance.

Search Engines Solved This a Long Time Ago

Before analytics engines rediscovered “streaming,” search engines already lived there.

Lucene-based systems like Elasticsearch treat every flush as an immutable segment. Indexing is continuous, queries are concurrent, and the index is always in motion.

The key insight was this:

You don't merge everything all the time.
You merge similar things, incrementally, in tiers.

Lucene's tiered merge policy ensures that:

  • small segments merge quickly
  • large segments merge rarely
  • I/O amplification is bounded
  • merges don't cascade uncontrollably

This is what allows search engines to ingest thousands of writes per second while staying queryable.
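To see why the amplification stays bounded, here's a toy model (plain Python, deliberately simplified; this is an illustration of the tiering idea, not Lucene's actual TieredMergePolicy): because segments only merge with similarly sized peers, each ingested byte is rewritten at most once per tier, so write amplification converges to roughly one plus the number of tiers.

```python
# Toy model of tiered merging: a segment is just its size in MB.
# Assumption: a tier merges whenever it accumulates FANOUT segments.

FANOUT = 10      # segments per merge
FLUSH_MB = 1     # size of each freshly flushed segment
NUM_TIERS = 3    # 1 MB -> 10 MB -> 100 MB -> 1000 MB

def simulate(num_flushes):
    tiers = [[] for _ in range(NUM_TIERS + 1)]
    bytes_ingested = 0
    bytes_written = 0  # initial flushes plus every rewrite
    for _ in range(num_flushes):
        tiers[0].append(FLUSH_MB)
        bytes_ingested += FLUSH_MB
        bytes_written += FLUSH_MB
        # Cascade upward: whenever a tier fills, merge it into the next.
        for t in range(NUM_TIERS):
            if len(tiers[t]) >= FANOUT:
                merged = sum(tiers[t])
                tiers[t] = []
                tiers[t + 1].append(merged)
                bytes_written += merged  # a merge rewrites its inputs once
    return bytes_written / bytes_ingested

# Write amplification stays near 1 + NUM_TIERS no matter how long we ingest:
print(round(simulate(10_000), 2))  # → 4.0
```

The key invariant: amplification depends on the number of tiers, not on how many events you ingest, which is exactly what makes continuous ingestion sustainable.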

DuckLake Is Getting the Same Primitive

DuckLake stores data in immutable files. That's a great property: immutable data is simple, safe, and cache-friendly.

But immutability only works at scale if you have good compaction.

The recent DuckLake change adds two optional parameters to ducklake_merge_adjacent_files:

  • min_file_size
  • max_file_size

This seems small. It isn't.

It allows you to selectively compact files based on size, which is the missing lever for tiered strategies.

When combined with the existing target_file_size parameter, you can control both:

  • which files participate in a merge (min_file_size / max_file_size)
  • how large the resulting merged file should be (target_file_size)
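Combining the levers, a per-tier invocation can be sketched as follows (Python assembling a DuckDB SQL call; the function and parameter names are the ones discussed above, but the exact CALL syntax and argument shape are an assumption and may differ in your DuckLake version):

```python
MB = 1024 * 1024

def merge_call(min_size, max_size, target_size):
    """Build a size-filtered DuckLake compaction call for one tier.

    Assumption: the named-parameter CALL syntax shown here is
    illustrative; check your DuckLake version for the real signature.
    """
    return (
        "CALL ducklake_merge_adjacent_files("
        f"min_file_size => {min_size}, "
        f"max_file_size => {max_size}, "
        f"target_file_size => {target_size})"
    )

# Tier 0 -> Tier 1: fold sub-1 MB files into ~10 MB outputs.
print(merge_call(0, 1 * MB, 10 * MB))
```

Running one such call per tier, on its own schedule, is all a tiered policy requires from the storage layer.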

A Tiered Compaction Strategy (Concrete Example)

Here's a practical, Lucene-inspired strategy for streaming ingestion:

  • Tier 0 → Tier 1: Merge small files (< 1 MB) into ~10 MB files
  • Tier 1 → Tier 2: Merge medium files (1-10 MB) into ~50 MB files
  • Tier 2 → Tier 3: Merge large files (10-50 MB) into ~200 MB files

Each tier uses min_file_size and max_file_size to filter candidates, then target_file_size to cap the output. This gives you clear invariants:

  • You never merge files that are wildly different in size
  • You cap the output size of each merge
  • You bound I/O amplification per tier
  • You prevent "merge storms"

This is not theoretical: this exact approach has powered large-scale search infrastructure for years.
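The scheduling side of such a policy is only a few lines. Here's a minimal sketch: the thresholds mirror the three tiers above, and the helper names (tier_candidates, should_merge, min_batch) are hypothetical, not part of any DuckLake API:

```python
MB = 1024 * 1024

# (min_file_size, max_file_size, target_file_size) per tier,
# mirroring the three tiers described above.
TIERS = [
    (0,       1 * MB,  10 * MB),   # Tier 0 -> 1
    (1 * MB,  10 * MB, 50 * MB),   # Tier 1 -> 2
    (10 * MB, 50 * MB, 200 * MB),  # Tier 2 -> 3
]

def tier_candidates(file_sizes, tier):
    """Files this tier's merge would select (what min/max_file_size do)."""
    lo, hi, _target = TIERS[tier]
    return [s for s in file_sizes if lo <= s < hi]

def should_merge(file_sizes, tier, min_batch=4):
    """Only fire a merge once enough similarly sized files have piled up."""
    return len(tier_candidates(file_sizes, tier)) >= min_batch

files = [200_000, 500_000, 800_000, 900_000, 30 * MB, 45 * MB]
print(should_merge(files, 0))  # → True: four sub-1 MB files, merge Tier 0
print(should_merge(files, 2))  # → False: only two 10-50 MB files, wait
```

Because each tier only ever sees files within its own size band, a 45 MB file is never rewritten just because a burst of tiny files arrived, which is precisely how merge storms are avoided.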

Why This Matters for Analytics (Not Just Storage)

Analytics workloads are different from search, but the ingestion problem is the same.

Without tiered compaction:

  • small files linger too long
  • scans touch too many objects
  • background merges compete with queries
  • latency becomes unpredictable

With tiered compaction:

  • recent data compacts quickly
  • historical data stabilizes
  • scans become progressively cheaper
  • compaction work (CPU + I/O) is spread out over time, avoiding latency spikes

This is especially important for incremental analytics: dashboards, funnels, feature tracking, real-time metrics.

You want:

  • fresh data now
  • stable data cheap
  • no global reprocessing

Tiered compaction gives you that shape.

Coming from the Search World

Before working on analytics systems, I spent years in search:

  • at Exalead
  • then as the first employee at Algolia, where I led the engineering team

Segment merging, compaction pressure, streamed write logs: this was daily reality.

What's exciting today is seeing these battle-tested ideas finally land in analytical systems like DuckLake.

Not as academic features but as practical primitives engineers can build on.

Compaction Is a Policy, Not a Background Task

The biggest mental shift is this:

Compaction is not a cleanup job.
It is part of your ingestion strategy.

By exposing size-based filtering, DuckLake lets you:

  • encode intent
  • control amplification
  • adapt to workload shape

This opens the door to:

  • tier-aware scheduling
  • ingestion-rate-adaptive merging
  • cost-predictable real-time analytics

And that's how you stop batching for real.

If you're building streaming analytics today, this is one of those “small change, big consequence” moments.

The search world learned this lesson the hard way.
It's good to see analytics catching up with better tools, and fewer scars.
