Modern analytics systems love to talk about real-time ingestion. In practice, many of them still behave like batch systems wearing a streaming trench coat.
If you ingest events continuously, you inevitably produce many small files. And once you do, every query, every scan, every merge starts paying an invisible tax: file metadata, I/O amplification, cache churn, scheduler overhead.
This problem is not new.
Search engines have been fighting it for decades.
Recently, DuckLake gained a small but important capability: the ability to filter which files are eligible for compaction based on their size. This landed as new min_file_size and max_file_size parameters on ducklake_merge_adjacent_files, a change we contributed upstream.
This post explains why this matters, where the idea comes from, and how tiered compaction changes the economics of real-time analytics.
The Small-File Problem Is Structural
Any pipeline that turns streamed writes into object storage files (Kafka/CDC sinks, event collectors, “streaming” lake ingestion, etc.) without integrated, size-aware compaction will eventually produce files that are:
- too small to be efficient
- too numerous to scan cheaply
- unevenly distributed in time
You can delay this with buffering. You can hide it with batching.
But if you truly want low-latency ingestion, small files are not a bug: they're a consequence.
(Contrast: LSM-tree storage engines like RocksDB or Cassandra do integrate compaction into the write path; this post is about lake-style storage where ingestion pushes immutable files into passive storage.)
The real question isn't how to avoid small files.
It's how to absorb them without destroying read performance.
Search Engines Solved This a Long Time Ago
Before analytics engines rediscovered “streaming,” search engines already lived there.
Lucene-based systems like Elasticsearch treat every flush as an immutable segment. Indexing is continuous, queries are concurrent, and the index is always in motion.
The key insight was this:
You don't merge everything all the time.
You merge similar things, incrementally, in tiers.
Lucene's tiered merge policy ensures that:
- small segments merge quickly
- large segments merge rarely
- I/O amplification is bounded
- merges don't cascade uncontrollably
This is what allows search engines to ingest thousands of writes per second while staying queryable.
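The mechanics are simple to sketch. Here is a toy version of the bucketing idea in Python, with made-up thresholds that are not Lucene's actual defaults: files are grouped into size tiers, and a merge is only proposed within a tier once enough similar-sized candidates have piled up.

```python
# Toy sketch of size-tiered merge selection. Thresholds and the candidate
# count are illustrative, not Lucene's real defaults.
from bisect import bisect_right

TIER_BOUNDS = [1_000_000, 10_000_000, 50_000_000]   # 1MB, 10MB, 50MB
MIN_CANDIDATES_PER_MERGE = 8                         # don't merge tiny groups

def tier_of(size_bytes: int) -> int:
    """Return the tier index for a file of the given size."""
    return bisect_right(TIER_BOUNDS, size_bytes)

def plan_merges(file_sizes: list[int]) -> list[list[int]]:
    """Group files by tier; propose one merge per tier with enough candidates."""
    tiers: dict[int, list[int]] = {}
    for size in file_sizes:
        tiers.setdefault(tier_of(size), []).append(size)
    return [sizes for sizes in tiers.values() if len(sizes) >= MIN_CANDIDATES_PER_MERGE]
```

The point is not the code itself but the shape of the policy: similar-sized things merge together, and big files are left alone until enough of their peers exist.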
DuckLake Is Getting the Same Primitive
DuckLake stores data in immutable files. That's a great property: immutable data is simple, safe, and cache-friendly.
But immutability only works at scale if you have good compaction.
The recent DuckLake change adds two optional parameters to ducklake_merge_adjacent_files:
- min_file_size
- max_file_size
This seems small. It isn't.
It allows you to selectively compact files based on size, which is the missing lever for tiered strategies.
When combined with the existing target_file_size parameter, you can control both:
- which files participate in a merge (min_file_size/max_file_size)
- how large the resulting merged file should be (target_file_size)
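To make that concrete, here is a minimal sketch using the DuckDB Python client. It assumes the DuckLake catalog is already attached as my_lake (a name chosen for illustration), that sizes are given in bytes, and that the parameters can be passed as named arguments; check the DuckLake documentation for the exact signature.

```python
import duckdb

# Assumes the ducklake extension is installed/loaded and the catalog is
# already attached as "my_lake" (illustrative name); ATTACH details omitted.
con = duckdb.connect()

# Compact only the small files: anything under ~1MB is a candidate, and the
# merged outputs should come out around 10MB. Parameter names follow the
# description above; exact syntax and units are assumptions.
con.execute("""
    CALL ducklake_merge_adjacent_files(
        'my_lake',
        min_file_size    => 0,
        max_file_size    => 1000000,     -- only files below ~1MB participate
        target_file_size => 10000000     -- aim for ~10MB merged files
    )
""")
```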
A Tiered Compaction Strategy (Concrete Example)
Here's a practical, Lucene-inspired strategy for streaming ingestion:
- Tier 0 → Tier 1: Merge small files (< 1MB) into ~10MB files
- Tier 1 → Tier 2: Merge medium files (1MB-10MB) into ~50MB files
- Tier 2 → Tier 3: Merge large files (10MB-50MB) into ~200MB files
Each tier uses min_file_size and max_file_size to filter candidates, then target_file_size to cap the output. This gives you clear invariants:
- You never merge files that are wildly different in size
- You cap the output size of each merge
- You bound I/O amplification per tier
- You prevent "merge storms"
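Putting the tiers together, a full pass might look like the sketch below (same assumptions as before: catalog attached as my_lake, sizes in bytes, named-argument syntax). It runs one merge per tier, smallest first, so freshly merged files can graduate to the next tier on a later pass.

```python
# Tier boundaries from the strategy above, in bytes:
# (min_file_size, max_file_size, target_file_size) per tier.
TIERS = [
    (0,           1_000_000,  10_000_000),   # Tier 0 -> 1: <1MB   into ~10MB
    (1_000_000,  10_000_000,  50_000_000),   # Tier 1 -> 2: 1-10MB  into ~50MB
    (10_000_000, 50_000_000, 200_000_000),   # Tier 2 -> 3: 10-50MB into ~200MB
]

def run_tiered_compaction(con, catalog: str = "my_lake") -> None:
    """One size-filtered merge pass per tier, smallest tier first."""
    for min_size, max_size, target_size in TIERS:
        con.execute(f"""
            CALL ducklake_merge_adjacent_files(
                '{catalog}',
                min_file_size    => {min_size},
                max_file_size    => {max_size},
                target_file_size => {target_size}
            )
        """)
```

Run it on a schedule, or after every N ingested batches, and each pass does a bounded, predictable amount of work.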
This is not theoretical: this exact approach has powered large-scale search infrastructure for years.
Why This Matters for Analytics (Not Just Storage)
Analytics workloads are different from search, but the ingestion problem is the same.
Without tiered compaction:
- small files linger too long
- scans touch too many objects
- background merges compete with queries
- latency becomes unpredictable
With tiered compaction:
- recent data compacts quickly
- historical data stabilizes
- scans become progressively cheaper
- compaction work (CPU + I/O) is spread out over time, avoiding latency spikes
This is especially important for incremental analytics: dashboards, funnels, feature tracking, real-time metrics.
You want:
- fresh data now
- stable data cheap
- no global reprocessing
Tiered compaction gives you that shape.
Coming from the Search World
Before working on analytics systems, I spent years in search. Segment merging, compaction pressure, streamed write logs: this was daily reality.
What's exciting today is seeing these battle-tested ideas finally land in analytical systems like DuckLake.
Not as academic features but as practical primitives engineers can build on.
Compaction Is a Policy, Not a Background Task
The biggest mental shift is this:
Compaction is not a cleanup job.
It is part of your ingestion strategy.
By exposing size-based filtering, DuckLake lets you:
- encode intent
- control amplification
- adapt to workload shape
This opens the door to:
- tier-aware scheduling
- ingestion-rate-adaptive merging
- cost-predictable real-time analytics
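As a sketch of what tier-aware scheduling could look like, the loop below reuses the TIERS table from the earlier sketch and simply runs the small tier on a tight cadence and the larger tiers on slower ones. The intervals, helper names, and scheduling logic are all illustrative; none of this is DuckLake API.

```python
import time

def compaction_loop(con, tiers, intervals_s, catalog="my_lake"):
    """Run each tier on its own cadence: the small tier often, large tiers rarely.

    `tiers` is the list of (min_file_size, max_file_size, target_file_size)
    tuples from the previous sketch; `intervals_s` gives one interval per tier,
    e.g. [60, 600, 3600] for minutely / ten-minutely / hourly passes.
    """
    last_run = [0.0] * len(tiers)
    while True:
        now = time.monotonic()
        for i, ((lo, hi, target), interval) in enumerate(zip(tiers, intervals_s)):
            if now - last_run[i] < interval:
                continue
            con.execute(f"""
                CALL ducklake_merge_adjacent_files(
                    '{catalog}',
                    min_file_size    => {lo},
                    max_file_size    => {hi},
                    target_file_size => {target}
                )
            """)
            last_run[i] = now
        time.sleep(5)  # coarse tick; a real scheduler would also watch ingestion rate
```

An ingestion-rate-adaptive version would shorten or stretch those intervals based on how many small files arrived since the last pass, but the tier structure stays the same.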
And that's how you stop batching for real.
If you're building streaming analytics today, this is one of those “small change, big consequence” moments.
The search world learned this lesson the hard way.
It's good to see analytics catching up, with better tools and fewer scars.



