Modern analytics systems love to talk about real-time ingestion. In practice, many of them still behave like batch systems wearing a streaming trench coat.
If you ingest events continuously, you inevitably produce many small files. And once you do, every query, every scan, every merge starts paying an invisible tax: file metadata, I/O amplification, cache churn, scheduler overhead.
This problem is not new.
Search engines have been fighting it for decades.
Recently, DuckLake gained a small but important capability: the ability to filter which files are eligible for compaction based on their size. This landed as new min_file_size and max_file_size parameters on ducklake_merge_adjacent_files, a change we contributed upstream.
This post explains why this matters, where the idea comes from, and how tiered compaction changes the economics of real-time analytics.
The Small-File Problem Is Structural
Any pipeline that turns streamed writes into object storage files (Kafka/CDC sinks, event collectors, “streaming” lake ingestion, etc.) without integrated, size-aware compaction will eventually produce files that are:
- too small to be efficient
- too numerous to scan cheaply
- unevenly distributed in time
You can delay this with buffering. You can hide it with batching.
But if you truly want low-latency ingestion, small files are not a bug: they're a consequence.
(Contrast: LSM-tree storage engines like RocksDB or Cassandra do integrate compaction into the write path; this post is about lake-style storage where ingestion pushes immutable files into passive storage.)
The real question isn't how to avoid small files.
It's how to absorb them without destroying read performance.
Search Engines Solved This a Long Time Ago
Before analytics engines rediscovered “streaming,” search engines already lived there.
Lucene-based systems like Elasticsearch treat every flush as an immutable segment. Indexing is continuous, queries are concurrent, and the index is always in motion.
The key insight was this:
You don't merge everything all the time.
You merge similar things, incrementally, in tiers.
Lucene's tiered merge policy ensures that:
- small segments merge quickly
- large segments merge rarely
- I/O amplification is bounded
- merges don't cascade uncontrollably
This is what allows search engines to ingest thousands of writes per second while staying queryable.
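The mechanics are simple to sketch. Here is a toy version of the bucketing idea in Python, with made-up thresholds that are not Lucene's actual defaults: files are grouped into size tiers, and a merge is only proposed within a tier once enough similar-sized candidates have piled up.

```python
# Toy sketch of size-tiered merge selection. Thresholds and the candidate
# count are illustrative, not Lucene's real defaults.
from bisect import bisect_right

TIER_BOUNDS = [1_000_000, 10_000_000, 50_000_000]   # 1MB, 10MB, 50MB
MIN_CANDIDATES_PER_MERGE = 8                         # don't merge tiny groups

def tier_of(size_bytes: int) -> int:
    """Return the tier index for a file of the given size."""
    return bisect_right(TIER_BOUNDS, size_bytes)

def plan_merges(file_sizes: list[int]) -> list[list[int]]:
    """Group files by tier; propose one merge per tier with enough candidates."""
    tiers: dict[int, list[int]] = {}
    for size in file_sizes:
        tiers.setdefault(tier_of(size), []).append(size)
    return [sizes for sizes in tiers.values() if len(sizes) >= MIN_CANDIDATES_PER_MERGE]
```

The point is not the code itself but the shape of the policy: similar-sized things merge together, and big files are left alone until enough of their peers exist.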
DuckLake Is Getting the Same Primitive
DuckLake stores data in immutable files. That's a great property: immutable data is simple, safe, and cache-friendly.
But immutability only works at scale if you have good compaction.
The recent DuckLake change adds two optional parameters to ducklake_merge_adjacent_files:
- min_file_size
- max_file_size
This seems small. It isn't.
It allows you to selectively compact files based on size, which is the missing lever for tiered strategies.
When combined with the existing target_file_size parameter, you can control both:
- which files participate in a merge (min_file_size/max_file_size)
- how large the resulting merged file should be (target_file_size)
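To make that concrete, here is a minimal sketch using the DuckDB Python client. It assumes the DuckLake catalog is already attached as my_lake (a name chosen for illustration), that sizes are given in bytes, and that the parameters can be passed as named arguments; check the DuckLake documentation for the exact signature.

```python
import duckdb

# Assumes the ducklake extension is installed/loaded and the catalog is
# already attached as "my_lake" (illustrative name); ATTACH details omitted.
con = duckdb.connect()

# Compact only the small files: anything under ~1MB is a candidate, and the
# merged outputs should come out around 10MB. Parameter names follow the
# description above; exact syntax and units are assumptions.
con.execute("""
    CALL ducklake_merge_adjacent_files(
        'my_lake',
        min_file_size    => 0,
        max_file_size    => 1000000,     -- only files below ~1MB participate
        target_file_size => 10000000     -- aim for ~10MB merged files
    )
""")
```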
A Tiered Compaction Strategy (Concrete Example)
Here's a practical, Lucene-inspired strategy for streaming ingestion:
- Tier 0 → Tier 1: Merge small files (< 1MB) into ~10MB files
- Tier 1 → Tier 2: Merge medium files (1MB-10MB) into ~50MB files
- Tier 2 → Tier 3: Merge large files (10MB-50MB) into ~200MB files
Each tier uses min_file_size and max_file_size to filter candidates, then target_file_size to cap the output. This gives you clear invariants:
- You never merge files that are wildly different in size
- You cap the output size of each merge
- You bound I/O amplification per tier
- You prevent "merge storms"
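Putting the tiers together, a full pass might look like the sketch below (same assumptions as before: catalog attached as my_lake, sizes in bytes, named-argument syntax). It runs one merge per tier, smallest first, so freshly merged files can graduate to the next tier on a later pass.

```python
# Tier boundaries from the strategy above, in bytes:
# (min_file_size, max_file_size, target_file_size) per tier.
TIERS = [
    (0,           1_000_000,  10_000_000),   # Tier 0 -> 1: <1MB   into ~10MB
    (1_000_000,  10_000_000,  50_000_000),   # Tier 1 -> 2: 1-10MB  into ~50MB
    (10_000_000, 50_000_000, 200_000_000),   # Tier 2 -> 3: 10-50MB into ~200MB
]

def run_tiered_compaction(con, catalog: str = "my_lake") -> None:
    """One size-filtered merge pass per tier, smallest tier first."""
    for min_size, max_size, target_size in TIERS:
        con.execute(f"""
            CALL ducklake_merge_adjacent_files(
                '{catalog}',
                min_file_size    => {min_size},
                max_file_size    => {max_size},
                target_file_size => {target_size}
            )
        """)
```

Run it on a schedule, or after every N ingested batches, and each pass does a bounded, predictable amount of work.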
This is not theoretical: this exact approach has powered large-scale search infrastructure for years.
Why This Matters for Analytics (Not Just Storage)
Analytics workloads are different from search, but the ingestion problem is the same.
Without tiered compaction:
- small files linger too long
- scans touch too many objects
- background merges compete with queries
- latency becomes unpredictable
With tiered compaction:
- recent data compacts quickly
- historical data stabilizes
- scans become progressively cheaper
- compaction work (CPU + I/O) is spread out over time, avoiding latency spikes
This is especially important for incremental analytics: dashboards, funnels, feature tracking, real-time metrics.
You want:
- fresh data now
- stable data cheap
- no global reprocessing
Tiered compaction gives you that shape.
Coming from the Search World
Before working on analytics systems, I spent years in search. Segment merging, compaction pressure, streamed write logs: this was daily reality.
What's exciting today is seeing these battle-tested ideas finally land in analytical systems like DuckLake.
Not as academic features but as practical primitives engineers can build on.
Compaction Is a Policy, Not a Background Task
The biggest mental shift is this:
Compaction is not a cleanup job.
It is part of your ingestion strategy.
By exposing size-based filtering, DuckLake lets you:
- encode intent
- control amplification
- adapt to workload shape
This opens the door to:
- tier-aware scheduling
- ingestion-rate-adaptive merging
- cost-predictable real-time analytics
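As a sketch of what tier-aware scheduling could look like, the loop below reuses the TIERS table from the earlier sketch and simply runs the small tier on a tight cadence and the larger tiers on slower ones. The intervals, helper names, and scheduling logic are all illustrative; none of this is DuckLake API.

```python
import time

def compaction_loop(con, tiers, intervals_s, catalog="my_lake"):
    """Run each tier on its own cadence: the small tier often, large tiers rarely.

    `tiers` is the list of (min_file_size, max_file_size, target_file_size)
    tuples from the previous sketch; `intervals_s` gives one interval per tier,
    e.g. [60, 600, 3600] for minutely / ten-minutely / hourly passes.
    """
    last_run = [0.0] * len(tiers)
    while True:
        now = time.monotonic()
        for i, ((lo, hi, target), interval) in enumerate(zip(tiers, intervals_s)):
            if now - last_run[i] < interval:
                continue
            con.execute(f"""
                CALL ducklake_merge_adjacent_files(
                    '{catalog}',
                    min_file_size    => {lo},
                    max_file_size    => {hi},
                    target_file_size => {target}
                )
            """)
            last_run[i] = now
        time.sleep(5)  # coarse tick; a real scheduler would also watch ingestion rate
```

An ingestion-rate-adaptive version would shorten or stretch those intervals based on how many small files arrived since the last pass, but the tier structure stays the same.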
And that's how you stop batching for real.
If you're building streaming analytics today, this is one of those “small change, big consequence” moments.
The search world learned this lesson the hard way.
It's good to see analytics catching up, with better tools and fewer scars.



