We just crossed 1,000,000,000 rows in a single customer's events table.
It did not go smoothly.
For the past two months in private beta, we pushed one real production workload to a scale where nice architectural diagrams stop mattering and physics starts enforcing rules. At that point, small inefficiencies compound, metadata balloons, caches thrash, and comfortable assumptions collapse. We knew it would happen at some point, of course. But there's a difference between anticipating scale and living through it.
Today that table holds:
- 1B rows
- 1.6TB of raw data
- 33 columns per row (30 VARCHAR with low to medium cardinality, 1 TIMESTAMP, 2 very-large JSON)
- Near-realtime ingestion
- Continuous production traffic
Queries like this run in about a second:
SELECT event, COUNT(*)
FROM events
GROUP BY 1
ORDER BY 2 DESC;
So does this:
SELECT uuid
FROM events
ORDER BY timestamp DESC
LIMIT 10;
And even this (mixed filters, JSON extraction, and a GROUP BY):
SELECT
  date_trunc('day', timestamp),
  bigjson_extract(properties, '$.my_attr'),
  COUNT(*)
FROM events
WHERE event = 'My Event'
  AND timestamp > NOW() - interval 1 month
GROUP BY 1, 2
ORDER BY 1, 2;
Most of the time, it's sub-second.
Getting there broke almost everything.
When Scale Gets Real
At 10M rows, everything looks fine. At 100M, you start noticing patterns. At 1B, every shortcut you ever took comes back to collect interest.
We experimented with:
- Partitioning strategies that looked great on paper and terrible in practice
- Different lakehouse sort orders
- Multiple Parquet compression algorithms (see the sketch after this list)
- Compaction heuristics to fight the small-file problem
- Reimplementing a BIGJSON field to avoid naïve JSON extraction costs
- Different caching strategies and eviction configurations
- Scaling underlying machines and memory envelopes
- Contributing upstream improvements to DuckDB and DuckLake (e.g. #668)
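To give a flavor of the compression experiments, here's a minimal sketch in plain DuckDB rather than our actual pipeline: write the same slice of the table with two codecs and compare the footprint the Parquet metadata reports. The file names and sample size are illustrative.

-- Write the same slice of events with two different Parquet codecs.
-- (LIMIT without ORDER BY is an arbitrary slice; good enough for a rough comparison.)
COPY (SELECT * FROM events LIMIT 10000000)
  TO 'events_zstd.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
COPY (SELECT * FROM events LIMIT 10000000)
  TO 'events_snappy.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);

-- Compare the on-disk footprint straight from the Parquet metadata.
SELECT file_name,
       SUM(total_compressed_size)   AS compressed_bytes,
       SUM(total_uncompressed_size) AS uncompressed_bytes
FROM parquet_metadata('events_*.parquet')
GROUP BY file_name;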
In a lakehouse architecture, nothing is isolated: storage layout, compression, sort order, caching, and vectorized execution are tightly coupled. Every layer mattered, and every "small" choice had a way of surfacing later as either a bill, a latency spike, or a new operational headache.
You don't feel that at 10 million rows; you absolutely feel it at a billion.
The Compression × Sort Order Surprise
One of the most counterintuitive lessons was how tightly compression and sort order are bound together.
Most teams think of compression as a storage concern (something that affects cost per GB). In a lakehouse, compression is a performance primitive.
Sort order influences:
- How well Parquet encodes values
- How effective run-length and dictionary encoding become
- How much data (and how many files) must be scanned for common filters
- How vectorized execution behaves in DuckDB
- How much I/O your system actually performs
Better sort order → better compression patterns → less I/O → faster queries → lower cost.
At scale, this isn't marginal. It's structural.
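One way to see the coupling for yourself, again as a sketch rather than our production setup: write the same sample of rows twice, once in arrival order and once sorted, with everything else held constant, then let the Parquet metadata tell you what happened. The sample size, seed, sort key, and file names below are illustrative.

-- One reproducible sample so both files contain identical rows.
CREATE TEMP TABLE events_sample AS
  SELECT * FROM events USING SAMPLE 1 PERCENT (system, 42);

-- Same rows, same codec; the only variable is the sort order.
COPY events_sample
  TO 'sample_arrival_order.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
COPY (SELECT * FROM events_sample ORDER BY event, timestamp)
  TO 'sample_sorted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- Sorting clusters the low-cardinality VARCHARs, so dictionary and
-- run-length encoding get longer runs and the compressed size typically drops.
SELECT file_name, SUM(total_compressed_size) AS compressed_bytes
FROM parquet_metadata('sample_*.parquet')
GROUP BY file_name;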
We ended up tuning sort strategies not just for filtering efficiency, but for compression behavior under realistic mixed workloads: time-based scans, GROUP BY event, JSON extraction on properties, and recent-first queries.
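Recent-first queries are a good example of why the read side is just as sensitive as the write side. When row groups are laid out in timestamp order, a recency filter can be satisfied largely from row-group min/max statistics, so most of the file is never read. A quick way to sanity-check that, once more as a sketch (the file name is hypothetical and assumes a file written in timestamp order), is to profile the filtered scan and compare it against an unsorted copy:

-- Profile the filtered read. With timestamp-ordered row groups, DuckDB's Parquet
-- reader can prune row groups whose min/max stats fall outside the filter,
-- so the scan should touch far less data than on an unsorted file.
EXPLAIN ANALYZE
SELECT date_trunc('day', timestamp), COUNT(*)
FROM read_parquet('sample_time_sorted.parquet')
WHERE timestamp > NOW() - interval 1 month
GROUP BY 1;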
Every change cascaded. We found that optimizing for a constrained environment (bounded hardware) forces a better architecture than simply scaling up resources: starting with maximum capacity tends to mask fundamental flaws, while solving for a specific memory and CPU envelope forces you to fix them, and because those fixes address the underlying physics, they benefit the entire system.
Keeping Data Awake
The harder problem wasn't just performance; it was economics.
Running sub-second queries on a billion-row table is easy if you're willing to burn warehouse credits, overprovision clusters, and accept unpredictable bills.
But when queries cost money, people stop asking them.
And when people stop asking questions, data goes back to sleep.
This table isn't archived. It isn't cold storage. It isn't "optimized for a BI refresh every 6 hours."
It is active:
- Near-realtime ingestion
- Sub-second interactive queries
- Continuous production workload
No per-query billing, no per-seat pricing, and no "rows scanned" anxiety.
Because the goal isn't to sell compute; it's to keep intelligence awake.
A Quiet Milestone
One customer.
One table.
One billion rows.
1.6TB of raw data.
Sub-second queries.
Near-realtime ingestion.
Two months ago, this was a hypothesis. Today, it's production reality, and we're just getting started.