FEBRUARY 17, 2026

4 MIN READ

SYLVAIN UTARD

One Billion Rows

At 1 billion rows, every shortcut comes back to collect interest. Here's how we achieved sub-second queries with near-realtime ingestion.

We just crossed 1,000,000,000 rows in a single customer's events table.

It did not go smoothly.

For the past two months in private beta, we pushed one real production workload to a scale where nice architectural diagrams stop mattering and physics starts enforcing rules. At that point, small inefficiencies compound, metadata balloons, caches thrash, and comfortable assumptions collapse. We knew it would happen at some point, of course. But there's a difference between anticipating scale and living through it.

Today that table holds:

  • 1B rows
  • 1.6TB of raw data
  • 33 columns per row (30 VARCHAR with low-to-medium cardinality, 1 TIMESTAMP, 2 very large JSON)
  • Near-realtime ingestion
  • Continuous production traffic
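For concreteness, the table's shape is roughly the following (a sketch: only event, timestamp, uuid, and properties appear in this post's queries; the remaining column names are placeholders):

CREATE TABLE events (
  uuid       VARCHAR,    -- row identifier
  event      VARCHAR,    -- event name, low-to-medium cardinality
  timestamp  TIMESTAMP,  -- event time
  properties JSON        -- one of the two very large JSON payloads (BIGJSON in production)
  -- ...plus the remaining VARCHAR attribute columns
);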

Queries like this run in about a second:

SELECT event, COUNT(*)
FROM events
GROUP BY 1
ORDER BY 2 DESC;

So does this:

SELECT uuid
FROM events
ORDER BY timestamp DESC
LIMIT 10;

And even this (mixed filters, JSON extraction, and a GROUP BY):

SELECT
  date_trunc('day', timestamp),
  bigjson_extract(properties, '$.my_attr'),
  COUNT(*)
FROM events
WHERE event = 'My Event'
  AND timestamp > NOW() - interval 1 month
GROUP BY 1, 2
ORDER BY 1, 2;

Most of the time, it's sub-second.

Getting there broke almost everything.

When Scale Gets Real

At 10M rows, everything looks fine. At 100M, you start noticing patterns. At 1B, every shortcut you ever took comes back to collect interest.

We experimented with:

  • Partitioning strategies that looked great on paper but performed terribly in practice
  • Different lakehouse sort orders
  • Multiple Parquet compression algorithms
  • Compaction heuristics to fight the small-file problem
  • Reimplementing a BIGJSON field to avoid naïve JSON extraction costs
  • Different caching strategies and eviction configurations
  • Scaling underlying machines and memory envelopes
  • Contributing upstream improvements to DuckDB and DuckLake (e.g., DuckLake PR #668)

In a lakehouse architecture, nothing is isolated: storage layout, compression, sort order, caching, and vectorized execution are tightly coupled. Every layer mattered, and every "small" choice had a way of surfacing later as either a bill, a latency spike, or a new operational headache.

You don't feel that at 10 million rows; you absolutely feel it at a billion.
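As an illustration of the compression experiments (a sketch, not our exact pipeline), DuckDB lets you write the same data under different Parquet codecs and compare file sizes and scan times directly:

COPY (SELECT * FROM events) TO 'events_snappy.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);
COPY (SELECT * FROM events) TO 'events_zstd.parquet'   (FORMAT PARQUET, COMPRESSION ZSTD);

Running a representative query against each output gives a quick read on the size-versus-speed trade-off before committing to a codec.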

The Compression × Sort Order Surprise

One of the most counterintuitive lessons was how tightly compression and sort order are bound together.

Most teams think of compression as a storage concern (something that affects cost per GB). In a lakehouse, compression is a performance primitive.

Sort order influences:

  • How well Parquet encodes values
  • How effective run-length and dictionary encoding become
  • How much data (and how many files) must be scanned for common filters
  • How vectorized execution behaves in DuckDB
  • How much I/O your system actually performs

Better sort order → better compression patterns → less I/O → faster queries → lower cost.

At scale, this isn't marginal. It's structural.

We ended up tuning sort strategies not just for filtering efficiency, but for compression behavior under realistic mixed workloads: time-based scans, GROUP BY event, JSON extraction on properties, and recent-first queries.
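A minimal sketch of that kind of experiment (illustrative; the sort keys we actually shipped are workload-specific): rewrite the same data under different ORDER BY clauses and compare the resulting Parquet files. Sorting low-cardinality columns first produces longer runs, which run-length and dictionary encoding reward:

COPY (SELECT * FROM events ORDER BY event, timestamp)
  TO 'events_sorted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
COPY (SELECT * FROM events ORDER BY uuid)
  TO 'events_shuffled.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

The well-sorted file tends to be dramatically smaller, and smaller files mean less I/O on every subsequent scan.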

Every change cascaded. We also found that optimizing for a constrained environment forces a better architecture than simply scaling up resources. Starting with maximum resources masks fundamental flaws; solving for a specific memory or CPU envelope forces you to fix them, and because those fixes address the underlying physics, they benefit the entire system.

Keeping Data Awake

The harder problem wasn't just performance; it was economics.

Running sub-second queries on a billion-row table is easy if you're willing to burn warehouse credits, overprovision clusters, and accept unpredictable bills.

But when queries cost money, people stop asking them.

And when people stop asking questions, data goes back to sleep.

This table isn't archived. It isn't cold storage. It isn't "optimized for a BI refresh every 6 hours."

It is active:

  • Near-realtime ingestion
  • Sub-second interactive queries
  • Continuous production workload

No per-query billing, no per-seat pricing, and no "rows scanned" anxiety.

Because the goal isn't to sell compute; it's to keep intelligence awake.

A Quiet Milestone

One customer.
One table.
One billion rows.
1.6TB of raw data.
Sub-second queries.
Near-realtime ingestion.

Two months ago, this was a hypothesis. Today, it's production reality, and we're just getting started.

Sylvain Utard, Co-Founder & CEO at Altertable

Seasoned leader in B2B SaaS and B2C. Scaled 100+ teams at Algolia (1st hire) & Sorare. Passionate about data, performance and productivity.
