DECEMBER 30, 2025

7 MIN READ

SYLVAIN UTARD

Stop Batching Analytics

Why we're forcing analytics through complex batch pipelines when append-only data should work like logs. The warehouse constraint that stopped making sense.

If you've ever tried to push application data into an analytical warehouse, you know the routine: you start with a few tables and an “export to the warehouse”… and end up operating a pipeline.

The uncomfortable part: when your inputs are already append-friendly (events, logs, raw facts; time-ordered records that mostly don't change), the default path still forces you into a pipeline, with all the usual full-reload vs. incremental-sync complexity.

TL;DR

  • The problem: Warehouses optimize for querying columnar files, not for ingesting a constant stream of tiny writes, so teams end up with full reloads or fragile incremental syncs.
  • The claim: For the common “events + append-friendly facts” case, ingestion should feel like logging: POST rows, query seconds later, accept at-least-once semantics.
  • The trade: Move complexity from your pipeline into the storage/ingestion layer (buffering, compaction, schema evolution) and treat “batching” as an implementation detail rather than an API requirement.

This is intentionally not a claim about every dataset. If your workload is dominated by frequent updates to existing rows (a classic OLTP workload), you're in a different world.

The Two Defaults Everyone Ends Up With

Most teams land in one of two camps.

Option 1: Full reloads

Periodically repush everything.

  • Pros: Simple to reason about, easy to debug
  • Cons: Gets catastrophically expensive as data grows; latency measured in hours

Works until it very suddenly doesn't.

Option 2: Incremental syncs

CDC, cursors, watermarks, “last updated at”.

  • Pros: Faster in theory
  • Cons: Fragile around schema changes, deletes, and backfills; requires state + retries + edge-case handling

This is where tools like Airbyte, Fivetran, or homegrown pipelines enter the picture. They help, but they don't change the underlying model: you're still building and maintaining ingestion state.
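
To make the “state + retries + edge-case handling” point concrete, here's a minimal sketch of a cursor-based sync. Everything here is illustrative: source and warehouse are hypothetical client handles, and the cursor file stands in for whatever state store you'd actually use.

    import json

    STATE_FILE = "sync_state.json"  # hypothetical cursor store

    def load_cursor() -> str:
        try:
            with open(STATE_FILE) as f:
                return json.load(f)["last_updated_at"]
        except FileNotFoundError:
            return "1970-01-01T00:00:00Z"  # first run degenerates to a full backfill

    def save_cursor(cursor: str) -> None:
        with open(STATE_FILE, "w") as f:
            json.dump({"last_updated_at": cursor}, f)

    def sync_once(source, warehouse) -> None:
        cursor = load_cursor()
        # The query is the easy part; the edge cases are not: hard deletes
        # never match this predicate, a schema change breaks the load, and
        # a backfill that rewrites updated_at re-syncs rows.
        rows = source.query(
            "SELECT * FROM events WHERE updated_at > %s ORDER BY updated_at",
            (cursor,),
        )
        if rows:
            warehouse.load(rows)
            # Advance the cursor only after the load commits: too early and
            # you drop rows on failure, too late and you duplicate them.
            save_cursor(rows[-1]["updated_at"])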

“Just Stream It”: Famous Last Words

At some point, someone suggests the obvious:

“Why don't we just stream the data continuously?”

You already do this for logs and metrics:

  • Datadog
  • Elasticsearch/OpenSearch
  • Loki
  • Prometheus remote write

You POST events. They show up quickly. You don't design a compaction strategy up front.
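
For instance, pushing a log line into Grafana Loki is a single POST against its push API (sketched here against a local Loki; the labels and payload are illustrative, and timestamps are nanoseconds-as-strings per that API):

    import json
    import time
    import requests

    payload = {
        "streams": [{
            "stream": {"app": "checkout", "env": "prod"},  # label set
            "values": [
                [str(time.time_ns()), json.dumps({"event": "purchase", "amount": 42})],
            ],
        }]
    }
    # One POST, no staging bucket, no load job -- queryable moments later.
    requests.post(
        "http://localhost:3100/loki/api/v1/push",
        json=payload,
        timeout=5,
    ).raise_for_status()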

So why does analytics feel so different?

Because most warehouses really don't want your workload.

Why Warehouses Hate Your INSERTs

Modern analytical systems are optimized for:

  • Large batches
  • Immutable files, often organized into segments or parts
  • Columnar layouts
  • Background compaction

Small, frequent inserts are poison:

  • Too many tiny files/parts/segments
  • Too much metadata churn
  • Poor query performance without aggressive merging
  • Costs that scale with “how often you poked the system”, not with data volume

So the industry standardized on a workaround:

Buffer → queue → object storage → batch load

Kafka. Kinesis. S3. Iceberg/Delta. BigQuery loads. Snowflake stages. This works and scales, but for many teams it's wildly overkill relative to the questions they're trying to answer.
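
Here's a sketch of just one hop of that workaround (hypothetical bucket name and flush thresholds; boto3 for the S3 leg). Notice that even the “simple” version implies a buffer, a flush policy, and a second system to load the files:

    import json
    import time
    import uuid
    import boto3  # assumes AWS credentials are configured

    s3 = boto3.client("s3")
    BUCKET = "analytics-staging"  # hypothetical staging bucket
    FLUSH_ROWS, FLUSH_SECONDS = 10_000, 60

    buffer: list[dict] = []
    last_flush = time.monotonic()

    def enqueue(event: dict) -> None:
        buffer.append(event)
        if len(buffer) >= FLUSH_ROWS or time.monotonic() - last_flush > FLUSH_SECONDS:
            flush()

    def flush() -> None:
        global last_flush
        if not buffer:
            return
        body = "\n".join(json.dumps(e) for e in buffer).encode()
        key = f"events/{time.strftime('%Y/%m/%d')}/{uuid.uuid4()}.ndjson"
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        buffer.clear()
        last_flush = time.monotonic()
        # A separate job (COPY INTO, Snowpipe, a load script) now has to
        # notice this file and load it -- that's the pipeline you operate.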

And yes: the big warehouses have near-real-time ingestion paths, for example, BigQuery streaming inserts / Storage Write API, and Snowflake Snowpipe / Snowpipe Streaming. The catch is that these options tend to come with their own “pipeline tax”: quotas and cost models that get painful at high event rates, operational gotchas around idempotency/duplicates, and practical pressure to introduce staging/connector infrastructure anyway. So culturally and operationally, most teams still end up with a pipeline, because the safe path is still “batch it somewhere else, then load it”.

Why This Mental Model Is Backwards

When you're doing event-style application/product analytics, the raw data usually looks like:

event happened at time T
with payload P

Append-only. Mostly immutable. Often naturally partitionable by time.

But instead of treating it like that, we:

  • Buffer it
  • Rebatch it
  • Reformat it
  • Reload it
  • Then finally query it

All because “warehouses don't like streaming”.

At some point, we stopped questioning that assumption.

What “Ingestion Like Logging” Would Mean

Not “exactly-once”. Not “perfect ordering”. Not “transactions across ten tables”.

More like:

  • Insert API: send rows as they happen
  • Latency: data becomes queryable quickly, seconds rather than hours
  • Ops: no offsets, no micro-batch jobs, no staging buckets

This matches what most teams actually need for exploratory/product analytics: get the data in, accept that it's messy, and make it queryable.

What Changed Recently: Why This Isn't Reckless Anymore

This idea would have been irresponsible 10 years ago, but a few things are different now:

  • Columnar formats are mature
  • Object storage is cheap
  • Embedded analytical engines are fast
  • Background compaction is table stakes
  • Vectorized execution hides a lot of sins

In other words: we can absorb messiness at write time and clean it up later. That's how every log system works: accept writes cheaply, then compact/index/optimize asynchronously.
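
As an illustration of the “clean it up later” half, here's a background-compaction sketch using pyarrow (assumptions: parquet parts sharing a schema in one directory, and an event_time column to sort on):

    from pathlib import Path
    import pyarrow as pa
    import pyarrow.parquet as pq

    def compact(part_dir: str, target: str) -> None:
        """Merge many small parquet parts (cheap write-time files) into
        one larger file that's efficient to scan."""
        parts = sorted(Path(part_dir).glob("*.parquet"))
        if len(parts) < 2:
            return
        merged = pa.concat_tables([pq.read_table(p) for p in parts])
        # Sorting by time recovers the locality that tiny, out-of-order
        # writes destroyed.
        merged = merged.sort_by("event_time")
        pq.write_table(merged, target, compression="zstd")
        for p in parts:
            p.unlink()  # small parts are superseded by the compacted file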


The Trade-offs

There are always trade-offs.

This approach is not for:

  • Ultra-low-latency OLTP queries
  • High-frequency updates to the same rows
  • Strong transactional semantics across tables
  • Workloads where “delete” must behave like an immediate, globally consistent erase

You're trading:

  • Perfect write efficiency and perfect semantics

for

  • Dramatically simpler ingestion and lower operational overhead

The key is: batching doesn't disappear. It moves. The system still buffers and compacts, but as an internal implementation detail, not as the API you have to design around.

And yes, you still need an answer for updates/deletes when they show up. In an append-first model that typically means tombstones + periodic compaction/merge (again, internal mechanics, not your pipeline).
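
To make the tombstone idea concrete, here's a toy merge, assuming each row carries a key, writes arrive in order, and a deleted flag marks a tombstone:

    from typing import Iterable

    def merge_with_tombstones(rows: Iterable[dict]) -> list[dict]:
        """Writers only ever append: an update is a new row with the same
        key; a delete is a tombstone row. Compaction replays the log and
        keeps the last word for each key."""
        latest: dict[str, dict] = {}
        for row in rows:  # rows assumed ordered by write time
            latest[row["key"]] = row
        return [r for r in latest.values() if not r.get("deleted", False)]

    # The reader sees the post-merge view, not the raw log:
    log = [
        {"key": "u1", "plan": "free"},
        {"key": "u1", "plan": "pro"},    # update = newer append
        {"key": "u2", "plan": "free"},
        {"key": "u2", "deleted": True},  # delete = tombstone
    ]
    assert merge_with_tombstones(log) == [{"key": "u1", "plan": "pro"}]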

A Concrete Example: POST /append

At Altertable, we wanted ingestion to feel closer to logging, so we exposed a simple endpoint:

POST /append

You send rows. We accept them. They land in the lakehouse and become queryable quickly. Under the hood there's buffering, compaction, and merge machinery, but that's the storage engine's job, not your pipeline's.
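
A sketch of what calling it looks like (the URL and payload shape here are illustrative, not the actual contract; see the docs linked below for that):

    import requests

    resp = requests.post(
        "https://api.altertable.com/append",  # illustrative URL
        headers={"Authorization": "Bearer <token>"},
        json={
            "table": "page_views",
            "rows": [
                {"user_id": 42, "path": "/pricing", "ts": "2025-12-30T10:12:03Z"},
            ],
        },
        timeout=5,
    )
    resp.raise_for_status()
    # At-least-once in practice: on a timeout, just retry the POST.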

If you want the concrete API shape, it's documented here:

See the API documentation →

Schema Evolution Without the Pipeline Tax

One of the hidden complexities in traditional pipelines is schema evolution:

  • You add a field
  • A type changes
  • You need a backfill

Most setups force you to pre-declare schemas and coordinate migrations across producers and pipelines. A different approach is: infer schemas from data and evolve automatically with sane coercion rules, while still allowing an explicit schema when you need control.

The rough model (a code sketch follows the list):

  1. First append: infer types from JSON

    • Numbers → BIGINT or DOUBLE
    • Strings → VARCHAR
    • RFC3339 timestamps → TIMESTAMP
    • Arrays/objects → JSON
  2. Subsequent appends: compare inferred schema to existing table

    • New columns → add automatically
    • Type conflicts → coerce to compatible types, e.g. int + float → float
    • Missing fields → fill with NULL
  3. No breaking changes by default: existing queries keep working
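
Here's that model as a sketch, not Altertable's actual implementation: the widening table and the VARCHAR fallback for incompatible types are illustrative.

    from datetime import datetime

    def infer_type(value) -> str:
        """Map a JSON value to a column type, roughly per the rules above."""
        if isinstance(value, bool):  # must precede int: bool is an int in Python
            return "BOOLEAN"
        if isinstance(value, int):
            return "BIGINT"
        if isinstance(value, float):
            return "DOUBLE"
        if isinstance(value, str):
            try:
                datetime.fromisoformat(value.replace("Z", "+00:00"))
                return "TIMESTAMP"  # RFC3339-looking string
            except ValueError:
                return "VARCHAR"
        if isinstance(value, (list, dict)):
            return "JSON"
        return "VARCHAR"  # null/unknown: widest safe default

    # Widening rules for type conflicts between appends (int + float -> float).
    COERCE = {("BIGINT", "DOUBLE"): "DOUBLE", ("DOUBLE", "BIGINT"): "DOUBLE"}

    def evolve(table_schema: dict, row: dict) -> dict:
        """Merge one row's inferred schema into the table schema: new
        columns are added, conflicting types widen, and columns the row
        omits simply stay (reading as NULL for this row)."""
        for col, value in row.items():
            inferred = infer_type(value)
            existing = table_schema.get(col)
            if existing is None:
                table_schema[col] = inferred
            elif existing != inferred:
                table_schema[col] = COERCE.get((existing, inferred), "VARCHAR")
        return table_schema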

Is this perfect type safety? No. It's “good enough for analytics”, and it removes a whole category of operational work.

If you want strict control, you can still create schemas manually via CREATE TABLE. Automatic inference is a convenience, not a requirement.

The Broader Point Isn't Vendor-Specific

You could build this yourself. Some companies already have.

The point is broader:

We've normalized absurd ingestion complexity because we stopped questioning a constraint that often doesn't need to leak into user-facing architecture.

Most teams don't actually need:

  • Kafka clusters
  • Multi-hop pipelines
  • Nightly backfills as a normal operating mode
  • A bespoke reprocessing strategy before they can ask questions

They need data to arrive quickly and reliably, without babysitting.

The Open Question

Maybe the real issue isn't tooling.

Maybe it's that we've spent ~15 years optimizing analytics systems for querying, and almost none optimizing them for being written to.

If analytics data is append-only by nature, why are we still treating ingestion like a batch job?

I don't think the answer is settled. But I'm convinced the status quo is lazier than we admit.
