DECEMBER 30, 2025

7 MIN READ

SYLVAIN UTARD

Stop Batching Analytics

Why we're forcing analytics through complex batch pipelines when append-only data should work like logs. The warehouse constraint that stopped making sense.

If you've ever tried to push application data into an analytical warehouse, you know the routine: you start with a few tables and an “export to the warehouse”… and end up operating a pipeline.

The uncomfortable part: when your inputs are already append-friendly (events, logs, raw facts; time-ordered records that mostly don't change), the default path still forces you into a pipeline, with all the usual full-reload vs. incremental-sync complexity.

TL;DR

  • The problem: Warehouses optimize for querying columnar files, not for ingesting a constant stream of tiny writes, so teams end up with full reloads or fragile incremental syncs.
  • The claim: For the common “events + append-friendly facts” case, ingestion should feel like logging: POST rows, query seconds later, accept at-least-once semantics.
  • The trade: Move complexity from your pipeline into the storage/ingestion layer (buffering, compaction, schema evolution) and treat “batching” as an implementation detail rather than an API requirement.

This is intentionally not a claim about every dataset. If your workload is dominated by frequent updates to existing rows (a classic OLTP workload), you're in a different world.

The Two Defaults Everyone Ends Up With

Most teams land in one of two camps.

Option 1: Full reloads

Periodically repush everything.

  • Pros: Simple to reason about, easy to debug
  • Cons: Gets catastrophically expensive as data grows; latency measured in hours

Works until it very suddenly doesn't.

Option 2: Incremental syncs

CDC, cursors, watermarks, “last updated at”.

  • Pros: Faster in theory
  • Cons: Fragile around schema changes, deletes, and backfills; requires state + retries + edge-case handling

This is where tools like Airbyte, Fivetran, or homegrown pipelines enter the picture. They help, but they don't change the underlying model: you're still building and maintaining ingestion state.
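
To make the “state + retries + edge-case handling” point concrete, here's a minimal sketch of a cursor-based sync. Everything here is illustrative: source and warehouse are hypothetical client handles, and the cursor file stands in for whatever state store you'd actually use.

    import json

    STATE_FILE = "sync_state.json"  # hypothetical cursor store

    def load_cursor() -> str:
        try:
            with open(STATE_FILE) as f:
                return json.load(f)["last_updated_at"]
        except FileNotFoundError:
            return "1970-01-01T00:00:00Z"  # first run degenerates to a full backfill

    def save_cursor(cursor: str) -> None:
        with open(STATE_FILE, "w") as f:
            json.dump({"last_updated_at": cursor}, f)

    def sync_once(source, warehouse) -> None:
        cursor = load_cursor()
        # The query is the easy part; the edge cases are not: hard deletes
        # never match this predicate, a schema change breaks the load, and
        # a backfill that rewrites updated_at re-syncs rows.
        rows = source.query(
            "SELECT * FROM events WHERE updated_at > %s ORDER BY updated_at",
            (cursor,),
        )
        if rows:
            warehouse.load(rows)
            # Advance the cursor only after the load commits: too early and
            # you drop rows on failure, too late and you duplicate them.
            save_cursor(rows[-1]["updated_at"])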

“Just Stream It”: Famous Last Words

At some point, someone suggests the obvious:

“Why don't we just stream the data continuously?”

You already do this for logs and metrics:

  • Datadog
  • Elasticsearch/OpenSearch
  • Loki
  • Prometheus remote write

You POST events. They show up quickly. You don't design a compaction strategy up front.
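
For instance, pushing a log line into Grafana Loki is a single POST against its push API (sketched here against a local Loki; the labels and payload are illustrative, and timestamps are nanoseconds-as-strings per that API):

    import json
    import time
    import requests

    payload = {
        "streams": [{
            "stream": {"app": "checkout", "env": "prod"},  # label set
            "values": [
                [str(time.time_ns()), json.dumps({"event": "purchase", "amount": 42})],
            ],
        }]
    }
    # One POST, no staging bucket, no load job -- queryable moments later.
    requests.post(
        "http://localhost:3100/loki/api/v1/push",
        json=payload,
        timeout=5,
    ).raise_for_status()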

So why does analytics feel so different?

Because most warehouses really don't want your workload.

Why Warehouses Hate Your INSERTs

Modern analytical systems are optimized for:

  • Large batches
  • Immutable files, often organized into segments or parts
  • Columnar layouts
  • Background compaction

Small, frequent inserts are poison:

  • Too many tiny files/parts/segments
  • Too much metadata churn
  • Poor query performance without aggressive merging
  • Costs that scale with “how often you poked the system”, not with data volume

So the industry standardized on a workaround:

Buffer → queue → object storage → batch load

Kafka. Kinesis. S3. Iceberg/Delta. BigQuery loads. Snowflake stages. This works and scales, but for many teams it's wildly overkill relative to the questions they're trying to answer.
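
Here's a sketch of just one hop of that workaround (hypothetical bucket name and flush thresholds; boto3 for the S3 leg). Notice that even the “simple” version implies a buffer, a flush policy, and a second system to load the files:

    import json
    import time
    import uuid
    import boto3  # assumes AWS credentials are configured

    s3 = boto3.client("s3")
    BUCKET = "analytics-staging"  # hypothetical staging bucket
    FLUSH_ROWS, FLUSH_SECONDS = 10_000, 60

    buffer: list[dict] = []
    last_flush = time.monotonic()

    def enqueue(event: dict) -> None:
        buffer.append(event)
        if len(buffer) >= FLUSH_ROWS or time.monotonic() - last_flush > FLUSH_SECONDS:
            flush()

    def flush() -> None:
        global last_flush
        if not buffer:
            return
        body = "\n".join(json.dumps(e) for e in buffer).encode()
        key = f"events/{time.strftime('%Y/%m/%d')}/{uuid.uuid4()}.ndjson"
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        buffer.clear()
        last_flush = time.monotonic()
        # A separate job (COPY INTO, Snowpipe, a load script) now has to
        # notice this file and load it -- that's the pipeline you operate.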

And yes: the big warehouses have near-real-time ingestion paths, for example, BigQuery streaming inserts / Storage Write API, and Snowflake Snowpipe / Snowpipe Streaming. The catch is that these options tend to come with their own “pipeline tax”: quotas and cost models that get painful at high event rates, operational gotchas around idempotency/duplicates, and practical pressure to introduce staging/connector infrastructure anyway. So culturally and operationally, most teams still end up with a pipeline, because the safe path is still “batch it somewhere else, then load it”.

Why This Mental Model Is Backwards

When you're doing event-style application/product analytics, the raw data usually looks like:

event happened at time T
with payload P

Append-only. Mostly immutable. Often naturally partitionable by time.

But instead of treating it like that, we:

  • Buffer it
  • Rebatch it
  • Reformat it
  • Reload it
  • Then finally query it

All because “warehouses don't like streaming”.

At some point, we stopped questioning that assumption.

What “Ingestion Like Logging” Would Mean

Not “exactly-once”. Not “perfect ordering”. Not “transactions across ten tables”.

More like:

  • Insert API: send rows as they happen
  • Latency: data becomes queryable quickly, seconds rather than hours
  • Ops: no offsets, no micro-batch jobs, no staging buckets

This matches what most teams actually need for exploratory/product analytics: get the data in, accept that it's messy, and make it queryable.

What Changed Recently: Why This Isn't Reckless Anymore

This idea would have been irresponsible 10 years ago, but a few things are different now:

  • Columnar formats are mature
  • Object storage is cheap
  • Embedded analytical engines are fast
  • Background compaction is table stakes
  • Vectorized execution hides a lot of sins

In other words: we can absorb messiness at write time and clean it up later. That's how every log system works: accept writes cheaply, then compact/index/optimize asynchronously.
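
As an illustration of the “clean it up later” half, here's a background-compaction sketch using pyarrow (assumptions: parquet parts sharing a schema in one directory, and an event_time column to sort on):

    from pathlib import Path
    import pyarrow as pa
    import pyarrow.parquet as pq

    def compact(part_dir: str, target: str) -> None:
        """Merge many small parquet parts (cheap write-time files) into
        one larger file that's efficient to scan."""
        parts = sorted(Path(part_dir).glob("*.parquet"))
        if len(parts) < 2:
            return
        merged = pa.concat_tables([pq.read_table(p) for p in parts])
        # Sorting by time recovers the locality that tiny, out-of-order
        # writes destroyed.
        merged = merged.sort_by("event_time")
        pq.write_table(merged, target, compression="zstd")
        for p in parts:
            p.unlink()  # small parts are superseded by the compacted file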


The Trade-offs

There are always trade-offs.

This approach is not for:

  • Ultra-low-latency OLTP queries
  • High-frequency updates to the same rows
  • Strong transactional semantics across tables
  • Workloads where “delete” must behave like an immediate, globally consistent erase

You're trading:

  • Perfect write efficiency and perfect semantics

for

  • Dramatically simpler ingestion and lower operational overhead

The key is: batching doesn't disappear. It moves. The system still buffers and compacts, but as an internal implementation detail, not as the API you have to design around.

And yes, you still need an answer for updates/deletes when they show up. In an append-first model that typically means tombstones + periodic compaction/merge (again, internal mechanics, not your pipeline).
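
To make the tombstone idea concrete, here's a toy merge, assuming each row carries a key, writes arrive in order, and a deleted flag marks a tombstone:

    from typing import Iterable

    def merge_with_tombstones(rows: Iterable[dict]) -> list[dict]:
        """Writers only ever append: an update is a new row with the same
        key; a delete is a tombstone row. Compaction replays the log and
        keeps the last word for each key."""
        latest: dict[str, dict] = {}
        for row in rows:  # rows assumed ordered by write time
            latest[row["key"]] = row
        return [r for r in latest.values() if not r.get("deleted", False)]

    # The reader sees the post-merge view, not the raw log:
    log = [
        {"key": "u1", "plan": "free"},
        {"key": "u1", "plan": "pro"},    # update = newer append
        {"key": "u2", "plan": "free"},
        {"key": "u2", "deleted": True},  # delete = tombstone
    ]
    assert merge_with_tombstones(log) == [{"key": "u1", "plan": "pro"}]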

A Concrete Example: POST /append

At Altertable, we wanted ingestion to feel closer to logging, so we exposed a simple endpoint:

POST /append

You send rows. We accept them. They land in the lakehouse and become queryable quickly. Under the hood there's buffering, compaction, and merge machinery, but that's the storage engine's job, not your pipeline's.
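
A sketch of what calling it looks like (the URL and payload shape here are illustrative, not the actual contract; see the docs linked below for that):

    import requests

    resp = requests.post(
        "https://api.altertable.com/append",  # illustrative URL
        headers={"Authorization": "Bearer <token>"},
        json={
            "table": "page_views",
            "rows": [
                {"user_id": 42, "path": "/pricing", "ts": "2025-12-30T10:12:03Z"},
            ],
        },
        timeout=5,
    )
    resp.raise_for_status()
    # At-least-once in practice: on a timeout, just retry the POST.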

If you want the concrete API shape, it's documented here:

See the API documentation →

Schema Evolution Without the Pipeline Tax

One of the hidden complexities in traditional pipelines is schema evolution:

  • You add a field
  • A type changes
  • You need a backfill

Most setups force you to pre-declare schemas and coordinate migrations across producers and pipelines. A different approach is: infer schemas from data and evolve automatically with sane coercion rules, while still allowing an explicit schema when you need control.

The rough model (a code sketch follows the list):

  1. First append: infer types from JSON

    • Numbers → BIGINT or DOUBLE
    • Strings → VARCHAR
    • RFC3339 timestamps → TIMESTAMP
    • Arrays/objects → JSON
  2. Subsequent appends: compare inferred schema to existing table

    • New columns → add automatically
    • Type conflicts → coerce to compatible types, e.g. int + float → float
    • Missing fields → fill with NULL
  3. No breaking changes by default: existing queries keep working
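
Here's that model as a sketch, not Altertable's actual implementation: the widening table and the VARCHAR fallback for incompatible types are illustrative.

    from datetime import datetime

    def infer_type(value) -> str:
        """Map a JSON value to a column type, roughly per the rules above."""
        if isinstance(value, bool):  # must precede int: bool is an int in Python
            return "BOOLEAN"
        if isinstance(value, int):
            return "BIGINT"
        if isinstance(value, float):
            return "DOUBLE"
        if isinstance(value, str):
            try:
                datetime.fromisoformat(value.replace("Z", "+00:00"))
                return "TIMESTAMP"  # RFC3339-looking string
            except ValueError:
                return "VARCHAR"
        if isinstance(value, (list, dict)):
            return "JSON"
        return "VARCHAR"  # null/unknown: widest safe default

    # Widening rules for type conflicts between appends (int + float -> float).
    COERCE = {("BIGINT", "DOUBLE"): "DOUBLE", ("DOUBLE", "BIGINT"): "DOUBLE"}

    def evolve(table_schema: dict, row: dict) -> dict:
        """Merge one row's inferred schema into the table schema: new
        columns are added, conflicting types widen, and columns the row
        omits simply stay (reading as NULL for this row)."""
        for col, value in row.items():
            inferred = infer_type(value)
            existing = table_schema.get(col)
            if existing is None:
                table_schema[col] = inferred
            elif existing != inferred:
                table_schema[col] = COERCE.get((existing, inferred), "VARCHAR")
        return table_schema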

Is this perfect type safety? No. It's “good enough for analytics”, and it removes a whole category of operational work.

If you want strict control, you can still create schemas manually via CREATE TABLE. Automatic inference is a convenience, not a requirement.

The Broader Point Isn't Vendor-Specific

You could build this yourself. Some companies already have.

The point is broader:

We've normalized absurd ingestion complexity because we stopped questioning a constraint that often doesn't need to leak into user-facing architecture.

Most teams don't actually need:

  • Kafka clusters
  • Multi-hop pipelines
  • Nightly backfills as a normal operating mode
  • A bespoke reprocessing strategy before they can ask questions

They need data to arrive quickly and reliably, without babysitting.

The Open Question

Maybe the real issue isn't tooling.

Maybe it's that we've spent ~15 years optimizing analytics systems for querying, and almost none optimizing them for being written to.

If analytics data is append-only by nature, why are we still treating ingestion like a batch job?

I don't think the answer is settled. But I'm convinced the status quo is lazier than we admit.
