If you've ever tried to push application data into an analytical warehouse, you know the routine: you start with a few tables and an “export to the warehouse”… and end up operating a pipeline.
The uncomfortable part is that when your inputs are already append-friendly (events, logs, raw facts; time-ordered records that mostly don't change), the default path still forces you into a pipeline and all the usual full-reload vs incremental-sync complexity.
TL;DR
- The problem: Warehouses optimize for querying columnar files, not for ingesting a constant stream of tiny writes, so teams end up with full reloads or fragile incremental syncs.
- The claim: For the common “events + append-friendly facts” case, ingestion should feel like logging: POST rows, query seconds later, accept at-least-once semantics.
- The trade: Move complexity from your pipeline into the storage/ingestion layer (buffering, compaction, schema evolution), and treat “batching” as an implementation detail rather than an API requirement.
This is intentionally not a claim about every dataset. If your workload is dominated by frequent updates to existing rows (a classic OLTP workload), you're in a different world.
The Two Defaults Everyone Ends Up With
Most teams land in one of two camps.
Option 1: Full reloads
Periodically repush everything.
- Pros: Simple to reason about, easy to debug
- Cons: Gets catastrophically expensive as data grows; latency measured in hours
Works until it very suddenly doesn't.
Option 2: Incremental syncs
CDC, cursors, watermarks, “last updated at”.
- Pros: Faster in theory
- Cons: Fragile around schema changes, deletes, and backfills; requires state + retries + edge-case handling
This is where tools like Airbyte, Fivetran, or homegrown pipelines enter the picture. They help, but they don't change the underlying model: you're still building and maintaining ingestion state.
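To make “fragile” concrete, here is a sketch of what a cursor-based sync loop tends to look like. Every name in it is hypothetical (the source, warehouse, and state-store clients are stand-ins), but the shape is familiar: you own the watermark, the retries, and all the edge cases.

```python
# Illustrative watermark-based incremental sync; every name here is hypothetical.
# The point: state, retries, and edge cases are all yours to own.
import time
from datetime import datetime, timedelta, timezone

class TransientError(Exception):
    """Stand-in for whatever your warehouse client throws on a flaky load."""

def incremental_sync(source_db, warehouse, state_store):
    # You own durable sync state now; lose it and you're back to a full reload.
    watermark = state_store.get("events.last_updated_at",
                                datetime(1970, 1, 1, tzinfo=timezone.utc))
    # Lag the upper bound so late-arriving rows aren't silently skipped.
    upper_bound = datetime.now(timezone.utc) - timedelta(minutes=5)

    rows = source_db.query(
        "SELECT * FROM events WHERE updated_at > %s AND updated_at <= %s",
        (watermark, upper_bound),
    )

    for attempt in range(3):                 # retries are your problem too
        try:
            warehouse.load("events", rows)   # hope it's idempotent, or accept duplicates
            break
        except TransientError:
            time.sleep(2 ** attempt)
    else:
        # Watermark not advanced: these rows will be re-sent on the next run.
        raise RuntimeError("sync failed after retries")

    state_store.set("events.last_updated_at", upper_bound)
    # Still unhandled: hard deletes, schema changes, and backfills that rewrite history.
```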
“Just Stream It”: Famous Last Words
At some point, someone suggests the obvious:
“Why don't we just stream the data continuously?”
You already do this for logs and metrics:
- Datadog
- Elasticsearch/OpenSearch
- Loki
- Prometheus remote write
You POST events. They show up quickly. You don't design a compaction strategy up front.
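Concretely, log-style ingestion is a single HTTP call from the application. The snippet below uses Elasticsearch's document index API as one example; the URL, index name, and event fields are placeholders.

```python
# Log-style ingestion: fire-and-forget an event over HTTP.
# URL, index name, and fields are placeholders; the index API itself is standard Elasticsearch.
import datetime
import requests

ES_URL = "http://localhost:9200"  # assumption: a reachable Elasticsearch node

event = {
    "@timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "service": "checkout",
    "event": "order_placed",
    "amount_cents": 4999,
}

# One HTTP call; the document is searchable after the next index refresh (~1s by default).
resp = requests.post(f"{ES_URL}/app-events/_doc", json=event, timeout=5)
resp.raise_for_status()
```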
So why does analytics feel so different?
Because most warehouses really don't want your workload.
Why Warehouses Hate Your INSERTs
Modern analytical systems are optimized for:
- Large batches
- Immutable files, often stored as immutable segments
- Columnar layouts
- Background compaction
Small, frequent inserts are poison:
- Too many tiny files/parts/segments
- Too much metadata churn
- Poor query performance without aggressive merging
- Costs that scale with “how often you poked the system”, not with data volume
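Some rough, made-up numbers show why this hurts: the number of files or parts the system has to track grows with how often you flush, not with how much data you hold.

```python
# Back-of-the-envelope (assumed numbers): part count scales with write frequency, not volume.
flushes_per_second = 1   # one small insert per second per table (assumed)
tables = 20              # assumed
days = 30

files = flushes_per_second * 86_400 * days * tables
print(f"{files:,} tiny files/parts to track, merge, and pay metadata costs for")  # 51,840,000
```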
So the industry standardized on a workaround:
Buffer → queue → object storage → batch load
Kafka. Kinesis. S3. Iceberg/Delta. BigQuery loads. Snowflake stages. This works and scales, but for many teams it's wildly overkill relative to the questions they're trying to answer.
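For later contrast, here is a heavily compressed sketch of that standard path: consume from Kafka, micro-batch into Parquet, park it on S3, and leave the actual warehouse load to yet another job. Topic names, bucket names, and thresholds are all illustrative.

```python
# The "standard workaround", heavily compressed: buffer -> batch -> object storage -> load.
# Topic, bucket, and batch sizes are illustrative; real versions add schemas, retries,
# dead-letter queues, and a scheduler to trigger the warehouse load.
import io
import json
import time
import uuid

import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from kafka import KafkaConsumer

consumer = KafkaConsumer("app-events", bootstrap_servers="localhost:9092")
s3 = boto3.client("s3")

batch, last_flush = [], time.time()
for msg in consumer:
    batch.append(json.loads(msg.value))
    # Micro-batching: flush on size or age so the warehouse sees "big enough" files.
    if len(batch) >= 10_000 or time.time() - last_flush > 60:
        buf = io.BytesIO()
        pq.write_table(pa.Table.from_pylist(batch), buf)
        s3.put_object(
            Bucket="analytics-staging",
            Key=f"events/dt={time.strftime('%Y-%m-%d')}/{uuid.uuid4()}.parquet",
            Body=buf.getvalue(),
        )
        # ...and something else still has to COPY/LOAD these files into the warehouse.
        batch, last_flush = [], time.time()
```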
And yes: the big warehouses have near-real-time ingestion paths, for example, BigQuery streaming inserts / Storage Write API, and Snowflake Snowpipe / Snowpipe Streaming. The catch is that these options tend to come with their own “pipeline tax”: quotas and cost models that get painful at high event rates, operational gotchas around idempotency/duplicates, and practical pressure to introduce staging/connector infrastructure anyway. So culturally and operationally, most teams still end up with a pipeline, because the safe path is still “batch it somewhere else, then load it”.
Why This Mental Model Is Backwards
When you're doing event-style application/product analytics, the raw data usually looks like:
event happened at time T with payload P
Append-only. Mostly immutable. Often naturally partitionable by time.
But instead of treating it like that, we:
- Buffer it
- Rebatch it
- Reformat it
- Reload it
- Then finally query it
All because “warehouses don't like streaming”.
At some point, we stopped questioning that assumption.
What “Ingestion Like Logging” Would Mean
Not “exactly-once”. Not “perfect ordering”. Not “transactions across ten tables”.
More like:
- Insert API: send rows as they happen
- Latency: data becomes queryable quickly, seconds rather than hours
- Ops: no offsets, no micro-batch jobs, no staging buckets
This matches what most teams actually need for exploratory/product analytics: get the data in, accept that it's messy, and make it queryable.
What Changed Recently: Why This Isn't Reckless Anymore
This idea would have been irresponsible 10 years ago, but a few things are different now:
- Columnar formats are mature
- Object storage is cheap
- Embedded analytical engines are fast
- Background compaction is table stakes
- Vectorized execution hides a lot of sins
In other words: we can absorb messiness at write time and clean it up later. That's how every log system works: accept writes cheaply, then compact/index/optimize asynchronously.
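Here is a minimal sketch of that split, with every name invented: writes land in a hot buffer immediately, and a background loop seals and merges segments on its own schedule. Real engines persist the buffer and write columnar files, but the shape of the trade is the same.

```python
# Write path shaped like a log, heavily simplified and entirely illustrative:
# accept writes cheaply into a hot buffer, seal and merge them asynchronously.
import threading
import time

class AppendBuffer:
    def __init__(self):
        self._lock = threading.Lock()
        self._active: list[dict] = []           # hot buffer: a write is just a list append
        self._segments: list[list[dict]] = []   # sealed, immutable batches ("files")

    def append(self, row: dict) -> None:
        with self._lock:                        # O(1) write path; callers never batch
            self._active.append(row)

    def seal(self) -> None:
        # Periodically turn the hot buffer into an immutable segment.
        with self._lock:
            if self._active:
                self._segments.append(self._active)
                self._active = []

    def compact(self) -> None:
        # Background merge: many small segments -> one larger, time-sorted one.
        with self._lock:
            small, self._segments = self._segments, []
        if small:
            merged = sorted((r for seg in small for r in seg),
                            key=lambda r: r.get("ts", 0))  # assumes numeric event timestamps
            with self._lock:
                self._segments.insert(0, merged)

def maintenance_loop(buf: AppendBuffer, interval_s: float = 5.0) -> None:
    # Run in a daemon thread; the writer never waits for this.
    while True:
        time.sleep(interval_s)
        buf.seal()
        buf.compact()
```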
The Trade-offs
There are always trade-offs.
This approach is not for:
- Ultra-low-latency OLTP queries
- High-frequency updates to the same rows
- Strong transactional semantics across tables
- Workloads where “delete” must behave like an immediate, globally consistent erase
You're trading:
- Perfect write efficiency and perfect semantics
for
- Dramatically simpler ingestion and lower operational overhead
The key is: batching doesn't disappear. It moves. The system still buffers and compacts, but as an internal implementation detail, not as the API you have to design around.
And yes, you still need an answer for updates/deletes when they show up. In an append-first model that typically means tombstones + periodic compaction/merge (again, internal mechanics, not your pipeline).
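To make that concrete, here is one possible shape (a sketch, not any particular engine's mechanics): a delete is appended as a tombstone record, and compaction drops the key the next time segments are rewritten.

```python
# One way to handle deletes in an append-first store (illustrative, not a specific engine):
# a delete is just another appended record, applied for real during compaction.

def append_delete(segment: list[dict], key: str) -> None:
    # Instead of erasing in place, append a tombstone marker for the key.
    segment.append({"key": key, "_tombstone": True})

def compact(segments: list[list[dict]]) -> list[dict]:
    """Merge segments oldest-to-newest; tombstoned keys drop out of the rewritten output."""
    live: dict[str, dict] = {}
    for segment in segments:                  # assumed ordering: oldest -> newest
        for row in segment:
            if row.get("_tombstone"):
                live.pop(row["key"], None)    # the delete takes effect here, not at write time
            else:
                live[row["key"]] = row        # last write wins for updates
    return list(live.values())

# Until compaction runs, queries also filter tombstoned keys at read time.
```

The important part is that none of this shows up in the producer's code path: deletes are just more appends.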
A Concrete Example: POST /append
At Altertable, we wanted ingestion to feel closer to logging, so we exposed a simple endpoint:
POST /append
You send rows. We accept them. They land in the lakehouse and become queryable quickly. Under the hood there's buffering, compaction, and merge machinery, but that's the storage engine's job, not your pipeline's.
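For a feel of the developer experience, here is roughly what that looks like from application code. The host, auth header, and request body shape below are my placeholders, not the documented API, so treat it as a sketch.

```python
# Illustrative only: host, auth, and body shape are assumptions, not the documented API.
import datetime
import requests

rows = [
    {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event": "signup",
        "plan": "free",
    }
]

resp = requests.post(
    "https://api.altertable.example/append",          # placeholder host
    headers={"Authorization": "Bearer <token>"},      # placeholder auth
    json={"table": "product_events", "rows": rows},   # assumed body shape
    timeout=5,
)
resp.raise_for_status()
# No queue, no staging bucket, no offsets: the rows should be queryable shortly after.
```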
If you want the concrete API shape, it's documented here:
Schema Evolution Without the Pipeline Tax
One of the hidden complexities in traditional pipelines is schema evolution:
- You add a field
- A type changes
- You need a backfill
Most setups force you to pre-declare schemas and coordinate migrations across producers and pipelines. A different approach is: infer schemas from data and evolve automatically with sane coercion rules, while still allowing an explicit schema when you need control.
The rough model (sketched in code below):
- First append: infer types from JSON
  - Numbers → BIGINT or DOUBLE
  - Strings → VARCHAR
  - RFC3339 timestamps → TIMESTAMP
  - Arrays/objects → JSON
- Subsequent appends: compare the inferred schema to the existing table
  - New columns → add automatically
  - Type conflicts → coerce to compatible types, e.g. int + float → float
  - Missing fields → fill with NULL
- No breaking changes by default: existing queries keep working
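Here is a deliberately simplified sketch of that inference-and-merge step (my own approximation, not the actual implementation): infer a type per JSON value, add unknown columns, and widen on conflict instead of rejecting the append.

```python
# Simplified schema inference + evolution: infer types from JSON rows, widen on conflict.
# The type lattice and coercion rules here are illustrative, not an exact spec.
from datetime import datetime
from typing import Any

def infer_type(value: Any) -> str:
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value.replace("Z", "+00:00"))  # rough RFC3339 check
            return "TIMESTAMP"
        except ValueError:
            return "VARCHAR"
    if isinstance(value, (list, dict)):
        return "JSON"
    return "VARCHAR"  # fallback, e.g. for null-only columns

def widen(existing: str, incoming: str) -> str:
    # Coerce to a compatible type instead of failing the append.
    if existing == incoming:
        return existing
    if {existing, incoming} == {"BIGINT", "DOUBLE"}:
        return "DOUBLE"   # int + float -> float
    return "VARCHAR"      # last resort: stringify rather than break

def evolve(table_schema: dict[str, str], rows: list[dict]) -> dict[str, str]:
    """Return the merged schema; new columns are added, conflicts are widened."""
    for row in rows:
        for column, value in row.items():
            inferred = infer_type(value)
            if column not in table_schema:
                table_schema[column] = inferred   # new column -> add automatically
            else:
                table_schema[column] = widen(table_schema[column], inferred)
    return table_schema
# Missing fields simply aren't in the row, so existing columns stay and read back as NULL.
```

The widening rule is the interesting design choice: when in doubt, degrade to a broader type instead of failing the write.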
Is this perfect type safety? No. It's “good enough for analytics”, and it removes a whole category of operational work.
If you want strict control, you can still create schemas manually via CREATE TABLE. Automatic inference is a convenience, not a requirement.
The Broader Point Isn't Vendor-Specific
You could build this yourself. Some companies already have.
The point is broader:
We've normalized absurd ingestion complexity because we stopped questioning a constraint that often doesn't need to leak into user-facing architecture.
Most teams don't actually need:
- Kafka clusters
- Multi-hop pipelines
- Nightly backfills as a normal operating mode
- A bespoke reprocessing strategy before they can ask questions
They need data to arrive quickly and reliably, without babysitting.
The Open Question
Maybe the real issue isn't tooling.
Maybe it's that we've spent ~15 years optimizing analytics systems for querying, and almost none optimizing them for being written to.
If analytics data is append-only by nature, why are we still treating ingestion like a batch job?
I don't think the answer is settled. But I'm convinced the status quo is lazier than we admit.



