JUNE 30, 2026

8 MIN READ

SYLVAIN UTARD

Stack Drift

Why mature data stacks keep accumulating copy layers, and how direct relational federation gives dbt a simpler path over live operational sources.

Stack drift is what happens when every sensible data decision leaves behind another permanent layer.

A lot of modern data stacks were built from the same playbook: start with operational databases, add a cloud warehouse, introduce dbt and Airflow, use object storage as a landing zone, write Python extract jobs for the sources that do not fit cleanly, then layer BI and AI tools on top of the curated outputs.

For a while, that architecture can look mature. There are DAGs, models, landing zones, staging tables, marts, dashboards, alerts, retries, and runbooks. The system has all the visible signs of a serious data platform.

The day-to-day experience is usually less convincing. A dashboard goes stale, or someone asks Claude a warehouse-backed question and gets an answer that looks plausible but does not match the trusted metric. The failure could be in the extract code, the object-storage landing step, the warehouse load, a dbt model, a warehouse-specific access shim, a stale replica, an Airflow task, or a downstream serving table.

From the outside, this can sound like an endless list of good reasons why the answer is late, inconsistent, or not ready for production. Each reason may be technically correct. That is the trap. The architecture gives the team too many places where a valid explanation can hide.

That is not a healthy data platform. It is a system where correctness emerges only after enough retries, failures are hard to localize, and the most important operational knowledge lives in the heads of the people who have debugged it before.

We are used to seeing companies in exactly this situation. The stack has a large dbt estate, custom ELT code, orchestration, object-storage landing zones, a cloud warehouse, BI dashboards, and AI tools querying curated datasets. On paper, it looks like a mature modern data stack. In practice, maintaining it means operating a multi-hop pipeline with poor observability, unclear failure boundaries, duplicated ingestion paths, and too many places where the same row can be copied, transformed, or delayed, or can silently diverge.

The problem is not a lack of tooling. It is that too much of the tooling exists only to move relational data into the warehouse before anyone can transform it.

That is the copy-layer problem, and it is one of the most common forms of stack drift.

The copy layer shows up slowly

It rarely begins as a bad architecture decision. It begins with one practical need: get operational data into the warehouse so the data team can model it.

Then another source appears. Then a legacy database. Then a customer-facing workflow needs fresher numbers. Then a dashboard depends on a table that is produced by a custom extract job, a landing bucket, a warehouse load, a dbt model, and an Airflow DAG that has its own retry logic.

After a few years, many teams have several ways to move the same kind of relational data:

Python or Airflow jobs that mirror Postgres or MySQL
object-storage landing zones that exist only as a handoff into the warehouse
warehouse tables that are raw copies of operational tables
dbt models that clean and reshape those copies
occasional warehouse-specific federation wrappers for sources that were never fully copied

None of this is irrational. Each layer solved a real problem at the time.

The pain is that the layers become the platform. A row is no longer just a row. It is something that may have been extracted, serialized, landed, loaded, transformed, filtered, retried, and re-modeled before anyone can ask a business question about it.

When the number is wrong, the team does not debug the metric. They debug the path.

That path often looks like this:

[Figure: the legacy warehouse copy layer. Operational databases flow through extract jobs, a landing zone, and warehouse copies into dbt models, then on to downstream BI, apps, and agents.]

The better question

The useful question is not "Can we make the pipeline more reliable?"

The useful question is:

Why are we copying this relational data before dbt can model it?

For a lot of operational analytics, the source is already queryable. It is a database. The team does not need a bespoke transport system for every table before SQL can touch it. They need the runtime that executes dbt to see those sources directly.

That is where federation becomes real.

Not federation as a special warehouse function. Not SQL hidden inside strings. Not another exception path in a large warehouse project.

Federation as part of the lakehouse itself: operational systems added as external catalogs, queryable by the same SQL runtime that runs the dbt project.

The Altertable shape

With Altertable, the first step is to configure the operational sources as external catalogs in the lakehouse.

Those catalogs can point at relational systems such as Postgres, MySQL, or other operational stores. Once added, they behave like sources in the analytical environment. dbt can reference them directly, model them, test them, and produce the same downstream datasets the business already expects.
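
To make the idea concrete, here is a minimal sketch of one SQL runtime reading an attached source in place. It is not Altertable's API: it uses SQLite's ATTACH DATABASE from the Python standard library as a stand-in for an external catalog, and the `orders` table, the `ops_catalog` name, and the model query are all hypothetical.

```python
import os
import sqlite3
import tempfile


def create_operational_db(path):
    """Create a tiny stand-in for an operational database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount_cents) VALUES (?, ?)",
        [("acme", 1200), ("acme", 800), ("globex", 500)],
    )
    conn.commit()
    conn.close()


def record_order(path, customer, amount_cents):
    """Simulate the application writing a new row to the operational database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "INSERT INTO orders (customer, amount_cents) VALUES (?, ?)",
        (customer, amount_cents),
    )
    conn.commit()
    conn.close()


def open_runtime(ops_path):
    """Open the analytical runtime and attach the source; no rows are copied."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ? AS ops_catalog", (ops_path,))
    return conn


# A dbt-style model: plain SQL that selects straight from the attached catalog.
MODEL_SQL = """
SELECT customer, SUM(amount_cents) AS revenue_cents
FROM ops_catalog.orders
GROUP BY customer
ORDER BY customer
"""


def run_model(runtime):
    """Run the model against whatever the source holds right now."""
    return runtime.execute(MODEL_SQL).fetchall()


ops_path = os.path.join(tempfile.mkdtemp(), "ops.db")
create_operational_db(ops_path)
runtime = open_runtime(ops_path)

before = run_model(runtime)
record_order(ops_path, "globex", 700)  # the application keeps writing
after = run_model(runtime)

print(before)
print(after)
```

The second run picks up the new order without any extract job, landing step, or raw copy in between. That is the property the external-catalog approach is after, shown here at toy scale.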

The shape becomes much simpler. Operational databases become external catalogs; dbt models those catalogs directly; BI, apps, and agents consume the modeled outputs.

[Figure: operational databases flow through Altertable external catalogs into dbt models, then on to downstream BI, apps, and agents.]

The important part is not that dbt changes. It is that the plumbing around dbt disappears for that relational slice.

No object-storage landing hop just to make the data queryable. No custom extract framework whose main job is to mirror tables. No warehouse-specific access wrapper that makes source reads feel separate from modeling.

dbt stays where it belongs: in the modeling layer. The runtime underneath it becomes responsible for reaching the data.

What stays outside

Federation is not a replacement for every integration.

API workflows still exist. SaaS systems still have auth flows, rate limits, webhooks, pagination, side effects, and sync semantics. Some of those paths still belong in small jobs or tools like Airbyte, Fivetran, and dlt.

The boundary is simple:

If the source is relational and the goal is analytical transformation, query it directly and model it in dbt. If the source is an API workflow, keep a thin ingestion runner.
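
The rule is small enough to write down as code. The sketch below is only an illustration of the boundary, with hypothetical names throughout. The `NEEDS_REVIEW` branch is an added assumption for cases the rule above does not cover, such as a relational source used for something other than analytical transformation.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    RELATIONAL = "relational"
    API_WORKFLOW = "api_workflow"


class Path(Enum):
    FEDERATE = "query directly and model in dbt"
    THIN_INGESTION = "keep a thin ingestion runner"
    NEEDS_REVIEW = "not covered by the rule"  # assumption, not from the rule itself


@dataclass(frozen=True)
class Source:
    name: str
    kind: Kind
    analytical_goal: bool  # True when the goal is analytical transformation


def choose_path(source: Source) -> Path:
    """Apply the boundary: relational plus analytical goes direct, APIs stay thin."""
    if source.kind is Kind.API_WORKFLOW:
        return Path.THIN_INGESTION
    if source.kind is Kind.RELATIONAL and source.analytical_goal:
        return Path.FEDERATE
    return Path.NEEDS_REVIEW


# Hypothetical inventory of sources.
sources = [
    Source("postgres.orders", Kind.RELATIONAL, analytical_goal=True),
    Source("crm_api.contacts", Kind.API_WORKFLOW, analytical_goal=True),
]

for source in sources:
    print(f"{source.name}: {choose_path(source).value}")
```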

That boundary is useful because it is boring. It does not ask the data team to throw away its practice. It asks the team to stop treating every relational table as something that must be copied before it can be understood.

Trace the drift

The easiest way to see stack drift is to pick one important dashboard or agent-facing dataset and trace a row backwards.

How many systems does it pass through before dbt can model it? How many jobs exist only to recreate a relational table somewhere else? How many retries, landing steps, raw tables, and freshness checks sit between the operational source and the business question?
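
One way to run that exercise is to write the lineage down and walk it backwards. The sketch below assumes a simple single-parent lineage with hypothetical node names and role labels ("source", "copy", "model", "consumer"). A real stack will usually have branching lineage and would need a graph rather than a chain.

```python
from typing import Dict, List, NamedTuple, Optional


class Node(NamedTuple):
    name: str
    role: str  # hypothetical labels: "source", "copy", "model", "consumer"
    upstream: Optional[str]  # name of the single upstream node, if any


def build(nodes: List[Node]) -> Dict[str, Node]:
    """Index nodes by name for lookup during the trace."""
    return {node.name: node for node in nodes}


def trace_back(lineage: Dict[str, Node], start: str) -> List[Node]:
    """Walk from a consumer back to its source, one upstream hop at a time."""
    path: List[Node] = []
    seen = set()
    current: Optional[str] = start
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        node = lineage[current]
        path.append(node)
        current = node.upstream
    return path


def copy_hops(path: List[Node]) -> List[str]:
    """Names of the hops that exist only to recreate the data somewhere else."""
    return [node.name for node in path if node.role == "copy"]


# The copy-layer shape: extract job, landing zone, raw warehouse table.
legacy = build([
    Node("postgres.orders", "source", None),
    Node("extract_job", "copy", "postgres.orders"),
    Node("landing_bucket", "copy", "extract_job"),
    Node("warehouse.raw_orders", "copy", "landing_bucket"),
    Node("dbt.fct_orders", "model", "warehouse.raw_orders"),
    Node("revenue_dashboard", "consumer", "dbt.fct_orders"),
])

# The federated shape: dbt reads the source directly.
federated = build([
    Node("postgres.orders", "source", None),
    Node("dbt.fct_orders", "model", "postgres.orders"),
    Node("revenue_dashboard", "consumer", "dbt.fct_orders"),
])

legacy_copies = copy_hops(trace_back(legacy, "revenue_dashboard"))
federated_copies = copy_hops(trace_back(federated, "revenue_dashboard"))

print(len(legacy_copies), legacy_copies)
print(len(federated_copies), federated_copies)
```

Every name in the first list is a place where a row can be delayed or diverge before dbt sees it, and each one is a candidate for the question of whether the copy is justified.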

Some copies are justified. Historical snapshots, API workflows, compliance exports, and third-party syncs can all need their own paths.

But if the source is relational, and the copy exists only because the warehouse cannot query it directly, that copy is not business logic. It is access plumbing.

That is the layer Altertable removes. Add the operational sources as external catalogs. Let dbt model them directly. Keep the business logic and downstream contracts. Delete the jobs whose only purpose was to move rows into position.

What changes

Dashboards were already enough pressure. AI makes stack drift harder to ignore.

If analysts, applications, and agents are all going to depend on the same operational data, the foundation cannot be a maze of hidden copies and delayed landing tables. The more places data can drift, the harder it becomes to trust any answer built on top of it.

That is the work Altertable is built for: one operational lakehouse where storage, query, and business context come together for applications, analytics, and agents.

For a CTO, the point is fewer moving parts between operational reality and the systems that use it.

For a data engineer, the point is spending less time nursing copy jobs and more time modeling the business.

For a CEO, the point is fewer technically correct reasons why the company still cannot trust the answer.

That is what it means to make federation real: not another workaround for the warehouse, but a different default for relational analytics.

Connect the source. Model the business. Stop operating the copy layer.
