MARCH 27, 2026

9 MIN READ

SYLVAIN UTARD

Grep your lakehouse


Agents do not fail because they lack SQL generation. They fail because they lack a native way to retrieve the right slice of data before writing precise queries.


One pattern keeps showing up when we use code agents: before they can do anything useful, they spend their time grepping to figure out where to look.

At first, that sounds almost trivial. Of course you should be able to search your data.

But the more we thought about it, the more it looked like a structural limitation in how data systems are designed today.

The current stack works well when the question is already well formed:

  • which metric moved?
  • which segment changed?
  • what does revenue look like by plan?

That is the world of BI, dashboards, and semantic layers.

But many high-value agent workflows do not start there. They start with a messier prompt:

  • "find customers blocked on procurement"
  • "show me everything that looks like churn risk"
  • "what are enterprise users actually complaining about in Germany?"

Those are not metric questions first. They are retrieval questions first.

The query model assumes too much

Over the years, we've optimized everything around structured querying: better schemas, better semantic layers, better metrics, better dashboards.

That progress matters. A good semantic layer makes raw data far more legible. It gives people and software a shared vocabulary for entities, metrics, and business concepts instead of forcing them to reason directly from tables and columns.

For BI-style workflows, that is often enough. If the job is "answer this KPI question correctly" or "translate this governed business question into SQL," the semantic layer is exactly the right abstraction.

But even that abstraction assumes you have a starting point. It helps once you roughly know what question to ask. It is much less helpful when the first problem is simply figuring out where the relevant signal lives.

You know the table.
You know the columns.
Or at least the business concepts that map onto them.
You know how to express the query.

That works well for humans and for text-to-SQL systems operating on known business concepts.

It breaks down quickly for agents exploring diverse operational data.

When an agent starts exploring a dataset, it often does not begin with a schema or a semantic model. It begins with intent:

  • "find conversations about GDPR"
  • "which customers asked about data residency"
  • "anything related to EU deployment issues"

There is no clear entry point into the data, only a vague idea of what might be relevant.

SQL is precise, but not forgiving

SQL is extremely powerful, but it is also very strict. It requires you to be precise from the start: exact filters, exact joins, exact structure.

That precision is what makes it reliable. But it also makes it a poor first step when you do not yet know where to look.

You do not explore a dataset with perfect queries. You explore it by approximating, iterating, and refining.

Which is exactly what search is good at.

Text-to-SQL is no longer the hard part

A year ago, many conversations about data access for AI centered on text-to-SQL: can a model translate a natural-language request into syntactically valid SQL?

By now, that capability is close to table stakes. Generating acceptable SQL from a prompt is increasingly a commodity, especially when the target schema or semantic model is already reasonably well described.

That does not solve the actual retrieval problem.

If an agent does not know which table contains the relevant text, which column carries the signal, or even which records are worth inspecting, then producing SQL is not enough. You still end up with the same failure mode: a precise query aimed at the wrong place.

For builders, this means the missing primitive is not SQL generation.

It is schema-agnostic retrieval.

What agents actually need

In practice, agents need a loop that looks more like this:

  1. Start from intent.
  2. Retrieve candidate records broadly.
  3. Inspect the results and discover where the signal lives.
  4. Refine with structured filters, joins, and aggregations.

That is not "text-to-SQL."

It is closer to search first, SQL second.

Humans already work this way informally. We scan a few rows. We run a sloppy first query. We look at field values. We tighten the query once we understand the data shape.

Agents need the same ability, but as a native system capability rather than a workaround.

That matters operationally. If the agent can do discovery inside the same engine it will later use for joins, filters, and aggregations, you avoid bolting on a second retrieval stack just to make open-ended workflows possible.

What if you could just "grep" your lakehouse?

Instead of exporting data to a separate system, or maintaining a parallel search index, we asked a simpler question: what if search was just a native capability of the lakehouse?

Not a different API. Not a different tool. Just something you can use directly in SQL.

So we started experimenting.

Making search feel like SQL

Coming from Algolia, we had a pretty clear idea of what "good search" should feel like: fast, typo-tolerant, and usable without configuration.

We combined a few building blocks:

  • DuckDB and our lakehouse infrastructure for storage and execution
  • Tantivy for full-text search and fuzzy matching
  • HNSW for vector similarity search
  • model2vec to compute embeddings with potion-base-8M and similar models

None of these pieces are new on their own. The interesting part is how they fit together inside the same system, on top of the same storage layer, without forcing data into a parallel retrieval stack.

Rather than introducing a new interface, we extended SQL with two operators:

  • @@ for full-text search
  • <=> for semantic similarity

Both also expose a virtual score column reflecting match relevance, so results can be ranked without leaving SQL. This keeps everything composable with the rest of the query engine.

Full-text search

You can search across all textual fields:

SELECT *
FROM altertable.zendesk_tickets
WHERE * @@ 'lakehouse'
ORDER BY score DESC

Or restrict it to a specific column:

SELECT ticket_id, subject, description
FROM altertable.zendesk_tickets
WHERE description @@ 'lakheouse' -- typo-tolerant search
ORDER BY score DESC

And, importantly, you can combine it with structured filters:

SELECT *
FROM altertable.zendesk_tickets
WHERE workspace_id = 'acme_eu'
AND * @@ 'lakehouse'

Search is no longer a separate system. It becomes another operator in your queries.
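Because @@ is just another predicate, it also composes with aggregation. A minimal sketch, reusing only the table and columns already shown above:

```sql
-- Count matching tickets per workspace: the search predicate
-- behaves like any other WHERE clause under GROUP BY.
SELECT workspace_id, COUNT(*) AS matching_tickets
FROM altertable.zendesk_tickets
WHERE * @@ 'lakehouse'
GROUP BY workspace_id
ORDER BY matching_tickets DESC
```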

Semantic search

On top of full-text search, we added semantic search using embeddings.

SELECT ticket_id, subject, description
FROM altertable.zendesk_tickets
WHERE * <=> 'GDPR concerns, data residency, and EU deployment'
ORDER BY score DESC

Instead of matching keywords, this retrieves results based on meaning.

You can refine it the same way:

SELECT *
FROM altertable.zendesk_tickets
WHERE workspace_id = 'acme_eu'
AND * <=> 'customer conversations about data residency requirements'
ORDER BY score DESC

At this point, the query is no longer tied to specific columns. It expresses intent directly.
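And because it is still SQL, the semantic predicate composes with joins as well. A sketch, assuming a hypothetical altertable.accounts table keyed by workspace_id with a plan column (not part of the examples above), and assuming <=> accepts a single column the way @@ does:

```sql
-- Which plans are the semantically matched tickets coming from?
-- altertable.accounts, its plan column, and per-column <=> are
-- assumptions for illustration.
SELECT a.plan, COUNT(*) AS ticket_count
FROM altertable.zendesk_tickets t
JOIN altertable.accounts a USING (workspace_id)
WHERE t.description <=> 'customer conversations about data residency requirements'
GROUP BY a.plan
ORDER BY ticket_count DESC
```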

Search first, SQL second

This is where things start to feel different.

Traditional querying is schema-first: you start from tables and columns, then narrow things down. Search inverts that flow. You start from intent, retrieve a broad set of candidates, and refine with structure. That does not replace SQL. It makes it a better second step. Once you have retrieval primitives inside the lakehouse, you can build a much better agent workflow.

For example, an agent investigating GDPR-related support issues could:

  1. use semantic search to retrieve likely relevant tickets
  2. inspect those rows to identify recurring fields like workspace_id, priority, or ticket_type
  3. run precise SQL aggregations on the narrowed subset

That can look like this:

WITH candidates AS (
  SELECT ticket_id, workspace_id, priority, created_at
  FROM altertable.zendesk_tickets
  WHERE * <=> 'GDPR, data residency, EU deployment concerns'
)
SELECT workspace_id, priority, COUNT(*) AS ticket_count
FROM candidates
GROUP BY 1, 2
ORDER BY ticket_count DESC;

The important point is not the exact syntax.

It is that retrieval and analysis happen in the same system, with the same tables, under the same query model.

Keep retrieval close to storage

If you zoom out, this problem is not new.

Over the years, we have kept adding systems for each new access pattern: Amplitude or PostHog for product analytics, Snowflake or BigQuery for BI, Pinecone or Vespa for embeddings and semantic retrieval. Each of these systems solves a real problem. But together they recreate the same pattern: data gets copied, pipelines multiply, and consistency gets harder to maintain.

The point is not that one engine should replace every tool in the stack. It is that we should be skeptical of a pattern where every new way to query data requires moving that data into a new system. If the data already lives in the lakehouse, the better question is whether we can add new retrieval modes on top of the same storage layer:

  • can BI, transformations, product workflows, and agents operate on the same underlying tables?
  • can search and SQL compose inside the same system instead of living in parallel stacks?

That is a narrower claim than "one system for everything". But it is also a more realistic one.

In that model, search is not a separate system anymore. It becomes just another way to access the same data:

  • SQL for precision
  • full-text for approximate matching
  • semantic search for intent

No synchronization. No duplication. No parallel indexing pipelines.
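These modes are not mutually exclusive, either. A sketch of all three in one query (illustrative only; how the virtual score column is defined when both operators appear is not specified above, so ranking is omitted here):

```sql
SELECT ticket_id, subject
FROM altertable.zendesk_tickets
WHERE workspace_id = 'acme_eu'                  -- SQL for precision
  AND description @@ 'residency'                -- full-text for approximate matching
  AND * <=> 'GDPR and EU deployment concerns'   -- semantic search for intent
```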

A better fit for how agents work

Agents do not start with schema knowledge. They start with intent, try a query, inspect the results, and refine from there. That is why search matters. Full-text and semantic search let an agent find relevant records before it knows the exact tables or columns. Once it finds the right slice, SQL becomes useful again for precise analysis.

This is also the larger architectural point. If search can live directly in the lakehouse, you do not need another side system just to support a new access pattern. You keep one storage layer and add another way to query it.

If your system cannot answer something like "show me anything related to GDPR issues" without knowing the schema first, then it is still missing a discovery layer. The semantic layer still matters. It makes governed analytics legible to both humans and agents. But open-ended agent workflows need another primitive alongside it: a native way to retrieve the right slice of data from intent alone, before the path through tables, columns, and metrics is obvious.

That is the shift: search is not a sidecar anymore. It becomes part of how the lakehouse is queried.


Sylvain Utard, Co-Founder & CEO at Altertable

