One pattern keeps showing up when we use code agents:
before they can do anything useful, they spend their time grepping to figure out where to look.
At first, that sounds almost trivial. Of course you should be able to search your data.
But the more we thought about it, the more it looked like a structural limitation in how data systems are designed today.
The current stack works well when the question is already well formed:
- which metric moved?
- which segment changed?
- what does revenue look like by plan?
That is the world of BI, dashboards, and semantic layers.
But many high-value agent workflows do not start there. They start with a messier prompt:
- "find customers blocked on procurement"
- "show me everything that looks like churn risk"
- "what are enterprise users actually complaining about in Germany?"
Those are not metric questions first. They are retrieval questions first.
The query model assumes too much
Over the years, we've optimized everything around structured querying: better schemas, semantic layers, better metrics, better dashboards.
That progress matters. A good semantic layer makes raw data far more legible. It gives people and software a shared vocabulary for entities, metrics, and business concepts instead of forcing them to reason directly from tables and columns.
For BI-style workflows, that is often enough. If the job is "answer this KPI question correctly" or "translate this governed business question into SQL," the semantic layer is exactly the right abstraction.
But even that abstraction assumes you have a starting point. It helps once you roughly know what question to ask. It is much less helpful when the first problem is simply figuring out where the relevant signal lives.
- You know the table.
- You know the columns, or at least the business concepts that map onto them.
- You know how to express the query.
That works well for humans and for text-to-SQL systems operating on known business concepts.
It breaks down quickly for agents exploring diverse operational data.
When an agent starts exploring a dataset, it often does not begin with a schema or a semantic model. It begins with intent:
- "find conversations about GDPR"
- "which customers asked about data residency"
- "anything related to EU deployment issues"
There is no clear entry point into the data, only a vague idea of what might be relevant.
SQL is precise, but not forgiving
SQL is extremely powerful, but it is also very strict. It requires you to be precise from the start: exact filters, exact joins, exact structure.
That precision is what makes it reliable. But it also makes it a poor first step when you do not yet know where to look.
You do not explore a dataset with perfect queries. You explore it by approximating, iterating, and refining.
Which is exactly what search is good at.
Text-to-SQL is no longer the hard part
A year ago, many conversations about data access for AI centered on text-to-SQL: can a model translate a natural-language request into syntactically valid SQL?
By now, that capability is close to table stakes. Generating acceptable SQL from a prompt is increasingly a commodity, especially when the target schema or semantic model is already reasonably well described.
That does not solve the actual retrieval problem.
If an agent does not know which table contains the relevant text, which column carries the signal, or even which records are worth inspecting, then producing SQL is not enough. You still end up with the same failure mode: a precise query aimed at the wrong place.
For builders, this means the missing primitive is not SQL generation.
It is schema-agnostic retrieval.
What agents actually need
In practice, agents need a loop that looks more like this:
- Start from intent.
- Retrieve candidate records broadly.
- Inspect the results and discover where the signal lives.
- Refine with structured filters, joins, and aggregations.
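The four steps above can be sketched as a toy loop in pure Python over an in-memory dataset. The records and the `retrieve` helper are made up for illustration; a real agent would issue these steps as queries against the engine itself.

```python
# Toy sketch of the agent loop: broad retrieval first, structured
# refinement second. All records and helper names here are fabricated
# for illustration; a real agent would query the engine directly.
from collections import Counter

tickets = [
    {"id": 1, "workspace_id": "acme_eu", "priority": "high",
     "text": "Customer blocked on procurement approval for GDPR review"},
    {"id": 2, "workspace_id": "acme_us", "priority": "low",
     "text": "Question about dashboard colors"},
    {"id": 3, "workspace_id": "acme_eu", "priority": "high",
     "text": "GDPR data residency concerns for EU deployment"},
]

def retrieve(records, terms):
    """Step 2: broad, forgiving retrieval -- any term match qualifies."""
    terms = [t.lower() for t in terms]
    return [r for r in records if any(t in r["text"].lower() for t in terms)]

# Step 1: start from intent, expressed as loose search terms.
candidates = retrieve(tickets, ["gdpr", "data residency"])

# Step 3: inspect the candidates to discover which fields carry signal.
fields_seen = sorted(candidates[0].keys())

# Step 4: refine with structure -- here, a filter plus an aggregation.
by_workspace = Counter(
    r["workspace_id"] for r in candidates if r["priority"] == "high"
)
print(by_workspace)
```

The point of the sketch is the ordering: retrieval narrows the dataset before any precise filtering or aggregation is attempted.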
That is not "text-to-SQL."
It is closer to search first, SQL second.
Humans already work this way informally. We scan a few rows. We run a sloppy first query. We look at field values. We tighten the query once we understand the data shape.
Agents need the same ability, but as a native system capability rather than a workaround.
That matters operationally. If the agent can do discovery inside the same engine it will later use for joins, filters, and aggregations, you avoid bolting on a second retrieval stack just to make open-ended workflows possible.
What if you could just "grep" your lakehouse?
Instead of exporting data to a separate system, or maintaining a parallel search index, we asked a simpler question: what if search was just a native capability of the lakehouse?
Not a different API. Not a different tool. Just something you can use directly in SQL.
So we started experimenting.
Making search feel like SQL
Coming from Algolia, we had a pretty clear idea of what "good search" should feel like: fast, typo-tolerant, and usable without configuration.
We combined a few building blocks:
- DuckDB and our lakehouse infrastructure for storage and execution
- Tantivy for full-text search and fuzzy matching
- HNSW for vector similarity search
- model2vec to compute embeddings, with potion-base-8M and similar models
None of these pieces are new on their own. The interesting part is how they fit together inside the same system, on top of the same storage layer, without forcing data into a parallel retrieval stack.
Rather than introducing a new interface, we extended SQL with two operators:
- @@ for full-text search
- <=> for semantic similarity
Both also expose a virtual score column reflecting textual relevance, so results can be ranked without leaving SQL. This keeps everything composable with the rest of the query engine.
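The actual score comes from the underlying search index (Tantivy implements BM25-style ranking). Purely to illustrate what "ranking by textual relevance" means, here is a toy term-frequency scorer in Python; this is deliberately minimal and is not the engine's scoring function.

```python
# Toy relevance scoring: count query-term occurrences per document and
# rank by that count, mimicking what "ORDER BY score DESC" expresses.
# Real engines use BM25-style scoring; this is deliberately minimal.
def score(doc, query_terms):
    words = doc.lower().split()
    return sum(words.count(t.lower()) for t in query_terms)

docs = [
    "lakehouse ingestion failed again",
    "billing question",
    "lakehouse search inside the lakehouse",
]

ranked = sorted(docs, key=lambda d: score(d, ["lakehouse"]), reverse=True)
print(ranked[0])  # the doc mentioning "lakehouse" twice ranks first
```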
Full-text search
You can search across all textual fields:
SELECT *
FROM altertable.zendesk_tickets
WHERE * @@ 'lakehouse'
ORDER BY score DESC
Or restrict it to a specific column:
SELECT ticket_id, subject, description
FROM altertable.zendesk_tickets
WHERE description @@ 'lakheouse' -- typo-tolerant search
ORDER BY score DESC
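The typo tolerance in that query is classic fuzzy matching: a term within a small edit distance of an indexed term still matches. A minimal Levenshtein distance in Python shows why 'lakheouse' is close enough to 'lakehouse'; this sketches the concept, not Tantivy's implementation.

```python
# Levenshtein edit distance via dynamic programming: the minimum number
# of single-character insertions, deletions, or substitutions needed to
# turn one string into the other.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete from a
                curr[j - 1] + 1,           # insert into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("lakheouse", "lakehouse"))  # 2: the transposed "he" costs two edits
```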
And, importantly, you can combine it with structured filters:
SELECT *
FROM altertable.zendesk_tickets
WHERE workspace_id = 'acme_eu'
  AND * @@ 'lakehouse'
Search is no longer a separate system. It becomes another operator in your queries.
Semantic search
On top of full-text search, we added semantic search using embeddings.
SELECT ticket_id, subject, description
FROM altertable.zendesk_tickets
WHERE * <=> 'GDPR concerns, data residency, and EU deployment'
ORDER BY score DESC
Instead of matching keywords, this retrieves results based on meaning.
You can refine it the same way:
SELECT *
FROM altertable.zendesk_tickets
WHERE workspace_id = 'acme_eu'
  AND * <=> 'customer conversations about data residency requirements'
ORDER BY score DESC
At this point, the query is no longer tied to specific columns. It expresses intent directly.
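Behind the <=> operator, both the query text and the rows are embedded as vectors, and rows are ranked by vector similarity. With hand-made 3-dimensional vectors standing in for real model2vec embeddings, the core operation is just cosine similarity:

```python
# Cosine similarity between vectors: the core of semantic retrieval.
# The embeddings below are fabricated 3-d toys; real systems use
# model-produced vectors with hundreds of dimensions.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Pretend these embed: a GDPR ticket, a billing ticket, and the query.
gdpr_ticket = [0.9, 0.1, 0.0]
billing_ticket = [0.0, 0.2, 0.9]
query = [0.8, 0.2, 0.1]

# Semantic search = rank rows by similarity to the query embedding.
assert cosine(query, gdpr_ticket) > cosine(query, billing_ticket)
```

An index structure like HNSW exists precisely so this comparison does not have to be computed against every row.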
Search first, SQL second
This is where things start to feel different.
Traditional querying is schema-first: you start from tables and columns, then narrow things down. Search inverts that flow. You start from intent, retrieve a broad set of candidates, and refine with structure. That does not replace SQL. It makes it a better second step. Once you have retrieval primitives inside the lakehouse, you can build a much better agent workflow.
For example, an agent investigating GDPR-related support issues could:
- use semantic search to retrieve likely relevant tickets
- inspect those rows to identify recurring fields like workspace_id, priority, or ticket_type
- run precise SQL aggregations on the narrowed subset
That can look like this:
WITH candidates AS (
  SELECT ticket_id, workspace_id, priority, created_at
  FROM altertable.zendesk_tickets
  WHERE * <=> 'GDPR, data residency, EU deployment concerns'
)
SELECT workspace_id, priority, COUNT(*) AS ticket_count
FROM candidates
GROUP BY 1, 2
ORDER BY ticket_count DESC;
The important point is not the exact syntax.
It is that retrieval and analysis happen in the same system, with the same tables, under the same query model.
Keep retrieval close to storage
If you zoom out, this problem is not new.
Over the years, we have kept adding systems for each new access pattern: Amplitude or PostHog for product analytics, Snowflake or BigQuery for BI, Pinecone or Vespa for embeddings and semantic retrieval. Each of these systems solves a real problem. But together they recreate the same pattern: data gets copied, pipelines multiply, and consistency gets harder to maintain.
The point is not that one engine should replace every tool in the stack. It is that we should be skeptical of a pattern where every new way to query data requires moving that data into a new system. If the data already lives in the lakehouse, the better question is whether we can add new retrieval modes on top of the same storage layer:
- can BI, transformations, product workflows, and agents operate on the same underlying tables?
- can search and SQL compose inside the same system instead of living in parallel stacks?
That is a narrower claim than "one system for everything". But it is also a more realistic one.
In that model, search is not a separate system anymore. It becomes just another way to access the same data:
- SQL for precision
- full-text for approximate matching
- semantic search for intent
No synchronization. No duplication. No parallel indexing pipelines.
A better fit for how agents work
Agents do not start with schema knowledge. They start with intent, try a query, inspect the results, and refine from there. That is why search matters. Full-text and semantic search let an agent find relevant records before it knows the exact tables or columns. Once it finds the right slice, SQL becomes useful again for precise analysis.
This is also the larger architectural point. If search can live directly in the lakehouse, you do not need another side system just to support a new access pattern. You keep one storage layer and add another way to query it.
If your system cannot answer something like "show me anything related to GDPR issues" without knowing the schema first, then it is still missing a discovery layer. The semantic layer still matters. It makes governed analytics legible to both humans and agents. But open-ended agent workflows need another primitive alongside it: a native way to retrieve the right slice of data from intent alone, before the path through tables, columns, and metrics is obvious.
That is the shift: search is not a sidecar anymore. It becomes part of how the lakehouse is queried.