Databricks research shows multi-step agents consistently outperform single-turn RAG when answers span databases and documents


Data teams building AI agents keep running into the same failure mode. Questions that require joining structured data with unstructured content (sales figures alongside customer reviews, or citation counts alongside academic papers) break single-turn RAG systems.

New research from Databricks puts a number on that failure gap. The company's AI research team tested a multi-step agentic approach against state-of-the-art single-turn RAG baselines across nine enterprise knowledge tasks, reporting gains of 20% or more on Stanford's STaRK benchmark suite and consistent improvement across Databricks' own KARLBench evaluation framework. The results make the case that the performance gap between single-turn RAG and multi-step agents on hybrid data tasks is an architectural problem, not a model quality problem.

The work builds on Databricks' earlier instructed retriever research, which showed retrieval improvements on unstructured data using metadata-aware queries. This latest research adds structured data sources such as relational tables and SQL warehouses into the same reasoning loop, addressing the class of questions enterprises most commonly fail to answer with current agent architectures.

"RAG works, but it doesn't scale," Michael Bendersky, research director at Databricks, told VentureBeat. "If you want to make your agent even better, and you want to understand why you have declining sales, now you have to help the agent see the tables and look at the sales data. Your RAG pipeline will become incompetent at that task."

Single-turn retrieval cannot encode structural constraints

The core finding is that standard RAG systems fail when a query mixes a precise structured filter with an open-ended semantic search. 

Consider a question like "Which of our products have had declining sales over the past three months, and what potentially related issues are brought up in customer reviews on various seller sites?" The sales data lives in a warehouse. The review sentiment lives in unstructured documents across seller sites. A single-turn RAG system cannot split that query, route each half to the right data source and combine the results.
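For illustration, the split a single-turn system cannot make might look roughly like this. The table name, columns, and SQL dialect are invented for this example and are not taken from Databricks' system; the semantic query is just a paraphrase of the review half of the question.

```python
# Hypothetical decomposition of the example question into two sub-queries:
# a structured query for the warehouse and a natural-language query for the
# unstructured review index. Schema and dialect are illustrative only.
structured_half = """
SELECT product_id,
       DATE_TRUNC('month', sold_at) AS month,
       SUM(revenue)                 AS monthly_revenue
FROM sales
WHERE sold_at >= DATEADD(month, -3, CURRENT_DATE())
GROUP BY product_id, DATE_TRUNC('month', sold_at)
"""

semantic_half = (
    "customer review complaints about defects, shipping problems, or quality "
    "issues for product {product_name}"   # run once per declining product
)
```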

To confirm this is an architecture problem rather than a model quality problem, Databricks reran published STaRK baselines using a current state-of-the-art foundation model. The stronger model still lost to the multi-step agent by 21% on the academic domain and 38% on the biomedical domain. 

STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph and a biomedical knowledge base. 

How the Supervisor Agent handles what RAG cannot

Databricks built the Supervisor Agent as the production implementation of this research approach, and its architecture illustrates why the gains are consistent across task types. The approach includes three core steps:

Parallel tool decomposition. Rather than issuing one broad query and hoping the results cover both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data type boundaries without requiring the data to be normalized first.

Self-correction. When an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first queries both SQL and vector search in parallel. When the two result sets show no overlap, it adapts and issues a SQL JOIN across both constraints, then calls the vector search system to verify the result before returning the answer.

Declarative configuration. The agent is not tuned to any specific dataset or task. Connecting it to a new data source means writing a plain-language description of what that source contains and what kinds of questions it should answer. No custom code is required.

"The agent can do things like decomposing the question into a SQL query and a search query out of the box," Bendersky said. "It can combine the results of SQL and RAG, reason about those results, make follow-up queries and then reason about whether the final answer was actually found."

It's not just about hybrid retrieval

Being able to source information from both structured and unstructured data isn't an entirely new concept.

LlamaIndex, LangChain and Microsoft Fabric agents all offer some form of hybrid retrieval. Bendersky argues the difference lies in how the Databricks approach frames the problem architecturally.

"We almost don't see it as a hybrid retrieval where you combine embeddings and search results, or embeddings and tables," he said. "We see this more as an agent that has access to multiple tools."

The practical consequence of that framing is that adding a new data source means connecting it to the agent and writing a description of what it contains. The agent handles routing and orchestration without additional code. 
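What that configuration might look like in practice is sketched below as plain Python data. The field names and wording are invented for illustration; the point is that each source is described in plain language rather than wired in with custom retrieval code.

```python
# Hypothetical declarative descriptions of two data sources. Field names are
# assumptions for this sketch, not Databricks' actual configuration schema.
data_sources = [
    {
        "name": "sales_warehouse",
        "kind": "sql",
        "description": (
            "Monthly sales figures per product and region, updated nightly. "
            "Use for questions about revenue, units sold, and sales trends."
        ),
    },
    {
        "name": "customer_reviews",
        "kind": "vector_index",
        "description": (
            "Unstructured customer reviews collected from seller sites. "
            "Use for questions about complaints, sentiment, and product issues."
        ),
    },
]
```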

Custom RAG pipelines require data to be converted into a format the retrieval system can read, typically text chunks with embeddings. SQL tables have to be flattened and JSON has to be normalized. Every new data source added to the pipeline means more conversion work. Databricks' research argues that as enterprise data grows to include more source types, that burden makes custom pipelines increasingly impractical compared to an agent that queries each source in its native format.
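By contrast with the declarative setup above, a custom pipeline's per-source conversion work looks something like the following sketch, repeated for every new source. The flattening logic and the commented-out embedding call are placeholders, not any specific library's API.

```python
# Hypothetical flattening step a custom RAG pipeline needs before indexing:
# every structured row becomes a text chunk, then an embedding.
def row_to_chunk(row: dict) -> str:
    return ", ".join(f"{col}: {val}" for col, val in row.items())

rows = [{"product": "Widget A", "month": "2024-06", "revenue": 12400}]
chunks = [row_to_chunk(r) for r in rows]
# vectors = [embed(c) for c in chunks]   # embed() is whatever model the pipeline uses
```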

"Just bring the agent to the data," Bendersky said. "You basically give the agent more sources, and it will learn to use them pretty well."

What this means for enterprises

For data engineers evaluating whether to build custom RAG pipelines or adopt a declarative agent framework, the research offers a clear direction: if the task involves questions that span structured and unstructured data, building custom retrieval is the harder path. The research found that across all tested tasks, the only things that differed between deployments were instructions and tool descriptions. The agent handled the rest.

The practical limits are real but manageable. The approach works well with five to ten data sources. Adding too many at once, without curating which sources are complementary rather than contradictory, makes the agent slower and less reliable. Bendersky recommends scaling incrementally and verifying results at each step rather than connecting all available data upfront.

Data accuracy is a prerequisite. The agent can query across mismatched formats (JSON review feeds alongside SQL sales tables) without requiring normalization, but it cannot fix source data that is factually wrong. Adding a plain-language description of each data source at ingestion time helps the agent route queries correctly from the start.

The research positions this as an early step in a longer trajectory. As enterprise AI workloads mature, agents will be expected to reason across dozens of source types, including dashboards, code repositories and external data feeds. The research argues the declarative approach is what makes that scaling tractable, because adding a new source stays a configuration problem rather than an engineering one.

"This is kind of like a ladder," Bendersky said. "The agent will slowly get more and more information and then slowly improve overall." 


