
A new open-source framework called PageIndex takes aim at one of the oldest problems in retrieval-augmented generation (RAG): handling very long documents.
The classic RAG workflow (chunk documents, calculate embeddings, store them in a vector database, and retrieve the top matches based on semantic similarity) works well for basic tasks such as Q&A over small documents.
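In rough code, that pipeline looks like the sketch below. The embed() function is a stand-in for any embedding model, and the chunk size and similarity metric are illustrative choices rather than anyone's recommended defaults.

```python
# Minimal chunk-and-embed retrieval sketch. embed() is a stand-in for any
# embedding model; in production the chunk vectors would be precomputed and
# stored in a vector database rather than recomputed per query.
import math
from typing import Callable, List, Tuple

def chunk(text: str, size: int = 500) -> List[str]:
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def retrieve(query: str, chunks: List[str],
             embed: Callable[[str], List[float]], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most semantically similar to the query."""
    qv = embed(query)
    scored = [(cosine(qv, embed(c)), c) for c in chunks]
    return sorted(scored, reverse=True)[:k]
```

Everything the generator sees is decided by those similarity scores alone, which is the assumption PageIndex challenges.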
But as enterprises try to move RAG into high-stakes workflows — auditing financial statements, analyzing legal contracts, navigating pharmaceutical protocols — they're hitting an accuracy barrier that chunk optimization can't solve.
PageIndex abandons the standard "chunk-and-embed" method entirely and treats document retrieval not as a search problem, but as a navigation problem.
AlphaGo for documents
Mingtian Zhang, co-founder of PageIndex, says the framework addresses this accuracy barrier by borrowing a concept from game-playing AI rather than search engines: tree search.
When humans need to find specific information in a dense textbook or a long annual report, they do not scan every paragraph linearly. They consult the table of contents to identify the relevant chapter, then the section, and finally the specific page. PageIndex forces the LLM to replicate this human behavior.
Instead of pre-calculating vectors, the framework builds a "Global Index" of the document's structure, creating a tree where nodes represent chapters, sections, and subsections. When a query arrives, the LLM performs a tree search, explicitly classifying each node as relevant or irrelevant based on the full context of the user's request.
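A minimal sketch of that navigation loop is shown below. The node fields and the judge_relevance() callback are illustrative stand-ins for the LLM call that classifies each node, not PageIndex's actual API.

```python
# Sketch of reasoning-based navigation over a document's table of contents.
# Node fields and judge_relevance() are illustrative stand-ins: in practice
# judge_relevance would wrap an LLM call that classifies a node's title and
# summary as relevant or irrelevant to the full query context.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Node:
    title: str                      # e.g. "Note 3b. Deferred Assets"
    summary: str                    # short description of the section
    pages: Tuple[int, int]          # (start_page, end_page) in the source document
    children: List["Node"] = field(default_factory=list)

def tree_search(query: str, node: Node,
                judge_relevance: Callable[[str, Node], bool]) -> List[Node]:
    """Descend only into branches the model judges relevant; return leaf sections."""
    if not judge_relevance(query, node):
        return []                   # prune the entire subtree
    if not node.children:
        return [node]               # relevant leaf: read its pages
    hits: List[Node] = []
    for child in node.children:
        hits.extend(tree_search(query, child, judge_relevance))
    return hits
```

Pruning whole branches keeps the number of LLM judgments proportional to the relevant parts of the tree rather than to the length of the document.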
"In computer science terms, a table of contents is a tree-structured representation of a document, and navigating it corresponds to tree search," Zhang said. "PageIndex applies the same core idea — tree search — to document retrieval, and can be thought of as an AlphaGo-style system for retrieval rather than for games."
This shifts the architectural paradigm from passive retrieval, where the system simply fetches matching text, to active navigation, where an agentic model decides where to look.
The limits of semantic similarity
There is a fundamental flaw in how traditional RAG handles complex data. Vector retrieval assumes that the text most semantically similar to a user’s query is also the most relevant. In professional domains, this assumption frequently breaks down.
Zhang points to financial reporting as a prime example of this failure mode. If a financial analyst asks an AI about "EBITDA" (earnings before interest, taxes, depreciation, and amortization), a standard vector database will retrieve every chunk where that acronym or a similar term appears.
"Multiple sections may mention EBITDA with similar wording, yet only one section defines the precise calculation, adjustments, or reporting scope relevant to the question," Zhang told VentureBeat. "A similarity based retriever struggles to distinguish these cases because the semantic signals are nearly indistinguishable."
This is the "intent vs. content" gap: the user does not want to find the word "EBITDA"; they want to understand the logic behind its calculation for that specific quarter.
Furthermore, traditional embeddings strip the query of its context. Because embedding models have strict input-length limits, the retrieval system usually only sees the specific question being asked, ignoring the previous turns of the conversation. This detaches the retrieval step from the user’s reasoning process. The system matches documents against a short, decontextualized query rather than the full history of the problem the user is trying to solve.
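The gap is easy to see in miniature. The snippet below only illustrates what each retriever gets to condition on; the variable names and the 512-character cutoff are hypothetical.

```python
# Illustration of the context gap. An embedding-based retriever typically
# matches against only the latest turn, truncated to the model's input limit;
# a reasoning-based navigator can condition on the whole conversation.
conversation = [
    "We're auditing Q3. Adjusted EBITDA excludes the restructuring charge, right?",
    "Where is that adjustment defined?",         # the latest turn
]

embedded_query = conversation[-1][:512]          # what the vector search sees
navigation_context = "\n".join(conversation)     # what the navigator can see
```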
Solving the multi-hop reasoning problem
The real-world impact of this structural approach is most visible in "multi-hop" queries that require the AI to follow a trail of breadcrumbs across different parts of a document.
In a recent benchmark test known as FinanceBench, a system built on PageIndex called "Mafin 2.5" achieved a state-of-the-art accuracy score of 98.7%. The performance gap between this approach and vector-based systems becomes clear when analyzing how they handle internal references.
Zhang offers the example of a query regarding the total value of deferred assets in a Federal Reserve annual report. The main section of the report describes the “change” in value but does not list the total. However, the text contains a footnote: “See Appendix G of this report … for more detailed information.”
A vector-based system typically fails here. The text in Appendix G looks nothing like the user’s query about deferred assets; it is likely just a table of numbers. Because there is no semantic match, the vector database ignores it.
The reasoning-based retriever, however, reads the cue in the main text, follows the structural link to Appendix G, locates the correct table, and returns the accurate figure.
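One way to picture that reference-following step, reusing the Node type from the earlier sketch: the regular expression and the find_node_by_title() helper are assumptions made for illustration, not part of PageIndex.

```python
# Sketch of reference following: if a retrieved section points to another part
# of the document, the navigator resolves that cue to a node and reads it too.
import re

def follow_references(section_text, root, find_node_by_title):
    """Detect 'See Appendix X' style cues and resolve them to tree nodes."""
    cited = re.findall(r"[Ss]ee (Appendix [A-Z])", section_text)
    hits = []
    for title in cited:
        node = find_node_by_title(root, title)   # walk the tree by title
        if node is not None:
            hits.append(node)                    # queue the appendix for reading
    return hits
```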
The latency trade-off and infrastructure shift
For enterprise architects, the immediate concern with an LLM-driven search process is latency. Vector lookups occur in milliseconds; having an LLM "read" a table of contents implies a significantly slower user experience.
However, Zhang explains that the perceived latency for the end-user may be negligible due to how the retrieval is integrated into the generation process. In a classic RAG setup, retrieval is a blocking step: the system must search the database before it can begin generating an answer. With PageIndex, retrieval happens inline, during the model’s reasoning process.
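In the abstract, the difference between the two integration patterns looks something like this; the callables are placeholders for a search backend and a streaming, tool-capable model call, not any specific SDK.

```python
# Two integration patterns, sketched with placeholder callables. Classic RAG
# blocks on the search before the first token; the inline pattern lets the
# model start streaming and call the navigator whenever it needs to look
# something up.

def classic_rag(query, search, generate):
    context = search(query)              # retrieval gate: nothing streams yet
    yield from generate(query, context)

def inline_retrieval(query, navigate, generate_with_tools):
    # generate_with_tools streams tokens immediately and may invoke navigate()
    # mid-answer, so time to first token stays close to a plain LLM call.
    yield from generate_with_tools(query, tools={"navigate": navigate})
```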
"The system can start streaming immediately, and retrieve as it generates," Zhang said. "That means PageIndex does not add an extra 'retrieval gate' before the first token, and Time to First Token (TTFT) is comparable to a normal LLM call."
This architectural shift also simplifies the data infrastructure. Because the approach does not rely on embeddings, enterprises no longer need to maintain a dedicated vector database. The tree-structured index is lightweight enough to sit in a traditional relational database like PostgreSQL.
This addresses a growing pain point in LLM systems with retrieval components: the complexity of keeping vector stores in sync with living documents. PageIndex separates structure indexing from text extraction. If a contract is amended or a policy updated, the system can handle small edits by re-indexing only the affected subtree rather than reprocessing the entire document corpus.
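As an illustration of how lightweight that index can be, a single self-referencing table would be enough to hold the section tree in PostgreSQL. The schema below is an assumption about one reasonable layout, not PageIndex's actual storage format.

```python
# Illustrative PostgreSQL layout for a document tree index: one table, with a
# self-referencing parent_id column holding the chapter/section hierarchy.
CREATE_NODES = """
CREATE TABLE doc_nodes (
    node_id    SERIAL PRIMARY KEY,
    doc_id     TEXT NOT NULL,
    parent_id  INTEGER REFERENCES doc_nodes(node_id) ON DELETE CASCADE,
    title      TEXT NOT NULL,
    summary    TEXT,
    page_start INTEGER,
    page_end   INTEGER
);
"""

# When one section is amended, ON DELETE CASCADE drops just that subtree; the
# application re-parses the affected section and reinserts its nodes, leaving
# the rest of the corpus untouched.
DROP_SUBTREE = "DELETE FROM doc_nodes WHERE node_id = %(changed_node)s;"
```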
A decision matrix for the enterprise
While the accuracy gains are compelling, tree-search retrieval is not a universal replacement for vector search. The technology is best viewed as a specialized tool for "deep work" rather than a catch-all for every retrieval task.
For short documents, such as emails or chat logs, the entire context often fits within a modern LLM’s context window, making any retrieval system unnecessary. Conversely, for tasks purely based on semantic discovery, such as recommending similar products or finding content with a similar "vibe," vector embeddings remain the superior choice because the goal is proximity, not reasoning.
PageIndex fits squarely in the middle: long, highly structured documents where the cost of error is high. This includes technical manuals, FDA filings, and merger agreements. In these scenarios, the requirement is auditability. An enterprise system needs to be able to explain not just the answer, but the path it took to find it (e.g., confirming that it checked Section 4.1, followed the reference to Appendix B, and synthesized the data found there).
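That audit trail can be as simple as a structured log of the nodes the navigator visited; the field names below are illustrative, with the section names taken from the example above.

```python
# A retrieval trace makes the navigation auditable: the system can report the
# path it took, not just the answer it produced.
retrieval_trace = [
    {"step": 1, "node": "Section 4.1", "action": "judged relevant and read"},
    {"step": 2, "node": "Appendix B",  "action": "followed reference from Section 4.1"},
    {"step": 3, "node": "Appendix B",  "action": "extracted and synthesized the cited data"},
]
```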
The future of agentic retrieval
The rise of frameworks like PageIndex signals a broader trend in the AI stack: the move toward "Agentic RAG." As models become more capable of planning and reasoning, the responsibility for finding data is moving from the database layer to the model layer.
We are already seeing this in the coding space, where agents like Claude Code and Cursor are moving away from simple vector lookups in favor of active codebase exploration. Zhang believes generic document retrieval will follow the same trajectory.
"Vector databases still have suitable use cases," Zhang said. "But their historical role as the default database for LLMs and AI will become less clear over time."

