RAG without Vectors – PageIndex: Reasoning-Based Document Indexing #18360

rejojer · 2025-04-03T07:02:59Z


We were frustrated by vector-based RAG systems that rely on semantic similarity and often fail on long, domain-specific documents. In such documents, much of the terminology is semantically similar from one section to the next, so similarity search struggles to retrieve the exact content users need. It’s also difficult to incorporate expert knowledge or user preferences into similarity-based retrieval. So we started exploring a more reasoning-driven approach to RAG. Inspired by the tree search algorithm in AlphaGo, we built a reasoning-based RAG system that uses tree search to guide retrieval.

We open-sourced one of the key components: PageIndex (GitHub repo), a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.

Some highlights:

Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees — like a smart table of contents.
Precise Referencing: Each node includes a summary and exact physical page numbers.
Natural Segmentation: Nodes align with document sections, preserving context — no arbitrary chunking.
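
To make the highlights above concrete, here is a minimal sketch of what such a tree and a search over it might look like. Everything in it is an assumption for illustration: the `Node` fields, the `keyword_score` function (a toy stand-in for LLM reasoning), and the greedy descent are hypothetical and are not PageIndex's actual schema, API, or search procedure, which this post does not specify.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Node:
    """One document section (hypothetical schema, not PageIndex's real format)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section (inclusive)
    end_page: int    # last physical page of the section (inclusive)
    children: List["Node"] = field(default_factory=list)


def tokens(text: str) -> Set[str]:
    """Lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(query: str, node: Node) -> float:
    """Toy stand-in for LLM reasoning: query words shared with title + summary."""
    return float(len(tokens(query) & tokens(f"{node.title} {node.summary}")))


def search(root: Node, query: str,
           score: Callable[[str, Node], float] = keyword_score) -> Node:
    """Greedy descent: follow the best-scoring child at each level until a leaf."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child))
    return node


# A tiny hand-built tree that mirrors a report's table of contents.
report = Node("Annual Report", "Full-year financial report", 1, 120, [
    Node("Business Overview", "Products, markets, and strategy", 1, 30),
    Node("Financial Statements",
         "Audited income statement, balance sheet, and notes", 31, 120, [
             Node("Income Statement", "Revenue, expenses, and net income", 31, 45),
             Node("Balance Sheet", "Assets, liabilities, and equity", 46, 60),
         ]),
])

hit = search(report, "net income")
print(hit.title, hit.start_page, hit.end_page)
```

In a real system, the `score` callable would presumably be replaced by an LLM judging each child's summary against the question, and the greedy walk by a search that can explore more than one branch. The returned node's page range is what lets the answer cite exact pages.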

We've used PageIndex for financial document analysis with reasoning-based RAG and have seen significant improvements in retrieval accuracy (98.7% on FinanceBench) compared to vector-based systems.

Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!
