We were frustrated by vector-based RAG systems, which rely on semantic similarity and often fail on long, domain-specific documents. In such documents, the domain-specific terminology tends to be semantically similar throughout, so similarity search struggles to retrieve the exact content users need. It's also difficult to incorporate expert knowledge or user preferences effectively. So we started exploring a more reasoning-driven approach to RAG. Inspired by the tree search algorithm in AlphaGo, we built a reasoning-based RAG system that uses tree search to guide retrieval.
We open-sourced one of the key components: PageIndex (GitHub repo), a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.
Some highlights:
- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees, like a smart table of contents.
- Precise Referencing: Each node includes a summary and exact physical page numbers.
- Natural Segmentation: Nodes align with document sections, preserving context, with no arbitrary chunking.
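
To make the highlights above concrete, here is a minimal sketch of what such a tree could look like and how a retriever might walk it. Everything in it is an assumption for illustration: the `Node` fields, the `descend` function, and the sample report are hypothetical and are not PageIndex's actual schema or API. The `keyword_overlap` scorer is a toy stand-in for the LLM's relevance judgment, and the greedy descent is a much simpler technique than AlphaGo-style tree search.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One document section (hypothetical schema, not PageIndex's actual output)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section (1-based)
    end_page: int    # last physical page of the section, inclusive
    children: List["Node"] = field(default_factory=list)


def keyword_overlap(query: str, node: Node) -> float:
    """Toy stand-in for an LLM relevance judgment: number of shared words."""
    query_words = set(query.lower().split())
    node_words = set((node.title + " " + node.summary).lower().split())
    return float(len(query_words & node_words))


def descend(root: Node, query: str,
            score: Callable[[str, Node], float] = keyword_overlap) -> Node:
    """Greedy descent: at each level, step into the best-scoring child."""
    current = root
    while current.children:
        current = max(current.children, key=lambda child: score(query, child))
    return current


# Made-up sample tree; titles and page ranges are illustrative only.
report = Node("Annual Report", "Full-year results", 1, 120, [
    Node("Business Overview", "Products, markets and strategy", 1, 30),
    Node("Financial Statements", "Audited statements and notes", 31, 120, [
        Node("Balance Sheet", "Assets, liabilities and equity", 31, 45),
        Node("Cash Flow Statement",
             "Operating, investing and financing cash flow", 46, 60),
    ]),
])

hit = descend(report, "operating cash flow statements")
print(hit.title, hit.start_page, hit.end_page)  # Cash Flow Statement 46 60
```

Because each node carries its own page range, the retriever can hand back the exact pages of the chosen section instead of an arbitrary chunk. In a real system the `score` callback would presumably be replaced by a prompt that shows the model the child titles and summaries and asks which branch to explore.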
We've used PageIndex for financial document analysis with reasoning-based RAG and saw significant improvements in retrieval accuracy (98.7% on FinanceBench) compared to vector-based systems.
Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!