Budgeted Multi-Hop Retrieval Agent for Compositional Question Answering: A Retrieval-Policy Evaluation on the Official MultiHop-RAG Benchmark

Wenhao Su; Siyu Chen; Chloe Zhao

doi:10.51903/jtie.v4i3.543

Authors

Wenhao Su: Computer Science, UCSD, CA, USA
Siyu Chen: Information Management, UIUC, IL, USA
Chloe Zhao: Data Science, Columbia University, NY, USA

Keywords:

retrieval-augmented generation, multi-hop question answering, compositional retrieval, evidence recall, budgeted retrieval, query decomposition, answer exact match

Abstract

Multi-hop question answering requires a retrieval system to assemble several complementary evidence documents before an answer module can reason reliably. Single-shot retrieval is efficient, but it often misses later-hop evidence when a question combines source, time, comparison, and entity constraints. This paper evaluates a budgeted multi-hop retrieval agent for compositional question answering on the official MultiHop-RAG benchmark. The benchmark contains 2,556 queries and 609 news-article corpus documents, with the evidence for each answerable query distributed across two to four documents. Four retrieval policies are compared under the same sparse lexical scorer: fixed top-k retrieval, iterative retrieval, query decomposition, and the proposed budgeted retrieval agent. The evaluation frames the task as retrieval-policy evaluation rather than as a full free-form generative QA system: retrieval-conditioned EM/F1 are reported together with evidence recall, MRR, retrieval rounds, selected documents, and context-token cost. On the official data, the budgeted agent achieves the strongest overall retrieval-conditioned EM/F1 at 62.75% and the highest final evidence recall at 74.67%, with an average of 3.011 retrieval calls and 509.7 context tokens. Query decomposition improves over fixed top-k and iterative retrieval but is less stable across question types. Fixed top-k is the cheapest policy but retrieves incomplete evidence on longer chains. Four-hop questions remain difficult for every policy, which suggests that a controller with a fixed 620-token budget should be extended with hop-aware or dynamic budget allocation. The findings support a moderated contribution claim: explicit budget control is useful for auditable multi-hop retrieval, but it should be evaluated as a cost-accuracy trade-off rather than as a universally dominant RAG architecture.
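
To make the budget and metric terminology concrete, the following is a minimal Python sketch of how a token-budgeted selection loop, evidence recall, and reciprocal rank could be computed. It is not the authors' controller, whose details the abstract does not give. The names `Doc`, `budgeted_select`, `evidence_recall`, and `reciprocal_rank`, the greedy skip-if-too-large rule, and the toy document sizes are all assumptions made for illustration; only the 620-token figure comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical retrieved document with its context-token cost."""
    doc_id: str
    tokens: int


def budgeted_select(ranked_rounds, token_budget=620):
    """Greedily select documents across retrieval rounds under a token budget.

    ranked_rounds holds one ranked list of Doc per retrieval call.
    Returns (selected docs, retrieval calls used, tokens spent).
    This greedy rule is an assumption, not the paper's policy.
    """
    selected, seen, spent, calls = [], set(), 0, 0
    for ranked in ranked_rounds:
        if spent >= token_budget:
            break  # budget exhausted: issue no further retrieval calls
        calls += 1
        for doc in ranked:
            if doc.doc_id in seen:
                continue  # already selected in an earlier round
            if spent + doc.tokens > token_budget:
                continue  # would exceed the budget: skip this document
            selected.append(doc)
            seen.add(doc.doc_id)
            spent += doc.tokens
    return selected, calls, spent


def evidence_recall(selected_ids, gold_ids):
    """Fraction of gold evidence documents present in the selection."""
    gold = set(gold_ids)
    return len(gold & set(selected_ids)) / len(gold) if gold else 0.0


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold document, or 0.0 if none is retrieved."""
    gold = set(gold_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


# Toy example: three retrieval rounds with invented token costs.
rounds = [
    [Doc("a", 200), Doc("b", 300)],
    [Doc("b", 300), Doc("c", 150)],
    [Doc("d", 100)],
]
selected, calls, spent = budgeted_select(rounds, token_budget=620)
selected_ids = [d.doc_id for d in selected]
recall = evidence_recall(selected_ids, ["a", "c", "d"])
```

In this toy run, document `c` is skipped because it would push the context past 620 tokens, so one of the three gold documents is missed. This is the kind of cost-accuracy trade-off the abstract describes. MRR would be the mean of `reciprocal_rank` over all queries.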
