Explainable Multi-Hop Question Answering for QA Assistants: Two-Hop Evidence Retrieval, Sentence-Level Supporting Facts, and Explicit Reasoning Paths

Xiaofei Luo

doi:10.51903/jtie.v5i1.504

Authors

Xiaofei Luo Information Science, University of Illinois at Urbana-Champaign, IL, US

Keywords:

Multi-Hop Question Answering, Explainable QA, Evidence Retrieval, Supporting Facts, Reasoning Paths

Abstract

Multi-hop question answering (QA) for customer-facing assistants requires not only accurate answers but also an auditable evidence trail that explains how the system arrived at each answer. We present a fully interpretable multi-hop QA pipeline that decomposes inference into three explicit modules (Retriever → Evidence Selector → Reasoner) and produces an explanation consisting of sentence-level supporting facts and an explicit two-hop evidence path. The retriever ranks candidate paragraphs by lexical, IDF-weighted token overlap; the evidence selector chooses a small set of high-scoring sentences; and the reasoner extracts a final answer using weighted candidate-phrase scoring together with deterministic rules for date/number questions and constrained yes/no comparisons. We evaluate on the complete development splits of HotpotQA (7,405 questions, distractor setting) and 2WikiMultihopQA (12,576 questions). On HotpotQA, sentence-level evidence selection improves Supporting Fact F1 from 0.334 to 0.419, and adding an explicit two-hop retrieval path further increases Supporting Fact F1 to 0.426 while raising paragraph recall@2 to 0.603. Answer F1 increases from 0.084 to 0.088. On 2WikiMultihopQA, evidence selection improves Supporting Fact F1 from 0.328 to 0.429 and Answer F1 from 0.071 to 0.075. These results quantify the contribution of explicit evidence selection and path-constrained retrieval to explainability, and they provide a practical, reproducible baseline for knowledge assistants that must justify answers with supporting facts.
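The retriever and evidence selector described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the tokenizer, the smoothed IDF formula, the way the second hop expands the query with the first-hop paragraph, and all function names are assumptions made for illustration, and the three toy paragraphs are invented. The sketch ranks paragraphs by IDF-weighted token overlap, builds a two-hop path, selects the top-scoring sentences, and scores them with a set-based Supporting Fact F1.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def idf_table(paragraphs):
    """Smoothed IDF over candidate paragraphs (assumed formula)."""
    n = len(paragraphs)
    df = Counter()
    for sentences in paragraphs.values():
        df.update(set(tokenize(" ".join(sentences))))
    return {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}

def overlap_score(query, text, idf):
    """Sum of IDF weights of the tokens shared by query and text."""
    shared = set(tokenize(query)) & set(tokenize(text))
    return sum(idf.get(t, 0.0) for t in shared)

def best_paragraph(query, paragraphs, idf, exclude=()):
    """Title of the highest-scoring paragraph not in `exclude`."""
    titles = [t for t in paragraphs if t not in exclude]
    return max(titles, key=lambda t: overlap_score(query, " ".join(paragraphs[t]), idf))

def two_hop_path(question, paragraphs, idf):
    """Hop 1 uses the question; hop 2 adds the hop-1 text (assumption)."""
    hop1 = best_paragraph(question, paragraphs, idf)
    expanded = question + " " + " ".join(paragraphs[hop1])
    hop2 = best_paragraph(expanded, paragraphs, idf, exclude=(hop1,))
    return [hop1, hop2]

def select_sentences(question, paragraphs, titles, idf, m=2):
    """Top-m (title, sentence index) pairs by overlap with the question."""
    scored = [(overlap_score(question, s, idf), title, i)
              for title in titles
              for i, s in enumerate(paragraphs[title])]
    scored.sort(key=lambda item: -item[0])  # stable: ties keep path order
    return [(title, i) for _, title, i in scored[:m]]

def supporting_fact_f1(predicted, gold):
    """Set-level F1 between predicted and gold (title, index) pairs."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Invented toy data, for illustration only.
paragraphs = {
    "Alpha Bridge": ["The Alpha Bridge was designed by Jane Doe.",
                     "It opened in 1950."],
    "Jane Doe": ["Jane Doe was an engineer born in Springfield.",
                 "She studied mathematics."],
    "Beta Tower": ["The Beta Tower is a skyscraper.",
                   "It has ninety floors."],
}
question = "Where was the designer of the Alpha Bridge born?"
gold = [("Alpha Bridge", 0), ("Jane Doe", 0)]

idf = idf_table(paragraphs)
path = two_hop_path(question, paragraphs, idf)
facts = select_sentences(question, paragraphs, path, idf, m=2)
f1_top2 = supporting_fact_f1(facts, gold)
f1_top3 = supporting_fact_f1(
    select_sentences(question, paragraphs, path, idf, m=3), gold)
print(path, facts, f1_top2, round(f1_top3, 3))
```

In this toy case the bridging entity ("Jane Doe") appears only in the first-hop paragraph, so the second hop is reachable only after the query is expanded. Selecting a third sentence adds a non-gold fact and lowers precision, which is the kind of trade-off a sentence-level Supporting Fact F1 is meant to expose.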
