LLM-Inspired Offline Reranking for Financial Search: Query Rewriting, Hybrid Retrieval, and Listwise Relevance Ranking on FiQA

Siquan Meng; Jing Chen; Isa Zheng

doi:10.51903/jtie.v5i1.537

Authors

Siquan Meng, Applied Business Analytics, Boston University, MA, USA
Jing Chen, Industrial Engineering and Operations Research, UCB, CA, USA
Isa Zheng, Information Technology, Carnegie Mellon University, PA, USA

Keywords:

Financial information retrieval, FiQA, BEIR, query rewriting, hybrid retrieval, BM25, dense retrieval, LLM reranking

Abstract

Financial search has high practical value because investors and retail users often ask natural-language questions whose wording differs from that of the relevant financial passages. This paper evaluates a multi-stage retrieval pipeline on FiQA, a financial question-answering retrieval collection in BEIR. The systems compared are BM25, Dense LSA, BM25-LSA hybrid retrieval, reciprocal-rank fusion, a compact linear reranker, fixed pointwise and listwise relevance rubrics inspired by LLM reranking, query rewriting, and the proposed pipeline that combines query rewriting, hybrid retrieval, and listwise reranking. The evaluation used the full 57,638-document FiQA corpus, 6,648 available queries, and the 648-query BEIR FiQA test qrels with 1,706 binary relevance judgments. BM25 was the best-performing system, with nDCG@10 = 0.2285, MAP = 0.1863, MRR = 0.2994, and Recall@100 = 0.5207. The proposed full pipeline underperformed BM25. The listwise rubric ranked second on nDCG@10 (0.2228) and improved over the pointwise rubric, suggesting that candidate-list normalization can be useful in this setting. Because the rubric rerankers are fixed local scoring rules, these results should be read as an evaluation of LLM-inspired ranking logic rather than as a benchmark of an actual prompt-based LLM reranker. Dense LSA retrieval alone was weak (nDCG@10 = 0.0287), which shows the limitation of a conservative non-neural dense baseline in financial semantic matching. Query rewriting reduced average effectiveness. These findings support using strong lexical baselines, gating query rewrites conservatively, and evaluating carefully before adopting prompt-based or model-based LLM rerankers in financial search.
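
As an illustration of two components named in the abstract, the following minimal Python sketch shows how reciprocal-rank fusion can merge two ranked lists and how nDCG@10 can be computed against binary relevance judgments. It is not the authors' implementation: the function names, the toy document ids, and the fusion constant k = 60 (a commonly used default) are assumptions made for this example, and the abstract does not state which settings the paper used.

```python
import math
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal-rank fusion.

    Each document receives the sum of 1 / (k + rank) over the lists in which
    it appears (rank starts at 1). k=60 is an assumed default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def ndcg_at_k(ranked, relevant, k=10):
    """nDCG@k for binary relevance: gain is 1 for a relevant document, else 0."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example with hypothetical document ids (not FiQA data).
bm25_run = ["d1", "d2", "d3"]   # stand-in for a lexical ranking
lsa_run = ["d3", "d1", "d4"]    # stand-in for a dense (LSA) ranking
fused = reciprocal_rank_fusion([bm25_run, lsa_run])
score = ndcg_at_k(fused, relevant={"d3"}, k=10)
print(fused, round(score, 4))
```

In this toy case the fused list places d1 first and d3 second, because d1 holds ranks 1 and 2 across the two lists while d3 holds ranks 3 and 1. The per-query nDCG@10 values would be averaged over all test queries to obtain a system-level figure of the kind reported above.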
