Narrative-Aware Scientific Claim Verification Agent with Evidence Ranking for ClimateCheck
DOI:
https://doi.org/10.51903/jtie.v5i1.549
Keywords:
scientific claim verification, climate misinformation, evidence ranking, narrative classification, ClimateCheck, BM25, rationale alignment, fact-checking
Abstract
Climate misinformation often combines a factual proposition with a recognizable narrative, such as denying observed warming, rejecting human causation, minimizing impacts, attacking mitigation, or casting doubt on climate science. This paper presents a lightweight narrative-aware scientific claim verification agent for the official ClimateCheck setting. The revised evaluation uses the official annotated ClimateCheck training data, the official publications corpus of 394,269 abstracts, and a claim-level validation split of the annotated data. The public ClimateCheck test file is treated as a blind claim list because its public fields do not contain verification or narrative labels. The system combines hashed BM25, TF-IDF retrieval, latent semantic analysis, narrative-family probabilities, and a logistic-regression verifier. Full-corpus retrieval shows that BM25 remains the strongest first-stage retriever, with Recall@10 = 0.466, while the narrative-aware hybrid obtains Recall@10 = 0.444. In the judged candidate reranking setting, the narrative-aware ranker obtains the highest Candidate Recall@1 = 0.789 and MAP = 0.848, compared with 0.759 and 0.843 for TF-IDF. End-to-end verification remains difficult: the BM25 top-1 pipeline reaches Macro-F1 = 0.408, while the narrative-aware pipeline reaches Macro-F1 = 0.355. Claim-level narrative evaluation no longer produces a perfect score; single-label top-family Macro-F1 is 0.422, and fine-grained multi-label CARDS-code Macro-F1 is 0.098. These results show that narrative information is useful for reranking already plausible evidence candidates, but it does not replace strong lexical retrieval and does not by itself solve claim verification.
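The ranking figures quoted in the abstract (Recall@10, Candidate Recall@1, MAP) can be illustrated with a minimal Python sketch. It assumes the common textbook definitions: Recall@k as the fraction of a claim's relevant abstracts found in the top k results, and MAP as the mean over claims of average precision. The abstract does not spell out the exact metric variants used, so the function names, the toy claim IDs, and the toy document IDs below are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision values taken at each rank where a relevant document occurs.

    Assumes document IDs in `ranked` are unique.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean(values: List[float]) -> float:
    """Arithmetic mean, returning 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy data: claim id -> (ranked document ids, relevant document ids).
toy_runs: Dict[str, Tuple[List[str], Set[str]]] = {
    "claim_1": (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    "claim_2": (["d5", "d4", "d9"], {"d5"}),
}

mean_recall_at_2 = mean([recall_at_k(r, rel, 2) for r, rel in toy_runs.values()])
mean_ap = mean([average_precision(r, rel) for r, rel in toy_runs.values()])

print(f"Recall@2 = {mean_recall_at_2:.3f}, MAP = {mean_ap:.3f}")
```

Under these definitions, a retriever can score well on Recall@10 while a reranker scores better on Recall@1 and MAP, because the first measures coverage within a cutoff and the latter two reward placing relevant abstracts at the very top. This is consistent with the abstract's distinction between first-stage retrieval and reranking of judged candidates.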
License
Copyright (c) 2026 Wenhao Su, Siyu Chen, Ethan Qian

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

