Narrative-Aware Scientific Claim Verification Agent with Evidence Ranking for ClimateCheck

Wenhao Su; Siyu Chen; Ethan Qian

doi:10.51903/jtie.v5i1.549

Authors

Wenhao Su, Computer Science, UCSD, CA, USA
Siyu Chen, Information Management, University of Illinois Urbana-Champaign, IL, USA
Ethan Qian, Computer Science, USC, CA, USA

DOI:

https://doi.org/10.51903/jtie.v5i1.549

Keywords:

scientific claim verification, climate misinformation, evidence ranking, narrative classification, ClimateCheck, BM25, rationale alignment, fact-checking

Abstract

Climate misinformation often combines a factual proposition with a recognizable narrative, such as denying observed warming, rejecting human causation, minimizing impacts, attacking mitigation, or casting doubt on climate science. This paper presents a lightweight narrative-aware scientific claim verification agent for the official ClimateCheck setting. The revised evaluation uses the official annotated ClimateCheck training data, the official publications corpus of 394,269 abstracts, and a claim-level validation split of the annotated data. The public ClimateCheck test file is treated as a blind claim list because its public fields do not contain verification or narrative labels. The system combines hashed BM25, TF-IDF retrieval, latent semantic analysis, narrative-family probabilities, and a logistic-regression verifier. Full-corpus retrieval shows that BM25 remains the strongest first-stage retriever, with Recall@10 = 0.466, while the narrative-aware hybrid obtains Recall@10 = 0.444. In the judged candidate reranking setting, the narrative-aware ranker obtains the highest Candidate Recall@1 = 0.789 and MAP = 0.848, compared with 0.759 and 0.843 for TF-IDF. End-to-end verification remains difficult: the BM25 top-1 pipeline reaches Macro-F1 = 0.408, while the narrative-aware pipeline reaches Macro-F1 = 0.355. Claim-level narrative evaluation no longer produces a perfect score; single-label top-family Macro-F1 is 0.422, and fine-grained multi-label CARDS-code Macro-F1 is 0.098. These results show that narrative information is useful for reranking already plausible evidence candidates, but it does not replace strong lexical retrieval and does not by itself solve claim verification.
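
The ranking metrics quoted in the abstract (Recall@10, Candidate Recall@1, MAP) can be illustrated with a minimal Python sketch. It assumes the common textbook definitions: Recall@k as the fraction of a claim's relevant abstracts found in the top k results, and MAP as the mean over claims of average precision. The paper's exact definitions may differ (for example, Candidate Recall@1 may be computed as a per-claim hit rate), and the function names and document IDs below are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results.

    Assumes `ranked` contains no duplicate document IDs.
    """
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """MAP: mean of average precision over (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: one claim with two relevant abstracts.
example_ranking = ["a", "b", "c", "d"]
example_relevant = {"a", "c"}
example_recall_at_1 = recall_at_k(example_ranking, example_relevant, k=1)
example_ap = average_precision(example_ranking, example_relevant)
```

Under these definitions, a retriever can score well on MAP over a small judged candidate set while still scoring lower on full-corpus Recall@10, which is consistent with the gap the abstract reports between the reranking and retrieval settings.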

