Bias and Hallucination Evaluation in LLMs
DOI:
https://doi.org/10.51903/jtie.v5i1.476
Keywords:
Bias, Hallucination, Large Language Models, Causal Inference, Retrieval-Augmented Generation
Abstract
Bias and hallucination, the most prominent failure modes of large language models (LLMs) to date, cause measurable harm in settings where factuality and fairness are paramount. Research on both has grown rapidly, but the two bodies of work have largely developed separately, and a methodological framework for jointly measuring, tracing, and reducing both under the same experimental conditions is still lacking. We provide such a framework through an empirical evaluation (not a survey) of bias propagation and hallucination generation in four illustrative domains (medical, legal, finance, and human resources), organized around three research questions: (1) how can bias and hallucination be measured simultaneously through a replicable, domain-specific protocol; (2) which techniques yield statistically meaningful and consistent improvements; and (3) how do causally informed methods compare with retrieval-based methods at reducing factual errors? We report new experiments with the GPT-4, LLaMA-2, and Falcon-7B models on the MIMIC-III, CrowS-Pairs, Yahoo Finance Q3, and XNLI-HR benchmarks, keeping prompts uniform and random seeds fixed. The methods evaluated are structural causal modeling, retrieval-augmented generation (RAG), uncertainty-aware reinforcement learning from human feedback (RLHF), and hallucination-specific fine-tuning; each was tested separately before being merged into combined frameworks. RAG reduced hallucination rates by 45%, and our causally guided active learning method reduced bias disparity by 25%; together, they substantially outperform either method alone. The result is a repeatable method for auditing bias and hallucination that can help ensure alignment with EU AI Act standards and similar requirements.
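As a reading aid for the percentages quoted above, the following minimal sketch shows one plausible way such figures could be computed. It assumes the reductions are relative to a baseline, that hallucination rate is the fraction of responses judged to contain a factual error, and that bias disparity is the distance of a stereotype-preference rate from 50%. The abstract does not state these definitions, so all three are assumptions. The `AuditCounts` type, the function names, and every count below are hypothetical and are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class AuditCounts:
    """Hypothetical tallies for one experimental condition (not the paper's data)."""
    hallucinated: int  # responses judged to contain a factual error
    total: int         # responses evaluated
    stereo_pref: int   # sentence pairs where the stereotyped variant was preferred
    pairs: int         # sentence pairs evaluated


def hallucination_rate(c: AuditCounts) -> float:
    """Fraction of evaluated responses flagged as hallucinated (assumed definition)."""
    return c.hallucinated / c.total


def bias_disparity(c: AuditCounts) -> float:
    """Distance of the stereotype-preference rate from 0.5 (assumed definition)."""
    return abs(c.stereo_pref / c.pairs - 0.5)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`; 0.45 means a 45% reduction."""
    return (before - after) / before


# Invented counts, chosen only to make the arithmetic easy to follow.
baseline = AuditCounts(hallucinated=40, total=200, stereo_pref=140, pairs=200)
mitigated = AuditCounts(hallucinated=22, total=200, stereo_pref=130, pairs=200)

halluc_drop = relative_reduction(hallucination_rate(baseline),
                                 hallucination_rate(mitigated))
bias_drop = relative_reduction(bias_disparity(baseline),
                               bias_disparity(mitigated))

print(f"hallucination rate reduction: {halluc_drop:.2f}")
print(f"bias disparity reduction: {bias_drop:.2f}")
```

Computing both metrics from the same set of condition-level tallies mirrors the abstract's emphasis on measuring bias and hallucination under identical experimental conditions; an absolute (percentage-point) reading of the reported reductions would need a different `relative_reduction` function.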
License
Copyright (c) 2026 R Sathiyaseelan, A. B. Reshma, P. Ganga

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

