Bias and Hallucination Evaluation in LLMs

DOI:

https://doi.org/10.51903/jtie.v5i1.476

Keywords:

Bias, Hallucination, Large Language Models, Causal Inference, Retrieval-Augmented Generation

Abstract

Bias and hallucination are among the most consequential failure modes of large language models (LLMs) to date, and both cause measurable harm in settings where factuality and fairness are paramount. Research on each has grown significantly, but the two literatures remain largely separate, and no methodological framework yet exists for jointly measuring, tracing, and reducing both under the same experimental conditions. We provide such a framework through an empirical evaluation (not a survey) of bias propagation and hallucination generation in four illustrative domains: medical, legal, finance, and human resources. The framework addresses three research questions: (1) how can bias and hallucination be measured simultaneously through a replicable, domain-specific protocol; (2) which techniques yield statistically meaningful and consistent improvements; and (3) how do causally informed methods compare with retrieval-based methods at reducing factual errors? We report new experiments with GPT-4, LLaMA-2, and Falcon-7B on the MIMIC-III, CrowS-Pairs, Yahoo Finance Q3, and XNLI-HR benchmarks, holding prompts uniform and random seeds fixed. The methods evaluated are structural causal modeling, retrieval-augmented generation (RAG), uncertainty-aware RLHF, and hallucination-specific fine-tuning; each is tested separately before being merged into combined frameworks. RAG reduced hallucination rates by 45%, and our causally guided active learning method reduced bias disparity by 25%; combined, the two substantially outperform either method alone. The result is a repeatable method for auditing bias and hallucination that can support alignment with the EU AI Act and similar regulatory requirements.
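
The headline results above are stated as percentage reductions. As a minimal sketch, assuming those figures are relative reductions against a baseline rate (the abstract does not say so explicitly), the arithmetic could look like the following. The function names and the counts are hypothetical and chosen only for illustration; they are not the paper's data or code.

```python
from typing import Sequence


def hallucination_rate(flags: Sequence[bool]) -> float:
    """Share of responses flagged as hallucinated (True = hallucinated)."""
    if not flags:
        raise ValueError("flags must not be empty")
    return sum(flags) / len(flags)


def relative_reduction(baseline_rate: float, treated_rate: float) -> float:
    """Fractional drop of treated_rate relative to baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - treated_rate) / baseline_rate


# Hypothetical counts, NOT results from the paper:
# 40 of 200 baseline answers flagged, 25 of 200 flagged after mitigation.
baseline = hallucination_rate([True] * 40 + [False] * 160)
mitigated = hallucination_rate([True] * 25 + [False] * 175)

print(f"baseline rate:      {baseline:.3f}")
print(f"mitigated rate:     {mitigated:.3f}")
print(f"relative reduction: {relative_reduction(baseline, mitigated):.1%}")
```

With these made-up counts the rate falls from 0.200 to 0.125, a 37.5% relative reduction but only a 7.5 percentage-point absolute one, which is why a reported reduction is easier to interpret when the baseline rate is stated alongside it.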

Published

2026-04-30

How to Cite

Bias and Hallucination Evaluation in LLMs. (2026). Journal of Technology Informatics and Engineering, 5(1), 271-287. https://doi.org/10.51903/jtie.v5i1.476