Natural-Language Policy Reasoning with Proof Generation: Turning Platform Rules into Verifiable Knowledge

Xiaofei Luo

doi:10.51903/jtie.v4i2.505

Authors

Xiaofei Luo Information Science, University of Illinois at Urbana-Champaign, IL, US

Keywords:

Policy Reasoning, Rule-Based Inference, Natural Language Rules, Proof Generation, Explainability

Abstract

Policy and compliance systems increasingly express rules in natural language, yet enforcement requires deterministic decisions and auditable explanations. This paper studies a practical pipeline that converts natural-language facts and rules into a verifiable knowledge base, answers queries with three-valued semantics (True/False/Unknown), and produces machine-checkable proofs. The contribution is system-level rather than a new reasoning formalism: we integrate controlled-language parsing, symbolic proof extraction, independent proof checking, and proof-based supervision in a single auditable framework. We evaluate the pipeline on two natural-language rule-reasoning benchmarks: (i) a balanced subset of ProofWriter’s open-world-assumption tasks (360 train, 360 test), and (ii) a RuleTaker-style dataset generated from its grammar and label semantics (1800 train, 900 test), both balanced across reasoning depths 0–5. We compare four systems: a text-only logistic regression baseline (LR-text), a retrieval-based “proof” baseline, a symbolic forward-chaining reasoner with proof extraction, and a classifier trained on the generated proofs (LR-proof). To ensure fairness, LR-text and LR-proof share the same TF-IDF/logistic-regression setup, and the retrieval baseline uses the same representation with a fixed top-4 configuration. On ProofWriter-Balanced, the symbolic reasoner achieves 0.803 accuracy (0.808 macro-F1), while the proof-trained classifier reaches 0.825 accuracy (0.825 macro-F1). On RuleTaker-Rep, both the symbolic reasoner and the proof-trained classifier achieve 1.000 accuracy. Proof verifiability clearly separates faithful from post-hoc explanations: symbolic proofs are verifiable for all predictions, whereas retrieval-based proofs are verifiable for only 31.4% of predictions. Sensitivity analyses varying reasoning depth, distractors, and proof corruption show that proof-based methods remain robust to noise but depend on proof integrity. These findings demonstrate the feasibility of auditable natural-language policy reasoning in controlled settings, while highlighting limitations in parser coverage and benchmark regularity.
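
To make the symbolic part of the pipeline concrete, the following is a minimal sketch of forward chaining with proof extraction, three-valued answering, and an independent proof checker. It is an illustration under stated assumptions, not the paper's implementation: rules are already parsed and ground (no variables), literals are (atom, polarity) pairs, and every name here (forward_chain, extract_proof, answer, check_proof, the bob example) is hypothetical.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]                        # (atom, polarity), e.g. ("kind(bob)", True)
Rule = Tuple[Tuple[Literal, ...], Literal]        # (premises, conclusion); ground rules only (assumption)
Step = Tuple[Literal, int, Tuple[Literal, ...]]   # (derived literal, rule index or -1 for a fact, premises)


def forward_chain(facts: List[Literal], rules: List[Rule]) -> Dict[Literal, Step]:
    """Apply rules to a fixed point, recording how each literal was first derived."""
    derived: Dict[Literal, Step] = {f: (f, -1, ()) for f in facts}
    changed = True
    while changed:
        changed = False
        for idx, (premises, conclusion) in enumerate(rules):
            if conclusion not in derived and all(p in derived for p in premises):
                derived[conclusion] = (conclusion, idx, tuple(premises))
                changed = True
    return derived


def extract_proof(goal: Literal, derived: Dict[Literal, Step]) -> List[Step]:
    """Collect only the steps that support the goal, premises before conclusions."""
    proof: List[Step] = []
    seen = set()

    def visit(literal: Literal) -> None:
        if literal in seen:
            return
        seen.add(literal)
        for premise in derived[literal][2]:
            visit(premise)
        proof.append(derived[literal])

    visit(goal)
    return proof


def answer(atom: str, facts: List[Literal], rules: List[Rule]) -> Tuple[str, List[Step]]:
    """Open-world answer: True or False only when a proof exists, otherwise Unknown."""
    derived = forward_chain(facts, rules)
    for polarity, label in ((True, "True"), (False, "False")):
        if (atom, polarity) in derived:
            return label, extract_proof((atom, polarity), derived)
    return "Unknown", []


def check_proof(goal: Literal, proof: List[Step], facts: List[Literal], rules: List[Rule]) -> bool:
    """Replay a proof step by step without calling the reasoner."""
    if not proof or proof[-1][0] != goal:
        return False
    fact_set, established = set(facts), set()
    for literal, rule_idx, premises in proof:
        if rule_idx == -1:
            if literal not in fact_set:
                return False
        else:
            if not 0 <= rule_idx < len(rules):
                return False
            rule_premises, rule_conclusion = rules[rule_idx]
            if rule_conclusion != literal or tuple(rule_premises) != tuple(premises):
                return False
            if not all(p in established for p in premises):
                return False
        established.add(literal)
    return True


# Hypothetical toy knowledge base: "Bob is kind. Kind -> nice. Nice -> not rough."
facts = [("kind(bob)", True)]
rules = [
    ((("kind(bob)", True),), ("nice(bob)", True)),
    ((("nice(bob)", True),), ("rough(bob)", False)),
]

label, proof = answer("rough(bob)", facts, rules)
verified = check_proof(("rough(bob)", False), proof, facts, rules)
corrupted_ok = check_proof(("rough(bob)", False), proof[1:], facts, rules)
```

In this toy case the query about rough(bob) is answered False with a three-step proof that the checker accepts, while the same proof with its first step removed is rejected. That mirrors, on a small scale, why the abstract reports that proof-based methods depend on proof integrity.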
