Natural-Language Policy Reasoning with Proof Generation: Turning Platform Rules into Verifiable Knowledge
DOI: https://doi.org/10.51903/jtie.v4i2.505
Keywords: Policy Reasoning, Rule-Based Inference, Natural Language Rules, Proof Generation, Explainability
Abstract
Policy and compliance systems increasingly express rules in natural language, yet enforcement requires deterministic decisions and auditable explanations. This paper studies a practical pipeline that converts natural-language facts and rules into a verifiable knowledge base, answers queries with three-valued semantics (True/False/Unknown), and produces machine-checkable proofs. The contribution is system-level rather than a new reasoning formalism: we integrate controlled-language parsing, symbolic proof extraction, independent proof checking, and proof-based supervision in a single auditable framework. We evaluate the pipeline on two natural-language rule-reasoning benchmarks: (i) a balanced subset of ProofWriter’s open-world-assumption tasks (360 train, 360 test), and (ii) a RuleTaker-style dataset generated from its grammar and label semantics (1800 train, 900 test), both balanced across reasoning depths 0–5. We compare four systems: a text-only logistic regression baseline (LR-text), a retrieval-based “proof” baseline, a symbolic forward-chaining reasoner with proof extraction, and a classifier trained on generated proofs (LR-proof). To ensure fairness, LR-text and LR-proof share the same TF-IDF/logistic-regression setup, and the retrieval baseline uses the same representation with a fixed top-4 configuration. On ProofWriter-Balanced, the symbolic reasoner achieves 0.803 accuracy (0.808 macro-F1), while proof-trained classification reaches 0.825 accuracy (0.825 macro-F1). On RuleTaker-Rep, both methods achieve 1.000 accuracy. Proof verifiability clearly separates faithful from post-hoc explanations: symbolic proofs are verifiable for all predictions, whereas retrieval-based proofs are verifiable for only 31.4%. Sensitivity analyses that vary reasoning depth, distractors, and proof corruption show that proof-based methods remain robust to noise but depend on proof integrity.
These findings demonstrate the feasibility of auditable natural-language policy reasoning in controlled settings, while highlighting limitations in parser coverage and benchmark regularity.
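To make the pipeline described in the abstract concrete, the following is a minimal sketch of forward chaining with proof extraction, three-valued answers, and an independent proof checker. It is not the paper's implementation: it assumes ground (variable-free) literals written as plain strings, strong negation marked by a "not " prefix, and hypothetical names (`Rule`, `forward_chain`, `extract_proof`, `answer`, `check_proof`) and example policy facts chosen only for illustration.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

class Rule(NamedTuple):
    name: str
    body: Tuple[str, ...]  # literals that must all hold
    head: str              # literal derived when the body holds

def negate(lit: str) -> str:
    # Assumption: negation is encoded as a "not " prefix on the literal.
    return lit[4:] if lit.startswith("not ") else "not " + lit

def forward_chain(facts: Set[str], rules: List[Rule]) -> Dict[str, tuple]:
    """Derive all literals; map each to ("fact",) or (rule name, body)."""
    proofs: Dict[str, tuple] = {f: ("fact",) for f in facts}
    changed = True
    while changed:
        changed = False
        for r in rules:
            if r.head not in proofs and all(b in proofs for b in r.body):
                proofs[r.head] = (r.name, r.body)
                changed = True
    return proofs

def extract_proof(goal: str, proofs: Dict[str, tuple]) -> List[tuple]:
    """List (literal, step) pairs with premises before conclusions."""
    out, seen = [], set()
    def visit(lit: str) -> None:
        if lit in seen:
            return
        seen.add(lit)
        step = proofs[lit]
        if step[0] != "fact":
            for b in step[1]:
                visit(b)
        out.append((lit, step))
    visit(goal)
    return out

def answer(query: str, facts: Set[str], rules: List[Rule]):
    """Three-valued answer: True, False (negation derived), or Unknown."""
    proofs = forward_chain(facts, rules)
    if query in proofs:
        return "True", extract_proof(query, proofs)
    if negate(query) in proofs:
        return "False", extract_proof(negate(query), proofs)
    return "Unknown", []

def check_proof(proof: List[tuple], facts: Set[str], rules: List[Rule]) -> bool:
    """Re-validate every step against the knowledge base, without the reasoner."""
    by_name = {r.name: r for r in rules}
    established: Set[str] = set()
    for lit, step in proof:
        if step[0] == "fact":
            ok = lit in facts
        else:
            rule = by_name.get(step[0])
            ok = (rule is not None and rule.head == lit
                  and tuple(step[1]) == rule.body
                  and all(b in established for b in rule.body))
        if not ok:
            return False
        established.add(lit)
    return True

# Hypothetical policy knowledge base, for illustration only.
facts = {"user is verified", "post contains link"}
rules = [
    Rule("r1", ("user is verified",), "user is trusted"),
    Rule("r2", ("user is trusted", "post contains link"), "post is allowed"),
    Rule("r3", ("user is banned",), "not post is allowed"),
]
label, proof = answer("post is allowed", facts, rules)
print(label, check_proof(proof, facts, rules))
print(check_proof(proof[1:], facts, rules))  # corrupted proof: first step dropped
```

In this sketch the checker never calls the reasoner, so a proof with a missing or altered step fails verification even when the predicted label is correct. An Unknown answer carries an empty proof here, which is a simplification.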
License
Copyright (c) 2025 Xiaofei Luo

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

