Numerical-Reasoning Guardrails for a Quant Research Assistant: A Compact Reproducible Benchmark Using SEC and FRED Data

Zeyi Li; Kai Zhang; Annie Wong

doi:10.51903/jtie.v5i2.541

Authors

Zeyi Li, Industrial Engineering, New York University, NY, USA
Kai Zhang, Financial Engineering, Baruch College, NY, USA
Annie Wong, Computer Science, Cornell Tech, NY, USA

Keywords:

AI safety, Financial large language models, Guardrails, Numerical reasoning, Quant research assistant

Abstract

This paper presents a compact, reproducible benchmark for evaluating numerical-reasoning guardrails in a quant research assistant. The revised experiment uses a fixed 2026 source snapshot derived from the SEC 2026 Q1 Financial Statement Data Sets and the FRED CSV series VIXCLS, DGS10, DGS3MO and T10Y3M. The benchmark contains 460 tasks: 300 SEC financial-ratio tasks over 50 issuer-period records, 120 FRED VIX and Treasury-rate change tasks, and 40 macro-regime classification tasks. Each answer is evaluated by five programmatic guardrails: numeric consistency, unit correctness, time-window correctness, formula correctness and citation/source consistency. Four controlled response profiles are tested: Naive-RAG, Calculator-Only, Prompted-Checklist and Guarded-Quant. These profiles are deterministic failure-mode controls, not performance claims about any particular deployed LLM. The empirical results show that arithmetic alone is not sufficient for financial safety: Calculator-Only reaches 79.78% numeric accuracy but only a 0.43% all-guardrails pass rate, because its source, unit, formula and window fields often fail. Guarded-Quant achieves an 88.48% all-guardrails pass rate, with 97.17% numeric accuracy and pass rates of 100.00% for units, 96.30% for time windows, 98.26% for formulas and 96.30% for citations. The findings support a modest claim: a compact benchmark can make numerical audit failures visible, but it should not be read as evidence of general quant-assistant reliability without wider data coverage, live model outputs and operational stress tests.
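
The gap between numeric accuracy and the all-guardrails pass rate follows from how the two are aggregated: the former counts one check, the latter requires all five checks to pass on the same answer. The following is a minimal Python sketch of that aggregation, not the benchmark's actual implementation. The class name `GuardrailResult`, its field names, the function `pass_rates` and the toy data are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GuardrailResult:
    """Pass/fail flags for one answer (hypothetical field names)."""
    numeric: bool
    unit: bool
    window: bool
    formula: bool
    citation: bool

    def all_pass(self) -> bool:
        # An answer passes only if every individual guardrail passes.
        return all(getattr(self, f.name) for f in fields(self))


def pass_rates(results):
    """Return per-guardrail pass rates plus the joint all-guardrails rate."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to aggregate")
    rates = {
        f.name: sum(getattr(r, f.name) for r in results) / n
        for f in fields(GuardrailResult)
    }
    rates["all_guardrails"] = sum(r.all_pass() for r in results) / n
    return rates


# Toy data for illustration only; these are not benchmark results.
toy = [
    GuardrailResult(True, True, True, True, True),
    GuardrailResult(True, False, True, True, False),
    GuardrailResult(True, True, False, True, True),
    GuardrailResult(False, True, True, False, True),
]
rates = pass_rates(toy)
print(rates)
```

In this toy set each individual guardrail passes on three of four answers, yet only one answer passes all five, so the joint rate is well below every per-guardrail rate. The same mechanism, at a larger scale, is what separates the Calculator-Only profile's numeric accuracy from its all-guardrails pass rate.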
