Off-Policy Evaluation and Conservative Policy Selection for Slot-Level Dynamic Bidding and Ranking on the Open Bandit Dataset (Small)

Tong Ye; Jinyi Mu; James Hunter

doi:10.51903/jtie.v5i1.503

Authors

Tong Ye, Computer Science, Northeastern University, CA, USA
Jinyi Mu, Computer Science and Engineering, UCSD, CA, USA
James Hunter, Computer Science, University of Colorado Boulder, CO, USA

Keywords:

Offline Reinforcement Learning, Contextual Bandits, Dynamic Ranking, Dynamic Bidding

Abstract

Dynamic bidding and ranking systems must improve revenue or engagement while avoiding harmful regressions during deployment. This paper presents an end-to-end workflow for off-policy evaluation (OPE) and conservative policy selection, run entirely offline, for slot-level contextual bandit approximations of ranking decisions. Using the small Open Bandit Dataset (OBD-small) from ZOZOTOWN (ZOZO, Inc.), each logged row is treated as a context-dependent choice among discrete actions (items), with a binary click reward and a logged propensity. This formulation is suitable at the slot level but does not capture full listwise ranking or multi-step offline reinforcement learning. Empirically, highly deterministic evaluation policies exhibit extreme variance under sparse clicks, while the logistic reward model remains weak (ROC-AUC ≈ 0.5), which limits the interpretability of the direct method (DM) and doubly robust (DR) estimates. Clipped-DR mixing yields only limited certified improvements: in the women’s campaign, gains appear only at moderate confidence (δ=0.10) and for caps up to M=5, whereas stricter or looser settings revert to baseline; in the men’s campaign, certification is largely absent. These findings show that OPE diagnostics and conservative mixing enable reproducible offline selection under uncertainty, but they do not indicate deployment-ready improvements.
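The selection step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: that M is a cap on importance weights, that "mixing" means the policy (1 - alpha) * pi_b + alpha * pi_e, that certification compares a one-sided normal-approximation lower bound at level δ against the mean logged reward, and that rows carry the hypothetical keys `reward`, `action`, `pi_b`, `pi_e`, and `q_hat`.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def clipped_dr_terms(rows, alpha, cap):
    """Per-row clipped doubly robust terms for the mixed policy.

    Assumed (hypothetical) row schema:
      reward: observed 0/1 click
      action: index of the logged action
      pi_b:   logging-policy probabilities over actions
      pi_e:   candidate-policy probabilities over actions
      q_hat:  reward-model predictions per action
    """
    terms = []
    for row in rows:
        a = row["action"]
        # Mixed policy: (1 - alpha) * logging policy + alpha * candidate policy.
        pi_mix = [(1 - alpha) * b + alpha * e
                  for b, e in zip(row["pi_b"], row["pi_e"])]
        # Importance weight, clipped at the cap (assumed meaning of M).
        weight = min(pi_mix[a] / row["pi_b"][a], cap)
        # Direct-method part: model value of the mixed policy in this context.
        model_value = sum(p * q for p, q in zip(pi_mix, row["q_hat"]))
        # DR correction: weighted residual on the logged action.
        terms.append(model_value + weight * (row["reward"] - row["q_hat"][a]))
    return terms


def lower_bound(terms, delta):
    """One-sided normal-approximation lower bound (an assumed choice of bound)."""
    z = NormalDist().inv_cdf(1 - delta)
    return mean(terms) - z * stdev(terms) / sqrt(len(terms))


def select_alpha(rows, alphas, cap, delta):
    """Largest alpha whose lower bound beats the logged mean reward, else 0.0."""
    baseline_value = mean(row["reward"] for row in rows)
    best = 0.0  # alpha = 0 means: keep the logging policy (baseline)
    for alpha in sorted(alphas):
        if lower_bound(clipped_dr_terms(rows, alpha, cap), delta) > baseline_value:
            best = alpha
    return best
```

In this sketch a candidate is accepted only when its lower bound exceeds the baseline value, so small samples or high-variance weights fall back to alpha = 0. That is one way to read the reported behavior in which most settings revert to baseline.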
