Off-Policy Evaluation and Conservative Policy Selection for Slot-Level Dynamic Bidding and Ranking on the Open Bandit Dataset (Small)
DOI:
https://doi.org/10.51903/jtie.v5i1.503
Keywords:
Offline Reinforcement Learning, Contextual Bandits, Dynamic Ranking, Dynamic Bidding
Abstract
Dynamic bidding and ranking systems must improve revenue or engagement while avoiding harmful regressions during deployment. This paper presents an end-to-end off-policy evaluation (OPE) and conservative policy-selection workflow for slot-level contextual bandit approximations of ranking decisions. Using the small Open Bandit Dataset (OBD-small) from ZOZOTOWN (ZOZO, Inc.), each logged row is treated as a context-dependent choice among discrete actions (items), with a binary click reward and a logged propensity. This formulation is suitable at the slot level but does not capture full listwise ranking or multi-step offline reinforcement learning. Empirically, highly deterministic evaluation policies exhibit extreme variance under sparse clicks, and the logistic reward model remains weak (ROC-AUC ≈ 0.5), which limits the interpretability of the direct method (DM) and doubly robust (DR) estimates. Mixing based on clipped DR yields only limited certified improvements: in the women's campaign, gains appear only at moderate confidence (δ=0.10) and for caps up to M=5, whereas stricter or looser settings revert to the baseline; in the men's campaign, certification is largely absent. These findings show that OPE diagnostics and conservative mixing enable reproducible offline selection under uncertainty, but they do not indicate deployment-ready improvements.
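As an illustration of the kind of estimator the abstract refers to, the following is a minimal sketch of a clipped doubly robust value estimate for logged bandit data. It is not the authors' implementation. It assumes that the cap M bounds the importance weights, and the function name `clipped_dr` and its argument names are hypothetical.

```python
import numpy as np


def clipped_dr(reward, action, pscore, pi_e, q_hat, cap):
    """Clipped doubly robust estimate of an evaluation policy's value.

    All names here are illustrative assumptions, not the paper's API.
    reward : (n,) observed binary clicks
    action : (n,) index of the logged action in each row
    pscore : (n,) logged propensity of the logged action
    pi_e   : (n, k) evaluation-policy probabilities over k actions
    q_hat  : (n, k) reward-model predictions for every action
    cap    : upper bound M applied to the importance weights (assumed role)
    """
    rows = np.arange(reward.shape[0])
    # Importance weight of the logged action, clipped at the cap.
    weight = np.minimum(pi_e[rows, action] / pscore, cap)
    # Direct-method term: model-based value under the evaluation policy.
    dm_term = np.sum(pi_e * q_hat, axis=1)
    # Correction term: weighted residual of the reward model on logged actions.
    correction = weight * (reward - q_hat[rows, action])
    return float(np.mean(dm_term + correction))


if __name__ == "__main__":
    # Tiny synthetic log: 4 rows, 2 actions, uniform logging policy.
    reward = np.array([1.0, 0.0, 0.0, 1.0])
    action = np.array([0, 1, 0, 1])
    pscore = np.full(4, 0.5)
    pi_e = np.full((4, 2), 0.5)
    q_hat = np.zeros((4, 2))
    print(clipped_dr(reward, action, pscore, pi_e, q_hat, cap=5.0))
```

With an uninformative reward model (here all zeros), the estimate reduces to clipped inverse propensity scoring, which is one way to see why a reward model with ROC-AUC near 0.5 adds little beyond the importance-weighted term.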
License
Copyright (c) 2026 Tong Ye, Jinyi Mu, James Hunter

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

