Offline Counterfactual Evaluation for Advertising and Recommendation Slot Policies: A Reproducible Study on the Open Bandit Dataset (Small)

Jinyi Mu; Tong Ye; Priya Patel

doi:10.51903/jtie.v4i3.500

Authors

Jinyi Mu, Computer Science and Engineering, UCSD, CA, USA
Tong Ye, Computer Science, Northeastern University, CA, USA
Priya Patel, Computer Science, Heriot-Watt University, Edinburgh, UK

Keywords:

Off-Policy Evaluation, Counterfactual Evaluation, Contextual Bandits, Inverse Propensity Scoring

Abstract

Offline, or counterfactual, evaluation is a critical capability for iterating on advertising and recommender ranking strategies when online A/B testing is slow, expensive, or risky. Off-policy evaluation (OPE) estimates the expected reward of a candidate policy from interaction data logged under a different behavior policy. However, OPE can suffer from high variance when overlap between the two policies is poor, and it can mislead when the operational objective is to choose among candidate policies rather than to minimize point-estimation bias alone. This paper presents a fully reproducible empirical study of inverse propensity scoring (IPS), self-normalized IPS (SNIPS), doubly robust (DR), and Switch-DR estimators on the Open Bandit Dataset (OBD) small release. Using the Men and Women campaigns (10,000 logged item-impressions per campaign and behavior policy), collected under uniform random and Bernoulli Thompson Sampling (BTS) logging, we construct a held-out oracle for stationary slot-wise policies from the random-traffic split and evaluate both value estimation and policy-ranking consistency on random-logged and BTS-logged test sets. Across 1,000 nonparametric bootstrap replications, IPS and SNIPS are accurate on randomly logged data, whereas BTS-logged data exhibit extreme importance weights and very small effective sample sizes (ESS), making IPS-based ranking unreliable under weak support. Switch-DR is most useful in moderate-overlap regimes, where it truncates high-variance corrections, but it introduces bias that depends on the switching threshold and must therefore be stress-tested rather than treated as a universally superior estimator. Finally, we provide a structured reporting template, based on oracle decomposition, overlap diagnostics, and estimator components, for explaining why a policy appears better and how reliable that conclusion is.
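To make the estimators named in the abstract concrete, the following is a minimal sketch of their commonly used textbook forms, together with the Kish effective sample size and a percentile bootstrap. It is not the authors' code: every function and argument name here (importance_weights, q_logged, q_pi, tau, and so on) is a hypothetical choice for illustration, and the reward-model predictions are assumed to be supplied by the caller.

```python
import numpy as np


def importance_weights(pi_e, pi_b):
    """w_i = pi_e(a_i | x_i) / pi_b(a_i | x_i), evaluated at the logged actions."""
    return np.asarray(pi_e, dtype=float) / np.asarray(pi_b, dtype=float)


def ips(w, r):
    """Inverse propensity scoring: mean of weighted rewards."""
    return float(np.mean(w * r))


def snips(w, r):
    """Self-normalized IPS: weighted rewards divided by the sum of weights."""
    return float(np.sum(w * r) / np.sum(w))


def dr(w, r, q_logged, q_pi):
    """Doubly robust: model baseline plus an importance-weighted residual.

    q_logged: model prediction for the logged action (assumed given).
    q_pi:     model prediction averaged over the evaluation policy,
              i.e. sum_a pi_e(a | x_i) * q(x_i, a) (assumed given).
    """
    return float(np.mean(q_pi + w * (r - q_logged)))


def switch_dr(w, r, q_logged, q_pi, tau):
    """Switch-DR: keep the residual correction only where w_i <= tau.

    Rounds with w_i > tau fall back to the model baseline alone, which
    lowers variance but adds a bias that depends on tau.
    """
    keep = w <= tau
    return float(np.mean(q_pi + keep * w * (r - q_logged)))


def effective_sample_size(w):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w) ** 2 / np.sum(w ** 2))


def bootstrap_interval(estimator, arrays, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval from a nonparametric bootstrap over logged rounds."""
    rng = np.random.default_rng(seed)
    n = len(arrays[0])
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rounds with replacement
        stats[b] = estimator(*(np.asarray(a)[idx] for a in arrays))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

A call such as bootstrap_interval(ips, (w, r)) resamples the logged rounds and recomputes the estimate on each replicate. Reporting effective_sample_size(w) next to each estimate is one way to surface the weak-overlap problem the abstract describes for BTS-logged data. Note that snips is undefined on a bootstrap replicate whose weights are all zero, so a production version would need to guard against that case.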
