Privacy-Robust Incrementality Estimation in Cookieless Settings via Uplift Modeling: Reproducible Evidence from the Hillstrom E-Mail Experiment

Jingwen Bai; Haozhe Wang; Qiyou Wu; Boning Zhang

doi:10.51903/jtie.v5i1.468

Authors

Jingwen Bai, Data Science, Columbia University, NY, USA
Haozhe Wang, Operations Research and Information Engineering, Cornell, NY, USA
Qiyou Wu, Artificial Intelligence, Northeastern University, MA, USA
Boning Zhang, Computer Science, Georgetown University, DC, USA

Keywords:

Cookieless Measurement, Incrementality, Uplift Modeling, Heterogeneous Treatment Effects, Differential Privacy

Abstract

Measuring advertising incrementality in the absence of user-level identifiers is increasingly constrained by platform policies and privacy regulations. In cookieless environments, practitioners often observe only aggregated or weak signals (e.g., cohort-level conversion counts) and must still estimate the causal lift of an intervention while quantifying uncertainty. This paper studies cookieless incrementality evaluation through the lens of uplift and individual treatment effect (ITE) modeling under explicit privacy constraints. We run the full set of experiments on the MineThatData (Hillstrom) E-Mail Analytics Challenge dataset (64,000 customers in a randomized controlled experiment with three arms). We cast the task as a binary treatment problem (sending any e-mail campaign versus sending none) and compare six ITE estimators (S-, T-, X-, R-, and doubly robust learners, plus transformed-outcome regression) against cohort-only estimators that emulate cookieless measurement. The cohort estimator uses only aggregated counts and a Bayesian beta–binomial model to shrink noisy rates, and we evaluate its robustness under k-anonymity thresholds and Laplace-noised differentially private aggregates. On held-out test data, the best ID-level model (T-learner with logistic regression) achieves a Qini coefficient of 6.675 and improves the estimated policy conversion rate when targeting the top 20% of customers by predicted uplift. Cohort-only estimation retains a weaker and more variable signal; its point estimate is sensitive to privacy constraints, but its uncertainty intervals remain approximately valid, with 0.892 empirical coverage for a nominal 95% interval in cohort-level validation.
The results demonstrate that (i) causal lift is estimable without identifiers when randomized experimentation is available, (ii) doubly robust estimators provide strong performance and fast scoring, and (iii) privacy-preserving aggregation introduces an accuracy–privacy trade-off that can be quantified and monitored using bootstrap and Bayesian uncertainty.
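
The cohort-only estimator described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `cohort_lift`, the fixed Beta(1, 1) prior, the `k_min` threshold, and the choice to add Laplace noise only to conversion counts (treating cohort sizes as public, with sensitivity 1 per count) are all assumptions made for illustration. The posterior also ignores the variance added by the noise, so intervals under `epsilon` are likely too narrow.

```python
import numpy as np


def cohort_lift(conv_t, n_t, conv_c, n_c, k_min=50, epsilon=None,
                prior=(1.0, 1.0), n_draws=4000, seed=0):
    """Per-cohort lift (treated rate minus control rate) from counts only.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    conv_t, n_t, conv_c, n_c = (
        np.asarray(a, dtype=float) for a in (conv_t, n_t, conv_c, n_c)
    )

    # k-anonymity-style suppression: report a cohort only if both arms
    # contain at least k_min users.
    keep = (n_t >= k_min) & (n_c >= k_min)

    if epsilon is not None:
        # Laplace mechanism on conversion counts (assumed sensitivity 1),
        # then clip back to the valid range [0, n].
        scale = 1.0 / epsilon
        conv_t = np.clip(conv_t + rng.laplace(0.0, scale, conv_t.shape), 0.0, n_t)
        conv_c = np.clip(conv_c + rng.laplace(0.0, scale, conv_c.shape), 0.0, n_c)

    # Beta-binomial model: Beta(a0, b0) prior on each arm's conversion rate,
    # which pulls rates from small cohorts toward the prior mean.
    a0, b0 = prior
    shape = (n_draws, conv_t.size)
    p_t = rng.beta(a0 + conv_t, b0 + n_t - conv_t, size=shape)
    p_c = rng.beta(a0 + conv_c, b0 + n_c - conv_c, size=shape)

    # Posterior draws of the lift, summarised by mean and 95% interval.
    lift = p_t - p_c
    lo, hi = np.percentile(lift, [2.5, 97.5], axis=0)
    mask = np.where(keep, 1.0, np.nan)  # suppressed cohorts become NaN
    return {"keep": keep, "mean": lift.mean(axis=0) * mask,
            "lo": lo * mask, "hi": hi * mask}


# Two made-up cohorts: one large, one too small to report.
result = cohort_lift([30, 2], [1000, 10], [20, 1], [1000, 10], epsilon=1.0)
print(result["keep"], result["mean"], result["lo"], result["hi"])
```

Rerunning the call over a grid of `epsilon` and `k_min` values is one way to trace the accuracy–privacy trade-off the abstract refers to. An empirical-Bayes prior fitted across cohorts could replace the fixed prior if stronger shrinkage is wanted.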
