Calibrated Resume-Job Matching for Trustworthy LLM-Assisted Recruiter Screening: Pairwise Matching, Probability Calibration, and Selective Refusal on Two Public Recruitment Datasets

Jiaying Jin

doi:10.51903/jtie.v4i3.529

Authors

Jiaying Jin, Applied Analytics, Columbia University, NY, USA

DOI:

https://doi.org/10.51903/jtie.v4i3.529

Keywords:

resume-job matching, recruiter screening, probability calibration, selective refusal, trustworthy AI

Abstract

Recruiter screening increasingly relies on large language model (LLM)-assisted workflows, but high-stakes applications require reproducible matching, calibrated probabilities, and reliable handling of uncertain cases. This study evaluates a screening framework combining matching, calibration, and selective refusal using two public datasets: resume-job-description-fit for supervised pairwise learning and Resume-Screening-Dataset for benchmarking and external generalization. After deterministic preprocessing, we compared cosine similarity, alignment features, TF-IDF pairwise models, and hybrid models integrating text, alignment, and title information. The strongest probabilistic models were calibrated with Platt scaling and isotonic regression and evaluated under confidence-based refusal. On the resume-job-description-fit test set, the best three-class model achieved a macro-F1 of 0.450. For binary shortlist-versus-reject screening, the title-augmented hybrid model obtained 0.654 balanced accuracy, 0.647 F1, and 0.699 AUROC. Platt calibration improved probability estimates by reducing the Brier score from 0.232 to 0.226 and negative log-likelihood from 0.772 to 0.675. Selective refusal further improved in-domain accuracy, while cross-dataset transfer remained weak (AUROC 0.47–0.51). These results indicate that matching, calibration, and selective refusal enhance trustworthy within-domain screening, although human review remains essential under distribution shift.
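The calibration and refusal steps named in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: it assumes a binary shortlist-versus-reject setting, fits a simplified Platt-style sigmoid by plain gradient descent (without the target smoothing of Platt's original method), and refuses whenever the larger class probability falls below a threshold. All function names, the learning rate, the step count, and the threshold are hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, steps=500):
    # Simplified Platt-style scaling (assumption): learn a, b so that
    # sigmoid(a * score + b) approximates P(label = 1 | score),
    # by gradient descent on the mean log-loss of a held-out set.
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def brier_score(probs, labels):
    # Mean squared error between predicted probabilities and 0/1 labels.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def negative_log_likelihood(probs, labels, eps=1e-12):
    # Mean binary log-loss; probabilities are clipped to avoid log(0).
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def selective_predict(probs, threshold=0.8):
    # Confidence-based refusal: return 1 or 0 when the model is confident,
    # and None (defer to a human reviewer) otherwise.
    decisions = []
    for p in probs:
        confidence = max(p, 1.0 - p)
        decisions.append(None if confidence < threshold else int(p >= 0.5))
    return decisions


def coverage_and_accuracy(decisions, labels):
    # Coverage is the fraction of cases answered; accuracy is computed
    # only over the answered cases (None if everything was refused).
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(kept) / len(decisions)
    accuracy = sum(d == y for d, y in kept) / len(kept) if kept else None
    return coverage, accuracy
```

In such a setup, the sigmoid parameters would be fitted on a held-out calibration split, the Brier score and negative log-likelihood compared before and after calibration, and the refusal threshold swept to trace how accuracy on answered cases changes as coverage drops.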

References

AzharAli. (2022). Resume-Screening-Dataset [Data set]. Hugging Face. Retrieved April 11, 2026, from https://huggingface.co/datasets/AzharAli05/Resume-Screening-Dataset

Chow, C. K. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1), 41–46. https://doi.org/10.1109/TIT.1970.1054406

cnamuangtoun. (2024). resume-job-description-fit [Data set]. Hugging Face. https://huggingface.co/datasets/cnamuangtoun/resume-job-description-fit

Daren Zheng, Boning Zhang, & Julie Geibel. (2024). VerifySafe: Toxicity-Safe Agent Responses under Adversarial Prompts with Evidence-Based Self-Verification. Journal of Advanced Computing Systems, 4(1), 67–82. https://doi.org/10.69987/JACS.2024.40106

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

El-Yaniv, R., & Wiener, Y. (2010). On the foundations of noise-free selective classification. Journal of Machine Learning Research, 11, 1605–1641.

Gao, T., Yao, X., & Chen, D. (2021). SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 6894–6910). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.552

Geifman, Y., & El-Yaniv, R. (2017). Selective classification for deep neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 30, pp. 4878–4887). Curran Associates, Inc. https://papers.nips.cc/paper/7073-selective-classification-for-deep-neural-networks

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 1321–1330). PMLR. https://proceedings.mlr.press/v70/guo17a.html

Hendrycks, D., & Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations. https://openreview.net/forum?id=Hkg4TI9xl

Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., & Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (pp. 2333–2338). Association for Computing Machinery. https://doi.org/10.1145/2505515.2505665

Jiang, H., Kim, B., Guan, M. Y., & Gupta, M. (2018). To trust or not to trust a classifier. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 31, pp. 5546–5557). Curran Associates, Inc. https://papers.nips.cc/paper/7798-to-trust-or-not-to-trust-a-classifier

Jing Chen, Xinzhuo Sun, & Vincent Brown. (2023). Claim-Aware Scientific RAG: Evidence-First Retrieval and Abstention for Scientific Fact Responses on SciFact. Journal of Advanced Computing Systems, 3(1), 16–30. https://doi.org/10.69987/JACS.2023.30102

Köchling, A., & Wehner, M. C. (2020). Discriminated by an algorithm: A systematic review of discrimination and fairness by algorithmic decision-making in the context of HR recruitment and HR development. Business Research, 13(3), 795–848. https://doi.org/10.1007/s40685-020-00134-w

Kull, M., Silva Filho, T. M., & Flach, P. (2017). Beta calibration: A well-founded and easily implemented improvement on logistic calibration for binary classifiers. In A. Singh & J. Zhu (Eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (Vol. 54, pp. 623–631). PMLR. https://proceedings.mlr.press/v54/kull17a.html

Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. V. N. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 30, pp. 6402–6413). Curran Associates, Inc. https://papers.nips.cc/paper/7219-simple-and-scalable-predictive-uncertainty-estimation-using-deep-ensembles

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

Niculescu-Mizil, A., & Caruana, R. (2005). Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (pp. 625–632). Association for Computing Machinery. https://doi.org/10.1145/1102351.1102430

Platt, J. C. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 61–74). MIT Press.

Raghavan, M., Barocas, S., Kleinberg, J., & Levy, K. (2020). Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 469–481). Association for Computing Machinery. https://doi.org/10.1145/3351095.3372828

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3982–3992). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410

Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. In T. Fawcett & N. Mishra (Eds.), Proceedings of the Twentieth International Conference on Machine Learning (pp. 616–623). AAAI Press.

Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. https://doi.org/10.1561/1500000019

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.

Xinzhuo Sun, Yifei Lu, & Jing Chen. (2023). Controllable Long-Term User Memory for Multi-Session Dialogue: Confidence-Gated Writing, Time-Aware Retrieval-Augmented Generation, and Update/Forgetting. Journal of Advanced Computing Systems, 3(8), 9–24. https://doi.org/10.69987/JACS.2023.30802

Yunhe Li. (2024). Findable then Explainable: Retrieval–Summary Integration for Code Intelligence on a Lightweight CodeSearchNet Subset. Journal of Advanced Computing Systems, 4(7), 65–82. https://doi.org/10.69987/JACS.2024.40706

Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In C. E. Brodley & A. P. Danyluk (Eds.), Proceedings of the Eighteenth International Conference on Machine Learning (pp. 609–616). Morgan Kaufmann.

Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 694–699). Association for Computing Machinery. https://doi.org/10.1145/775047.775151