A Therapist-Facing Session Copilot for Live Counseling Support: Reasoning-Guided Retrieval and Ranking from Multi-Turn Counseling Dialogues

Yifan Zhang; Hailey Zhang

doi:10.51903/jtie.v4i2.547

Authors

Yifan Zhang, Department of Counseling and Clinical Psychology, Teachers College, Columbia University
Hailey Zhang, Department of Electrical and Computer Engineering, Carnegie Mellon University, PA, USA

Keywords:

large language models, counseling dialogue, therapy copilot, reasoning retrieval, bilingual mental health dataset, live-session support

Abstract

This study develops and evaluates a retrieval-based therapist-facing session copilot for live counseling support using the public bilingual Psy-Insight dataset. Rather than providing autonomous psychotherapy or relying on a generative large language model (LLM), the system assists human therapists by ranking historical responses, retrieving interpretable rationales, and providing conservative contextual support. The reproducible pipeline combines TF-IDF representations, class-balanced LinearSVC routers, nearest-neighbor rationale retrieval, and label-aware response ranking without LLM fine-tuning. Experiments use all 520 English and 431 Chinese sessions (6,208 and 5,776 turns, respectively) with session-level train/dev/test splits. Psychotherapy routing achieves strong macro-F1 scores of 0.897 in English and 0.757 in Chinese, whereas strategy routing remains weak (0.253 and 0.268). Label-aware rationale retrieval improves ROUGE-L from 0.145 to 0.152 in English and from 0.142 to 0.151 in Chinese. The best response-ranking approach presents retrieved reasoning in parallel rather than through reasoning-fused reranking, increasing MRR from 0.498 to 0.541 in English and from 0.519 to 0.523 in Chinese while maintaining low latency (6.98–14.64 ms/query). These results demonstrate computational feasibility but do not establish therapeutic safety or clinical effectiveness.
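The two headline metrics above, macro-F1 for routing and mean reciprocal rank (MRR) for response ranking, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy labels "A" and "B", and the toy candidate IDs are assumptions made only for illustration, and the sketch assumes one gold response per query.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def mean_reciprocal_rank(ranked_lists: Sequence[Sequence[Hashable]],
                         gold: Sequence[Hashable]) -> float:
    """Average of 1/rank of the gold item per query (0 if the gold item is absent)."""
    if not gold:
        return 0.0
    total = 0.0
    for candidates, target in zip(ranked_lists, gold):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(gold)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three queries whose gold response "r1" is ranked 1st, 2nd, 3rd.
rankings = [["r1", "r2", "r3"], ["r2", "r1", "r3"], ["r3", "r2", "r1"]]
mrr = mean_reciprocal_rank(rankings, ["r1", "r1", "r1"])  # (1 + 1/2 + 1/3) / 3

# Hypothetical routing labels "A" and "B".
f1 = macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])  # (2/3 + 4/5) / 2
```

Because macro-F1 weights every class equally, a router that does well on frequent labels can still score low when rare labels are missed, which is one plausible reading of the gap between the psychotherapy and strategy routing scores reported above.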
