A Therapist-Facing Session Copilot for Live Counseling Support: Reasoning-Guided Retrieval and Ranking from Multi-Turn Counseling Dialogues
DOI:
https://doi.org/10.51903/jtie.v4i2.547
Keywords:
large language models, counseling dialogue, therapy copilot, reasoning retrieval, bilingual mental health dataset, live-session support
Abstract
This study develops and evaluates a retrieval-based therapist-facing session copilot for live counseling support using the public bilingual Psy-Insight dataset. Rather than providing autonomous psychotherapy or relying on a generative large language model (LLM), the system assists human therapists by ranking historical responses, retrieving interpretable rationales, and providing conservative contextual support. The reproducible pipeline combines TF-IDF representations, class-balanced LinearSVC routers, nearest-neighbor rationale retrieval, and label-aware response ranking without LLM fine-tuning. Experiments use all 520 English and 431 Chinese sessions (6,208 and 5,776 turns, respectively) with session-level train/dev/test splits. Psychotherapy routing achieves strong macro-F1 scores of 0.897 in English and 0.757 in Chinese, whereas strategy routing remains weak (0.253 and 0.268). Label-aware rationale retrieval improves ROUGE-L from 0.145 to 0.152 in English and from 0.142 to 0.151 in Chinese. The best response-ranking approach presents retrieved reasoning in parallel rather than through reasoning-fused reranking, increasing MRR from 0.498 to 0.541 in English and from 0.519 to 0.523 in Chinese while maintaining low latency (6.98–14.64 ms/query). These results demonstrate computational feasibility but do not establish therapeutic safety or clinical effectiveness.
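The response-ranking results above are reported as mean reciprocal rank (MRR). As a reading aid, the following is a minimal sketch of how MRR is conventionally computed, assuming each query has exactly one gold response among the ranked candidates. The function names and toy identifiers are hypothetical and are not taken from the study's code.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], gold_id: str) -> float:
    """Return 1/rank of the gold item (1-indexed), or 0.0 if it is not ranked."""
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate == gold_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], gold_ids: Sequence[str]
) -> float:
    """Average the reciprocal rank over all queries."""
    if len(rankings) != len(gold_ids):
        raise ValueError("rankings and gold_ids must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold_ids))
    return total / len(rankings)


# Toy data (hypothetical): three queries with candidate response IDs.
toy_rankings = [["r2", "r1", "r3"], ["r5", "r4"], ["r7", "r8", "r9"]]
toy_gold = ["r1", "r5", "r6"]

# Gold at rank 2, at rank 1, and missing: (1/2 + 1 + 0) / 3
mrr = mean_reciprocal_rank(toy_rankings, toy_gold)
print(f"MRR = {mrr:.3f}")
```

Under this definition, an MRR near 0.5 means the gold response sits, on average, around the second position of the ranked list.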
References
Beck, A. T., Rush, A. J., Shaw, B. F., & Emery, G. (1979). Cognitive therapy of depression. Guilford Press.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610–623). ACM. https://doi.org/10.1145/3442188.3445922
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., ... Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Chancellor, S., & De Choudhury, M. (2020). Methods in predictive techniques for mental health status on social media: A critical review. npj Digital Medicine, 3(1), Article 43. https://doi.org/10.1038/s41746-020-0233-7
Chen, K., Sun, Z., Wen, Y., Lian, H., Gao, Y., & Li, Y. (2025). Psy-Insight: Explainable multi-turn bilingual dataset for mental health counseling. arXiv. https://doi.org/10.48550/arXiv.2503.03607
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
Elliott, R., Bohart, A. C., Watson, J. C., & Greenberg, L. S. (2011). Empathy. Psychotherapy, 48(1), 43–49. https://doi.org/10.1037/a0022187
Elliott, R., Bohart, A. C., Watson, J. C., & Murphy, D. (2018). Therapist empathy and client outcome: An updated meta-analysis. Psychotherapy, 55(4), 399–410. https://doi.org/10.1037/pst0000175
Gibson, J., Xiao, B., Imel, Z. E., Georgiou, P., Atkins, D. C., & Narayanan, S. (2023). Multi-label multi-task deep learning for behavioral coding. IEEE Journal of Biomedical and Health Informatics, 27(2), 810–821. https://doi.org/10.1109/JBHI.2022.3213487
Hill, C. E. (2009). Helping skills: Facilitating exploration, insight, and action (3rd ed.). American Psychological Association.
Huo, B., Boyle, A., Marfo, N., Tangamornsuksan, W., Steen, J. P., McKechnie, T., Lee, Y., Mayol, J., Antoniou, S. A., Thirunavukarasu, A. J., Sanger, S., Ramji, K., & Guyatt, G. (2025). Large language models for chatbot health advice studies: A systematic review. JAMA Network Open, 8(2), e2457879. https://doi.org/10.1001/jamanetworkopen.2024.57879
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Chen, D., Dai, W., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), Article 248. https://doi.org/10.1145/3571730
Chen, J., Sun, X., & Brown, V. (2023). Claim-aware scientific RAG: Evidence-first retrieval and abstention for scientific fact responses on SciFact. Journal of Advanced Computing Systems, 3(1), 16–30. https://doi.org/10.69987/JACS.2023.30102
Liu, S., Zheng, C., Demasi, O., Sabour, S., Li, Y., Yu, Z., Jiang, Y., & Huang, M. (2021). Towards emotional support dialog systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 3469–3483). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.269
Norcross, J. C., & Wampold, B. E. (2011). Evidence-based therapy relationships: Research conclusions and clinical practices. Psychotherapy, 48(1), 98–102. https://doi.org/10.1037/a0022161
Norcross, J. C., & Wampold, B. E. (2018). A new therapy for each patient: Evidence-based relationships and responsiveness. Journal of Clinical Psychology, 74(11), 1889–1906. https://doi.org/10.1002/jclp.22678
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning (pp. 133–142).
Rashkin, H., Smith, E. M., Li, M., & Boureau, Y.-L. (2019). Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 5370–5381). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1534
Rennie, J. D. M., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the 20th International Conference on Machine Learning (pp. 616–623). AAAI Press.
Rogers, C. R. (1957). The necessary and sufficient conditions of therapeutic personality change. Journal of Consulting Psychology, 21(2), 95–103. https://doi.org/10.1037/h0045357
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 68539–68551.
Smith, E. M., Williamson, M., Shuster, K., Weston, J., & Boureau, Y.-L. (2020). Can you put it all together: Evaluating conversational agents’ ability to blend skills. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2021–2030). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.183
Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R., & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: A proposal for responsible development and evaluation. npj Mental Health Research, 3(1), Article 12. https://doi.org/10.1038/s44184-024-00056-z
Sun, H., Lin, Z., Zheng, C., Liu, S., & Huang, M. (2021). PsyQA: A Chinese dataset for generating long counseling text for mental health support. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (pp. 1489–1503). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.130
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
Sun, X., Chen, J., Zhou, B., & Kuo, M.-J. (2024). ConRAG: Contradiction-aware retrieval-augmented generation under multi-source conflicting evidence. Journal of Advanced Computing Systems, 4(7), 50–64. https://doi.org/10.69987/JACS.2024.40705
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=WE_vluYUL-X
Zheng, C., Liu, S., Cai, Y., Zhou, G., Yu, Z., & Huang, M. (2023). COMAE: A multi-factor hierarchical framework for empathetic response generation. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 10405–10423). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.659
License
Copyright (c) 2025 Yifan Zhang, Hailey Zhang

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

