A Lightweight Medical Foundation Model for Cross-Modal Multi-Task Pretraining and Parameter-Efficient Few-Shot Transfer on MedMNIST
DOI:
https://doi.org/10.51903/jtie.v4i3.492
Keywords:
medical foundation model, multi-task learning, MedMNIST, parameter-efficient fine-tuning
Abstract
Medical imaging has rapidly adopted pre-trained backbones, yet many transfer-learning pipelines remain expensive to train and difficult to adapt when data, compute, or privacy constraints limit full fine-tuning. We present STMedFM, a lightweight medical multi-task backbone baseline designed for fast prototyping across 2D images and 3D volumes. STMedFM uses modality-specific convolutional stems (2D and 3D) and a shared low-depth encoder, and it supports parameter-efficient transfer via Low-Rank Adaptation (LoRA) and bottleneck adapters. We pretrain STMedFM with supervised multi-task learning on four MedMNIST tasks (PathMNIST, BloodMNIST, DermaMNIST, and OrganMNIST3D) using official train/validation/test splits. We then compare (i) training from scratch, (ii) full fine-tuning from the multi-task checkpoint, and (iii) parameter-efficient fine-tuning (LoRA or adapters) that updates only a small fraction of parameters. Under a fixed compute budget (200 pretraining steps; 120 fine-tuning steps for 2D tasks; 50 steps for the 3D task), multi-task pretraining improved performance on PathMNIST (test accuracy 0.568 → 0.634; macro AUROC 0.886 → 0.914) and preserved most gains under PEFT (LoRA AUROC 0.909; Adapter AUROC 0.913) while training only 4,041–5,225 parameters versus 160,105 for full fine-tuning. For DermaMNIST, pretraining increased macro AUROC from 0.746 (Scratch, weighted) to 0.756 (Pretrain+Full), with similar AUROC under LoRA (0.760) and Adapter (0.763). In contrast, BloodMNIST and OrganMNIST3D showed mixed behavior, including cases where Scratch outperformed pretrained variants, indicating that transfer in this compact shared encoder is task-dependent and budget-sensitive. Calibration results were similarly non-monotonic: methods with better AUROC did not always achieve lower ECE. 
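The parameter savings reported for LoRA come from freezing the pretrained weights and training only a low-rank update added to them. The sketch below illustrates this idea for a single dense layer in NumPy. It is not the STMedFM implementation: the class name `LoRALinear`, the layer width of 64, the rank of 4, and the scaling factor `alpha` are all hypothetical values chosen for illustration, and the trainable-parameter count it prints will not match the figures in the abstract.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen "pretrained" weight and bias (random stand-ins here).
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))
        self.b = np.zeros(d_out)
        # Trainable factors: A gets a small random init and B starts at zero,
        # so the layer initially reproduces the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        base = x @ self.W.T + self.b          # frozen path
        update = (x @ self.A.T) @ self.B.T    # low-rank path, rank-limited
        return base + self.scale * update

    def trainable_params(self):
        # Only A and B would receive gradient updates.
        return self.A.size + self.B.size

    def frozen_params(self):
        return self.W.size + self.b.size


# Hypothetical sizes, not taken from the paper.
layer = LoRALinear(d_in=64, d_out=64, rank=4)
x = np.random.default_rng(1).normal(size=(2, 64))
y = layer.forward(x)
print(y.shape)
print(layer.trainable_params(), layer.frozen_params())
```

With these assumed sizes the low-rank factors hold 512 values against 4,160 frozen ones, which shows on a toy scale why a rank-limited update can train far fewer parameters than full fine-tuning.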
Overall, our results show that a small cross-modal multi-task model can serve as a practical MedMNIST-scale transfer baseline and that LoRA/adapters offer substantial parameter savings when task alignment is favorable. STMedFM should therefore be viewed as a lightweight supervised multi-task backbone on benchmark-scale tasks rather than a broadly general medical foundation model.
License
Copyright (c) 2025 Gaotian Mi, Tong Ye, Dan Wood

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

