A Lightweight Medical Foundation Model for Cross-Modal Multi-Task Pretraining and Parameter-Efficient Few-Shot Transfer on MedMNIST

Gaotian Mi; Tong Ye; Dan Wood

doi:10.51903/jtie.v4i3.492

Authors

Gaotian Mi, University of Pittsburgh, PA, USA
Tong Ye, Computer Science, Northeastern University, CA, USA
Dan Wood, Computer Engineering, Dartmouth College, NH, USA

Keywords:

medical foundation model, multi-task learning, MedMNIST, parameter-efficient fine-tuning

Abstract

Medical imaging has rapidly adopted pretrained backbones, yet many transfer-learning pipelines remain expensive to train and difficult to adapt when data, compute, or privacy constraints limit full fine-tuning. We present STMedFM, a lightweight multi-task backbone for medical imaging, intended as a baseline for fast prototyping across 2D images and 3D volumes. STMedFM uses modality-specific convolutional stems (2D and 3D) feeding a shared low-depth encoder, and it supports parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) and bottleneck adapters. We pretrain STMedFM with supervised multi-task learning on four MedMNIST tasks (PathMNIST, BloodMNIST, DermaMNIST, and OrganMNIST3D) using the official train/validation/test splits. We then compare (i) training from scratch, (ii) full fine-tuning from the multi-task checkpoint, and (iii) PEFT (LoRA or adapters), which updates only a small fraction of the parameters. Under a fixed compute budget (200 pretraining steps; 120 fine-tuning steps for 2D tasks; 50 steps for the 3D task), multi-task pretraining improved performance on PathMNIST (test accuracy 0.568 → 0.634; macro AUROC 0.886 → 0.914) and preserved most of the gain under PEFT (LoRA AUROC 0.909; Adapter AUROC 0.913) while training only 4,041–5,225 parameters versus 160,105 for full fine-tuning. For DermaMNIST, pretraining increased macro AUROC from 0.746 (Scratch, weighted) to 0.756 (Pretrain+Full), with similar AUROC under LoRA (0.760) and Adapter (0.763). In contrast, BloodMNIST and OrganMNIST3D showed mixed behavior, including cases where Scratch outperformed pretrained variants, indicating that transfer in this compact shared encoder is task-dependent and budget-sensitive. Calibration results were similarly non-monotonic: methods with better AUROC did not always achieve lower expected calibration error (ECE).
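To make the calibration metric concrete, the following is a minimal sketch of top-label ECE. The abstract does not state the binning scheme, so the 10 equal-width confidence bins and the function name `expected_calibration_error` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE using equal-width confidence bins (assumed scheme)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)        # confidence of the predicted class
    predictions = probs.argmax(axis=1)     # predicted class index
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - mean confidence| in the bin, weighted by bin mass
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Toy inputs (hypothetical, not data from the paper).
toy_probs = [[0.95, 0.05], [0.85, 0.15], [0.65, 0.35], [0.25, 0.75]]
toy_labels = [0, 1, 0, 1]
print(expected_calibration_error(toy_probs, toy_labels))
```

Because ECE measures the gap between confidence and accuracy while AUROC depends only on how scores are ranked, a model can improve on one without improving on the other, which is consistent with the non-monotonic pattern reported above.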
Overall, our results show that a small cross-modal multi-task model can serve as a practical MedMNIST-scale transfer baseline and that LoRA/adapters offer substantial parameter savings when task alignment is favorable. STMedFM should therefore be viewed as a lightweight supervised multi-task backbone on benchmark-scale tasks rather than a broadly general medical foundation model.
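The parameter savings can be illustrated with a minimal NumPy sketch of a LoRA-style layer: the dense weight stays frozen and only two low-rank factors are trainable. The class name `LoRALinear`, the 64-unit layer width, the rank, and the scaling are hypothetical choices for illustration; they are not STMedFM's actual configuration and do not reproduce its reported parameter counts. The final loop uses only the counts quoted in the abstract.

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update.

    Forward pass: y = x W^T + (alpha / r) * x A^T B^T
    """

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(0.0, 0.02, size=(d_out, d_in))  # frozen
        self.lora_a = rng.normal(0.0, 0.02, size=(rank, d_in))   # trainable
        self.lora_b = np.zeros((d_out, rank))                    # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return base + self.scale * update

    def trainable_params(self):
        return self.lora_a.size + self.lora_b.size

    def frozen_params(self):
        return self.weight.size

layer = LoRALinear(d_in=64, d_out=64, rank=4)
x = np.ones((2, 64))
# With B initialized to zero, the adapted layer starts out identical to the frozen one.
assert np.allclose(layer(x), x @ layer.weight.T)
print(layer.trainable_params(), layer.frozen_params())  # 512 4096

# Trainable fraction implied by the counts quoted in the abstract.
full = 160_105
for peft in (4_041, 5_225):
    print(f"{peft} / {full} = {peft / full:.2%}")
```

With the abstract's figures, the PEFT variants train roughly 2.5% to 3.3% of the parameters updated by full fine-tuning.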
