Publication

Finetuning medical vision‑language models for interpretable and fine‑grained disease classification

Sofija Engelson; Jan Ehrhardt; Heinz Handels

In: Medical Imaging 2026: Computer-Aided Diagnosis. SPIE Medical Imaging (SPIE-2026), February 15-19, Vancouver, British Columbia, Canada, SPIE, 2/2026.

Abstract

Medical vision-language models (VLMs) hold great promise for interpreting clinical data, yet struggle with fine-grained concept recognition. In this work, we propose a finetuning approach for medical VLMs that improves the discrimination of classes and class-descriptive concepts, thereby increasing interpretability by refining the reasoning behind each class prediction. To this end, we use adapter layers to finetune a pre-trained VLM, introduce prior knowledge about class-concept relations, and leverage supervised contrastive learning to learn inter-subject correspondences. We evaluate our method on two use cases and two vision-language models: melanoma classification with MONET and multi-label lung disease classification with MedCLIP. Our results demonstrate that finetuning with as little as 5% of the training data outperforms zero-shot prediction with the original VLM, especially for fine-grained class definitions. Moreover, our approach achieves higher AUC and F1-score than linear probing and prompt tuning and is more robust to class imbalance. This highlights the strength of leveraging prior knowledge in medical VLM finetuning for maintaining the generalization capabilities of foundation models while delivering competitive results on specialized tasks.
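
The abstract does not spell out the loss used, so the following is only a minimal sketch of the generic supervised contrastive objective (embeddings of samples that share a label are pulled together, all others are pushed apart), not the authors' implementation. The function names, the temperature value, and the single-label setting are assumptions made for illustration; the multi-label lung disease case would need a different definition of positives.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length (assumes vec is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative sketch).

    embeddings: list of equal-length float vectors, one per sample
    labels:     list of single class labels, one per sample (assumption)
    Anchors without any same-label partner in the batch are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity of the anchor to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability assigned to the positives.
        loss_i = -sum(sims[p] - log_denom for p in positives) / len(positives)
        per_anchor.append(loss_i)
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


# Toy batch: two tight clusters in 2D.
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

On this toy batch the loss is small when the labels agree with the clusters and large when they cut across them, which is the behavior a finetuning objective of this kind relies on.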