Most current AI models detect diseases by looking at an image, like an X-ray or a skin lesion photo, and making a prediction.

However, in the real world, doctors don't just look at a patient. They ask questions like "Does it itch?" or "How long have you had it?"

Patient details are crucial inputs that purely visual AI models ignore.

So a better approach would be to create a dataset containing images paired with patient-doctor dialogues and fine-tune a VLM on it.

A VLM (vision-language model) is a model that accepts both images and text as input and produces text as output.
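
To make that concrete, here is a minimal sketch of a single image + text → text call, using LLaVA through the Hugging Face transformers library (the model id and prompt template below are just one illustrative choice, not something from the paper):

```python
# Minimal sketch: one image + text prompt in, text out, via an open-source VLM.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative choice of VLM
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("lesion.jpg")  # any skin-lesion photo
prompt = "USER: <image>\nDescribe the visual features of this skin lesion.\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```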

First, we need a dataset. Instead of paying thousands of doctors and patients to manually record their conversations, which is both expensive and raises serious privacy problems, we can generate the data synthetically.

The authors of this paper created two agents, a doctor agent and a patient agent, both VLMs, to simulate the conversations.

Agent 1 is the doctor VLM. It looks at the image and asks follow-up questions. Its goal is to gather enough information to make a proper diagnosis.

Agent 2 is the patient VLM. It knows the ground-truth disease and the symptoms associated with it. Its goal is to answer the doctor agent's questions based on that symptom profile.

Let's see the system design for creating this dataset.

Both agents get a medical image of a skin lesion as input. The system collects the symptom profile of the ground-truth disease and gives this data as context to the patient VLM.

The conversation then happens like this:

  • DocVLM asks: "Is the lesion painful?"
  • PatientVLM checks its profile and answers: "Yes, it hurts when touched."
  • DocVLM asks: "Has it grown recently?"
  • PatientVLM answers: "Yes, rapidly."

So the outcome is a conversation history paired with that image.
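
Concretely, one generated record could look something like this (the field names and schema are my own illustration, not the paper's exact format):

```python
# One synthetic training record: an image paired with the simulated dialogue.
record = {
    "image_path": "images/lesion_0412.jpg",
    "diagnosis": "melanoma",  # ground truth, known only to the patient agent
    "symptom_profile": ["asymmetry", "irregular borders", "recent growth", "occasional bleeding"],
    "dialogue": [
        {"role": "doctor",  "text": "Is the lesion painful?"},
        {"role": "patient", "text": "Yes, it hurts when touched."},
        {"role": "doctor",  "text": "Has it grown recently?"},
        {"role": "patient", "text": "Yes, rapidly."},
    ],
}
```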

So now we have a massive dataset of medical images + dialogues.

They then take this doctor VLM and fine-tune it on the new dataset.

The model learns that visual features (what it sees) + dialogue context (what the patient says) = Better Diagnosis.
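
One simple way to turn a record like the one above into a supervised fine-tuning example is to use the image plus the dialogue as the input and the ground-truth diagnosis as the target. A rough sketch (the chat formatting here is an assumption; adapt it to whatever trainer you use):

```python
# Turn one synthetic record into a supervised fine-tuning example:
# input = image + dialogue so far, target = the ground-truth diagnosis.
def to_training_example(record):
    dialogue_text = "\n".join(
        f"{turn['role'].capitalize()}: {turn['text']}" for turn in record["dialogue"]
    )
    prompt = (
        "You are a dermatologist. Use the image and the conversation below "
        f"to give a diagnosis.\n\n{dialogue_text}\n\nDiagnosis:"
    )
    return {
        "image_path": record["image_path"],
        "prompt": prompt,
        "target": record["diagnosis"],
    }
```

Depending on the setup, the doctor's intermediate questions can also be used as targets, so the fine-tuned model learns to ask follow-ups instead of jumping straight to an answer.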

When the system is deployed for a real human patient:

  1. The human uploads an image.
  2. The DocVLM (now trained) analyzes the image.
  3. Instead of guessing immediately, it asks the human: "I see some redness. Does it feel warm to the touch?"
  4. The human answers.
  5. The AI combines the image and the answer to give a final diagnosis.
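
A minimal sketch of that deployment loop, assuming a hypothetical `ask_doctor_vlm` helper that wraps the fine-tuned model (it takes the image and the dialogue so far and either asks the next question or commits to a diagnosis):

```python
# Sketch of the deployed loop: the fine-tuned doctor VLM asks a few questions,
# the human answers, then the model combines image + answers into a diagnosis.
# `ask_doctor_vlm` is a hypothetical wrapper, not a real library call.
def consult(image_path, max_questions=3):
    dialogue = []
    for _ in range(max_questions):
        question = ask_doctor_vlm(image_path, dialogue, mode="ask")
        answer = input(f"Doctor AI: {question}\nYou: ")
        dialogue.append({"role": "doctor", "text": question})
        dialogue.append({"role": "patient", "text": answer})
    return ask_doctor_vlm(image_path, dialogue, mode="diagnose")
```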

Let's design this in real life:

1) First, we need to collect a dataset containing images of various diseases plus their disease names. We can use a publicly available dataset such as SkinCon.

2) An image alone doesn't tell you if a rash is "itchy" or "painful."

- So we need to map each disease to the symptoms associated with it. We can have an AI look up textbooks or medical databases to map every disease to its common symptoms.

So if the image is "Melanoma," the cheat sheet includes: asymmetry, irregular borders, recent growth, bleeding.
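
This "cheat sheet" can be as simple as a lookup table; the entries below are illustrative examples only, not medical advice:

```python
# Disease -> common symptoms lookup. In practice this could be built by an
# LLM querying textbooks or medical databases; these entries are illustrative.
SYMPTOM_PROFILES = {
    "melanoma": ["asymmetry", "irregular borders", "recent growth", "bleeding"],
    "psoriasis": ["itching", "scaly plaques", "flares in dry weather"],
    "cellulitis": ["pain", "warmth to the touch", "spreading redness", "fever"],
}

def get_symptom_profile(diagnosis: str) -> list[str]:
    # Return the symptom list for a diagnosis, or an empty list if unknown.
    return SYMPTOM_PROFILES.get(diagnosis.lower(), [])
```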

Now that we have curated the initial dataset, we need to make two agents.

We can use a VLM like GPT-4V or an open-source model like LLaVA (or LLaVA-Med, a variant tuned on medical data) and give them specific system instructions to act as the agents.

So for the doctor agent, we give it the image along with a system prompt like this: "You are a dermatologist. Look at the image. Ask a question to help distinguish between possible diseases. Focus on visual details or physical sensations (like pain or itch)."

To create the patient agent, we give it the image plus this system prompt: "You are a patient. You are experiencing the symptoms listed here: [List]. Do NOT reveal the name of your disease. Answer the doctor's questions truthfully based on these symptoms and the image provided." We insert the symptoms into the prompt itself.
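
Putting both prompts into code, it might look like this (the templating below is just one way to inject the symptom list):

```python
# System prompts for the two simulated agents, built from the text above.
DOCTOR_SYSTEM_PROMPT = (
    "You are a dermatologist. Look at the image. Ask a question to help "
    "distinguish between possible diseases. Focus on visual details or "
    "physical sensations (like pain or itch)."
)

def patient_system_prompt(symptoms: list[str]) -> str:
    # Fill the patient's symptom profile directly into the prompt.
    return (
        "You are a patient. You are experiencing the symptoms listed here: "
        f"{', '.join(symptoms)}. Do NOT reveal the name of your disease. "
        "Answer the doctor's questions truthfully based on these symptoms "
        "and the image provided."
    )
```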

Then we let the conversation happen, collect the data, and fine-tune the doctor agent on it.
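
Here is a sketch of the full data-generation loop, reusing the symptom lookup and prompts from above; `chat_vlm` is a hypothetical helper that sends the image, a system prompt, and the conversation so far to whichever VLM you use and returns its next message:

```python
# Run a short doctor/patient exchange for each labelled image and save the
# resulting dialogues as JSONL. `chat_vlm` is a hypothetical wrapper.
import json

def simulate_case(image_path, diagnosis, num_turns=3):
    symptoms = get_symptom_profile(diagnosis)
    dialogue = []
    for _ in range(num_turns):
        question = chat_vlm(image_path, DOCTOR_SYSTEM_PROMPT, dialogue)
        dialogue.append({"role": "doctor", "text": question})
        answer = chat_vlm(image_path, patient_system_prompt(symptoms), dialogue)
        dialogue.append({"role": "patient", "text": answer})
    return {"image_path": image_path, "diagnosis": diagnosis, "dialogue": dialogue}

def build_dataset(labelled_images, out_path="synthetic_dialogues.jsonl"):
    # labelled_images: iterable of (image_path, diagnosis) pairs, e.g. from SkinCon.
    with open(out_path, "w") as f:
        for image_path, diagnosis in labelled_images:
            f.write(json.dumps(simulate_case(image_path, diagnosis)) + "\n")
```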

Why this approach works:

  • Data Efficiency: Real medical conversations are hard to get (HIPAA laws, privacy).
  • Mimics Reality: It forces the AI to reason like a human doctor (forming and confirming hypotheses) rather than acting as a pure image classifier.
  • Better Accuracy: The paper shows that this "dialogue-supervised" model significantly outperforms models that only look at images.

Link to paper: https://www.arxiv.org/abs/2601.10945


Note: I co-wrote this with ChatGPT and Gemini 3 Pro.