How I architected a multi-dataset AI diagnostic pipeline from scratch: the technical decisions, the failures, the breakthroughs, and the final results.
The Challenge: Medical AI Is Hard
When I first took on the DermaFusion project, the brief sounded deceptively simple: "Build an AI that can diagnose skin diseases from smartphone photos."
What followed was one of the most technically demanding projects of my career — and also the most rewarding.
Here is the complete technical story.
The Data Problem (And Why Most Medical AI Fails Here)
The first challenge: data heterogeneity. I was working with 15 different medical imaging datasets from institutions across different countries. Each had:
- Different image resolutions (from 224×224 to 1024×1024)
- Different label taxonomies (some called it "melanoma", others "malignant melanocytic lesion")
- Different lighting and camera conditions
- Massive class imbalance (some diseases had 50,000 samples, others had 80)
Most medical AI projects fail at this stage because they either ignore the heterogeneity (leading to biased models) or try to normalize everything into one format (losing critical domain-specific information).
My solution: Domain-Stratified Architecture
Instead of merging all datasets into one training pool, I maintained domain-specific batch sampling — ensuring each dataset contributed proportionally to each training batch. This prevented dominant datasets from overwhelming the model's representations.
The Model Architecture: Multi-Branch Fusion
A single EfficientNet or ResNet was not sufficient. I designed a 3-branch heterogeneous architecture:
Branch 1: High-Frequency Detail Extractor (skin texture patterns)
Branch 2: Global Semantic Extractor (lesion shape and borders)
Branch 3: Color Distribution Analyzer (pigmentation patterns)
↓
Fusion Layer (learned attention weights)
↓
Calibrated Confidence Output
The key innovation was the Fusion Layer — rather than simply concatenating branch outputs, I trained an attention mechanism that learned to weight each branch's contribution depending on the input image characteristics.
The Calibration Problem
A model that says "I am 95% confident this is benign" but is wrong 30% of the time is dangerous in a medical context. Confidence calibration is non-negotiable.
I implemented FST-Stratified Confidence Calibration (FSCC):
- After training, I measured the model's confidence vs. actual accuracy on a held-out calibration set
- I applied Platt Scaling per domain-frequency stratum to align predicted confidence with real accuracy
- The result: when the model says 95% confident, it is correct 94.8% of the time
Final Results
| Metric | Result |
|---|---|
| Overall Accuracy | 99.8% |
| Datasets Integrated | 15 |
| Disease Classes | 12 |
| Inference Time | 1.2 seconds |
| Model Parameters | 47M |
| False Negative Rate | 0.04% |
Lessons Learned
- Data quality beats data quantity. 5,000 well-labeled images outperform 50,000 noisy ones.
- Calibration is not optional in medical AI. A miscalibrated model is worse than no model.
- Domain knowledge matters. The architectural decisions that made DermaFusion work came from understanding dermatology, not just machine learning.
Interested in building a custom AI model for your domain? Let's talk →
