Technical case study
Calibrated Dog-Breed Vision
A Tsinghua100 subset run with DINOv2-small prototypes, temperature scaling, RAPS conformal prediction sets, and explicit weak-class reporting.
Abstract
Accuracy alone is not the result.
The core experiment asks whether a strong pretrained vision backbone can produce calibrated dog-breed prediction sets that stay useful as the label space grows. The strongest aggregate result is global RAPS, with parameters selected without test data, at 0.968 coverage and a mean set size of 2.59 on a 100-breed Tsinghua Dogs subset. The main unresolved issue is class-conditional coverage: the weakest breeds fall well below the headline figure, with great_dane at 0.80.
- Dataset: Tsinghua Dogs subset, 100 breeds, 8,000 images, documented split protocol.
- Backbone: frozen DINOv2-small embeddings with nearest-prototype classification.
- Calibration: temperature scaling selected on calibration data, then evaluated on held-out test data.
- Conformal layer: RAPS parameters selected without tuning on test data, with coverage disaggregated per class.
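The nearest-prototype step in the list above can be sketched as follows. This is a minimal illustration, not the run's actual code: it assumes embeddings have already been extracted from the frozen backbone as a NumPy array, that each prototype is the mean of L2-normalized class embeddings, and that cosine similarity serves as the logit. The function names `fit_prototypes` and `prototype_logits` are hypothetical.

```python
import numpy as np

def fit_prototypes(embeddings, labels, num_classes):
    """Return one unit-norm prototype per class (mean of normalized embeddings).

    Assumes every class has at least one training embedding.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    protos = np.zeros((num_classes, emb.shape[1]))
    for c in range(num_classes):
        protos[c] = emb[labels == c].mean(axis=0)
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def prototype_logits(embeddings, prototypes):
    """Cosine similarity to each prototype, used as unnormalized logits."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return emb @ prototypes.T
```

Cosine similarities live in [-1, 1], so the raw softmax over them is very flat. That is one reason a learned temperature matters for this kind of classifier.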
Research preview
A 96-second narrated summary of the Tsinghua100 result, the RAPS set-size reduction, and the weak-class failure mode.
Theory sketch
The math is simple enough to audit, but strict enough to constrain the claim.
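Conformal prediction wraps calibrated model scores with a finite-sample coverage guarantee. The promise is not that every prediction is correct. The promise is that, under exchangeability and a fixed procedure, the true label appears in the prediction set at no less than the chosen rate, averaged over all test points. That guarantee is marginal: on its own it says nothing about coverage within any single breed.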
Temperature scaling
Choose one scalar temperature on calibration data, then evaluate once on test data.
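A minimal sketch of that selection step, assuming calibration logits and integer labels are available as NumPy arrays. The grid search and its bounds are illustrative assumptions; the original run may have used a gradient-based optimizer instead. `select_temperature` is a hypothetical name.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    logp = log_softmax(logits / temperature)
    return -logp[np.arange(len(labels)), labels].mean()

def select_temperature(cal_logits, cal_labels, grid=None):
    """Pick the temperature that minimizes calibration NLL (grid is an assumption)."""
    if grid is None:
        grid = np.geomspace(0.01, 10.0, 200)
    losses = [nll(cal_logits, cal_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])
```

The returned scalar is then frozen and applied to test logits exactly once, so the test split never influences the choice.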
RAPS score
Sort class probabilities, keep the cumulative mass term, and add a rank penalty after the regularization cutoff.
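The score described above can be sketched as below. This is the deterministic variant (no randomized tie-breaking term), and the defaults for `lam` and `k_reg` are placeholders, not the values selected in this run.

```python
import numpy as np

def raps_scores(probs, lam=0.01, k_reg=5):
    """Deterministic RAPS score for every (sample, class) pair.

    score(x, y) = cumulative probability mass down to and including y's rank
                  + lam * max(rank(y) - k_reg, 0)
    """
    n, k = probs.shape
    order = np.argsort(-probs, axis=1)                   # most probable class first
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum_mass = np.cumsum(sorted_p, axis=1)               # mass up to each rank
    ranks = np.arange(1, k + 1)                          # 1-indexed ranks
    penalty = lam * np.maximum(ranks - k_reg, 0)         # zero up to the cutoff
    sorted_scores = cum_mass + penalty
    scores = np.empty_like(sorted_scores)
    np.put_along_axis(scores, order, sorted_scores, axis=1)  # back to class order
    return scores
```

The rank penalty is what discourages long tails of low-probability breeds from entering a set, which is where the set-size reduction comes from.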
Prediction set
Take the finite-sample-corrected quantile of the calibration scores as the threshold, then include every label whose score is at or below it.
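A minimal sketch of the threshold and set construction, assuming a score matrix with one row per image and one column per class, where lower scores mean more plausible labels. The `alpha=0.05` default is an illustrative assumption; the source does not state the target miscoverage rate.

```python
import numpy as np

def conformal_threshold(cal_scores, cal_labels, alpha=0.05):
    """Finite-sample-corrected quantile of the true-label calibration scores."""
    n = len(cal_labels)
    true_scores = np.sort(cal_scores[np.arange(n), cal_labels])
    k = int(np.ceil((n + 1) * (1 - alpha)))   # rank of the conformal quantile
    if k > n:
        return float("inf")                   # too few points: include everything
    return float(true_scores[k - 1])

def prediction_sets(test_scores, threshold):
    """Boolean mask: entry [i, c] is True when class c is in the set for image i."""
    return test_scores <= threshold
```

Mean set size is then `prediction_sets(...).sum(axis=1).mean()`, and aggregate coverage is the fraction of test rows whose true-label entry is True.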
Tsinghua100 dense result
RAPS improves set tightness, but class-level coverage remains uneven.
The table separates aggregate coverage, mean set size, and worst-class behavior. This is the key framing: a method can look strong globally and still fail a reviewer’s class-conditional check.
| Method | Coverage | Mean set size | Worst class (coverage) | Interpretation |
|---|---|---|---|---|
| Global RAPS | 0.968 | 2.5885 | great_dane, 0.80 | Best aggregate result, still not uniform across classes. |
| Mondrian RAPS | 0.9165 | 3.2055 | bluetick, 0.60 | Class-conditional evaluation exposes low-coverage breeds. |
| Family-pooled RAPS | 0.9250 | 2.0835 | english_setter, 0.60 | Promising tightness, but weak classes still need targeted analysis. |
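For readers unfamiliar with the Mondrian row, the usual idea is to replace the single global threshold with one threshold per class. The sketch below illustrates that standard construction under assumptions: it takes a precomputed score matrix, uses a placeholder `alpha`, and is not claimed to match the exact procedure behind the numbers in the table.

```python
import numpy as np

def mondrian_thresholds(cal_scores, cal_labels, num_classes, alpha=0.05):
    """One conformal threshold per class, from that class's calibration scores."""
    thresholds = np.full(num_classes, np.inf)   # default: always include the class
    for c in range(num_classes):
        s = np.sort(cal_scores[cal_labels == c, c])
        n_c = len(s)
        k = int(np.ceil((n_c + 1) * (1 - alpha)))
        if n_c > 0 and k <= n_c:
            thresholds[c] = s[k - 1]
    return thresholds

def mondrian_sets(test_scores, thresholds):
    """Class c enters the set when its score is at or below its own threshold."""
    return test_scores <= thresholds[None, :]
```

Per-class thresholds depend on per-class calibration support. In this sketch, at alpha = 0.05 a class needs at least 19 calibration examples before its threshold becomes finite, so small classes trade tightness for validity.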
Figures
The charts make the limitation visible.
These public visuals use synthetic dog imagery and aggregate research figures. Dataset-derived photos remain private until license review.
What the figures do not claim
They do not establish a production guarantee, a final public benchmark, or breed-level fairness. They document a research result and the next failure mode.
Reliability bins
| Confidence bin | Count | Mean confidence | Accuracy | Gap |
|---|---|---|---|---|
| 0.0-0.1 | 1 | 0.058 | 0.000 | 0.058 |
| 0.1-0.2 | 3 | 0.190 | 0.333 | 0.143 |
| 0.2-0.3 | 24 | 0.264 | 0.208 | 0.056 |
| 0.3-0.4 | 72 | 0.351 | 0.375 | 0.024 |
| 0.4-0.5 | 156 | 0.451 | 0.519 | 0.068 |
| 0.5-0.6 | 162 | 0.549 | 0.685 | 0.137 |
| 0.6-0.7 | 168 | 0.650 | 0.738 | 0.088 |
| 0.7-0.8 | 197 | 0.751 | 0.868 | 0.117 |
| 0.8-0.9 | 299 | 0.853 | 0.920 | 0.066 |
| 0.9-1.0 | 918 | 0.968 | 0.976 | 0.008 |
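The bins above can be collapsed into a single expected calibration error (ECE) figure by weighting each gap by its bin count. The sketch below recomputes it from the rounded table entries, so the result is only as precise as the three-decimal values shown.

```python
# (count, mean_confidence, accuracy) copied from the reliability table above
bins = [
    (1, 0.058, 0.000), (3, 0.190, 0.333), (24, 0.264, 0.208),
    (72, 0.351, 0.375), (156, 0.451, 0.519), (162, 0.549, 0.685),
    (168, 0.650, 0.738), (197, 0.751, 0.868), (299, 0.853, 0.920),
    (918, 0.968, 0.976),
]

total = sum(count for count, _, _ in bins)

# ECE: count-weighted mean of |confidence - accuracy| across bins
ece = sum(count * abs(conf - acc) for count, conf, acc in bins) / total

# Bins where accuracy exceeds confidence, i.e. the model is underconfident
underconfident = sum(1 for _, conf, acc in bins if acc > conf)

print(f"samples={total} ece={ece:.4f} underconfident_bins={underconfident}")
```

From these rounded entries the weighted gap comes out near 0.05, and accuracy exceeds mean confidence in most bins, which suggests the residual miscalibration leans underconfident rather than overconfident.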
Lowest global RAPS class coverage
| Breed | Coverage | Mean set size | Misses | Support |
|---|---|---|---|---|
| great_dane | 0.80 | 3.45 | 4 | 20 |
| lhasa | 0.85 | 2.35 | 3 | 20 |
| tibetan_mastiff | 0.85 | 2.50 | 3 | 20 |
| english_setter | 0.85 | 2.55 | 3 | 20 |
| australian_shepherd | 0.85 | 2.95 | 3 | 20 |
| norwich_terrier | 0.85 | 3.00 | 3 | 20 |
| soft_coated_wheaten_terrier | 0.85 | 3.70 | 3 | 20 |
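A table like the one above can be produced from a boolean prediction-set mask and the true labels. The sketch below also attaches a Wilson score interval to each coverage estimate. The interval is an addition for illustration, not part of the reported analysis, and is included because a support of 20 makes per-class coverage move in steps of 0.05. Function names are hypothetical.

```python
import math
import numpy as np

def per_class_coverage(set_mask, labels, num_classes):
    """Coverage, misses, support, and mean set size per true class, worst first."""
    rows = []
    for c in range(num_classes):
        idx = labels == c
        support = int(idx.sum())
        if support == 0:
            continue
        hits = int(set_mask[idx, c].sum())   # true label inside the set
        rows.append({
            "class": c,
            "coverage": hits / support,
            "mean_set_size": float(set_mask[idx].sum(axis=1).mean()),
            "misses": support - hits,
            "support": support,
        })
    return sorted(rows, key=lambda r: r["coverage"])

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

For example, `wilson_interval(16, 20)` corresponds to the great_dane row (4 misses out of 20) and spans roughly 0.58 to 0.92, so a single class at this support cannot by itself distinguish a real coverage deficit from sampling noise.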
Weak-class slice
The next question is targeted coverage repair.
The supplementary run targets breeds that reviewers would inspect first: lhasa, tibetan mastiff, great dane, and the worst Mondrian class. The goal is not higher headline accuracy. The goal is tighter sets without hiding class-specific failures.
| Breed | Global RAPS coverage | Mondrian RAPS coverage | Family-pooled RAPS coverage | Top-1 accuracy |
|---|---|---|---|---|
| lhasa | 0.85 | 0.85 | 0.75 | 0.75 |
| tibetan_mastiff | 0.85 | 0.90 | 0.85 | 0.70 |
| great_dane | 0.80 | 0.70 | 0.75 | 0.55 |
Paper path
Minimum paper bar: replication plus class-conditional repair.
The current result is strong enough for a technical report and portfolio case study. A workshop draft should wait for the next slice: weak-class analysis, structured pooling, and a clean venue fit.