Technical case study

Calibrated Dog-Breed Vision

A Tsinghua100 subset run with DINOv2-small prototypes, temperature scaling, RAPS conformal prediction sets, and explicit weak-class reporting.

0.968 Global RAPS coverage.
2.59 Mean global RAPS set size.
0.9165 Label-conditional (Mondrian RAPS) coverage.
0.60 Worst observed class coverage (Mondrian and family-pooled RAPS).

Abstract

Accuracy alone is not the result.

The core experiment asks whether a strong pretrained vision backbone can produce calibrated dog-breed prediction sets that stay useful as the label space grows. The strongest aggregate result is global RAPS, with parameters selected without touching test data: 0.968 coverage at a mean set size of 2.59 on a 100-breed Tsinghua Dogs subset. The main unresolved issue is class-conditional coverage: the weakest breeds fall well below the headline number, down to 0.60 in the label-conditional runs.

  • Dataset: Tsinghua Dogs subset, 100 breeds, 8,000 images, documented split protocol.
  • Backbone: frozen DINOv2-small embeddings with nearest-prototype classification.
  • Calibration: temperature scaling selected on calibration data, then evaluated on held-out test data.
  • Conformal layer: no-test-tuning RAPS parameter selection with per-class disaggregation.

Research preview

A 96-second narrated summary of the Tsinghua100 result, the RAPS set-size reduction, and the weak-class failure mode.

Theory sketch

The math is simple enough to audit, but strict enough to constrain the claim.

Conformal prediction wraps calibrated model scores with a finite-sample coverage target. The promise is not that every prediction is correct. The promise is that, under exchangeability and a fixed procedure, the true label appears in the prediction set at no less than the chosen rate, averaged over examples. That guarantee is marginal: on its own it says nothing about coverage within any single breed.

Temperature scaling

Choose one scalar temperature on calibration data, then evaluate once on test data.

p_T(y | x) = softmax(z / T)_y, where z is the vector of class scores for x and T > 0
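The fitting step can be sketched as follows. This is a minimal illustration, assuming calibration scores are available as a NumPy array; the grid search, the function names, and the synthetic data are assumptions for the example, since the source does not say which optimizer was used.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    logp = log_softmax(logits / T)
    return float(-logp[np.arange(len(labels)), labels].mean())

def fit_temperature(cal_logits, cal_labels, grid=None):
    """Pick the single scalar T that minimizes calibration NLL.
    A 1-D grid search is an assumption made for this sketch."""
    if grid is None:
        grid = np.geomspace(0.05, 20.0, 400)
    losses = [nll(cal_logits, cal_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic check: labels are drawn from softmax(true_logits) via the Gumbel
# trick, then the logits are inflated 3x to mimic an overconfident model.
rng = np.random.default_rng(0)
n, K = 2000, 10
true_logits = rng.normal(size=(n, K))
labels = np.argmax(true_logits + rng.gumbel(size=(n, K)), axis=1)
cal_logits = 3.0 * true_logits

T_hat = fit_temperature(cal_logits, labels)   # should land near 3
```

Because T is a single scalar, it changes confidence but never the ranking of classes, so top-1 accuracy is unaffected by this step.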

RAPS score

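Sort class probabilities in descending order, accumulate probability mass down to the candidate label (the APS term), and add a rank penalty once the label's rank exceeds the regularization cutoff k. Here rank(y) is the 1-based position of y in the sorted list and lambda is the penalty weight.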

s(x,y) = APS(x,y) + lambda · max(rank(y) - k, 0)
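A small sketch of the score for one example, assuming the non-randomized variant in which the APS term includes the candidate label's own probability. Whether the reported run used the randomized tie-breaking form is not stated in the source, and the names `raps_score`, `lam`, and `k_reg` are illustrative.

```python
import numpy as np

def raps_score(probs, label, lam, k_reg):
    """Non-randomized RAPS score s(x, y) for a single example.

    probs : 1-D array of class probabilities for x
    label : index of the label being scored
    lam   : rank-penalty weight (lambda in the formula)
    k_reg : regularization cutoff (k in the formula)
    """
    order = np.argsort(-probs, kind="stable")          # most to least likely
    rank = int(np.nonzero(order == label)[0][0]) + 1   # 1-based rank of label
    aps = probs[order[:rank]].sum()                    # mass down to and including label
    return float(aps + lam * max(rank - k_reg, 0))

probs = np.array([0.5, 0.3, 0.15, 0.05])
s_top = raps_score(probs, 0, lam=0.1, k_reg=2)     # rank 1: 0.5, no penalty
s_third = raps_score(probs, 2, lam=0.1, k_reg=2)   # rank 3: 0.95 + 0.1 * 1 = 1.05
```

The penalty makes low-ranked labels expensive to include, which is what pushes set sizes down compared with plain APS.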

Prediction set

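Take q_hat as the finite-sample-corrected quantile of the calibration scores, at level ceil((n + 1)(1 - alpha)) / n for n calibration examples and target coverage 1 - alpha, and include every label whose score is at or below that threshold.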

C(x) = { y : s(x,y) <= q_hat }
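The thresholding step can be sketched independently of how the scores were produced. This example assumes a precomputed vector of calibration scores for the true labels and a vector of scores for every candidate label of one test example; the function names and toy numbers are illustrative.

```python
import math
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))   # 1-based order statistic to pick
    if k > n:
        return float("inf")                # too few points: every label is included
    return float(np.sort(cal_scores)[k - 1])

def prediction_set(scores_all_labels, q_hat):
    """Indices y with s(x, y) <= q_hat."""
    return np.nonzero(scores_all_labels <= q_hat)[0].tolist()

cal_scores = np.array([0.2, 0.9, 0.4, 0.7, 0.5, 0.3, 0.8, 0.6, 0.1])
q_hat = conformal_quantile(cal_scores, alpha=0.2)   # n = 9, 8th smallest score = 0.8

test_scores = np.array([0.5, 0.85, 1.05, 0.8])
pred = prediction_set(test_scores, q_hat)           # [0, 3]
```

The `inf` fallback matters for small calibration sets: when (n + 1)(1 - alpha) exceeds n, no finite threshold can support the coverage claim.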

Tsinghua100 dense result

RAPS improves set tightness, but class-level coverage remains uneven.

The table separates aggregate coverage, mean set size, and worst-class behavior. This is the key framing: a method can look strong globally and still fail a reviewer’s class-conditional check.

Method | Coverage | Mean set size | Worst class (coverage) | Interpretation
Global RAPS | 0.968 | 2.5885 | great_dane (0.80) | Best aggregate result, still not uniform across classes.
Mondrian RAPS | 0.9165 | 3.2055 | bluetick (0.60) | Class-conditional evaluation exposes low-coverage breeds.
Family-pooled RAPS | 0.9250 | 2.0835 | english_setter (0.60) | Promising tightness, but weak classes still need targeted analysis.
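For readers unfamiliar with the Mondrian row, label-conditional calibration replaces the single threshold with one threshold per class, each computed only from calibration examples of that class. The sketch below is an assumption about the general technique, not the run's actual code; the source does not state how many calibration images each breed had, and the function names and toy scores are illustrative.

```python
import math
import numpy as np

def class_quantiles(cal_scores, cal_labels, n_classes, alpha):
    """One conformal threshold per class, from that class's calibration scores."""
    q = np.full(n_classes, np.inf)   # inf: always include the label if data is too thin
    for c in range(n_classes):
        s = np.sort(cal_scores[cal_labels == c])
        k = math.ceil((len(s) + 1) * (1 - alpha))   # 1-based order statistic
        if 1 <= k <= len(s):
            q[c] = s[k - 1]
    return q

def mondrian_set(scores_all_labels, q):
    """Include label y when its score is at or below its own class threshold."""
    return np.nonzero(scores_all_labels <= q)[0].tolist()

# Toy calibration data: classes 0 and 1 have four examples, class 2 only two.
cal_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.5, 0.55])
cal_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
q = class_quantiles(cal_scores, cal_labels, n_classes=3, alpha=0.2)

pred = mondrian_set(np.array([0.45, 0.85, 2.0]), q)   # [1, 2]
```

With few calibration examples per class, each threshold is a noisy order statistic, which is one plausible reason per-class thresholds can produce larger sets and uneven realized coverage.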

Figures

The charts make the limitation visible.

These public visuals use synthetic dog imagery and aggregate research figures. Dataset-derived photos remain private until license review.

Figure: Reliability diagram for the Tsinghua100 dense run, after temperature scaling. The reported ECE is 0.051.
Figure: Per-class coverage for Tsinghua100. Weak breeds drive the next research slice.
Figure: Coverage versus set size across RAPS settings. RAPS reduces set size relative to the conformal settings that apply no rank penalty.

What the figures do not claim

They do not establish a production guarantee, a final public benchmark, or breed-level fairness. They document a research result and the next failure mode.

Reliability bins

Confidence bin | Count | Mean confidence | Accuracy | Gap
0.0-0.1 | 1 | 0.058 | 0.000 | 0.058
0.1-0.2 | 3 | 0.190 | 0.333 | 0.143
0.2-0.3 | 24 | 0.264 | 0.208 | 0.056
0.3-0.4 | 72 | 0.351 | 0.375 | 0.024
0.4-0.5 | 156 | 0.451 | 0.519 | 0.068
0.5-0.6 | 162 | 0.549 | 0.685 | 0.137
0.6-0.7 | 168 | 0.650 | 0.738 | 0.088
0.7-0.8 | 197 | 0.751 | 0.868 | 0.117
0.8-0.9 | 299 | 0.853 | 0.920 | 0.066
0.9-1.0 | 918 | 0.968 | 0.976 | 0.008
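The reported ECE can be cross-checked from these bins. The sketch below assumes the standard definition, a count-weighted average of the absolute gap between accuracy and mean confidence; the values are copied from the table and the variable names are illustrative.

```python
# (count, mean confidence, accuracy) for each confidence bin, from the table
bins = [
    (1,   0.058, 0.000),
    (3,   0.190, 0.333),
    (24,  0.264, 0.208),
    (72,  0.351, 0.375),
    (156, 0.451, 0.519),
    (162, 0.549, 0.685),
    (168, 0.650, 0.738),
    (197, 0.751, 0.868),
    (299, 0.853, 0.920),
    (918, 0.968, 0.976),
]

total = sum(count for count, _, _ in bins)   # number of test predictions

# Expected calibration error: count-weighted mean absolute gap
ece = sum(count * abs(acc - conf) for count, conf, acc in bins) / total

# Bins where accuracy exceeds mean confidence (the model is underconfident there)
underconfident_bins = sum(1 for _, conf, acc in bins if acc > conf)
```

The counts sum to 2,000 and the weighted gap comes out at about 0.051, matching the reported value. Accuracy exceeds mean confidence in eight of the ten bins, so the residual miscalibration after temperature scaling is mostly underconfidence.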

Lowest global RAPS class coverage

Breed | Coverage | Mean set size | Misses | Support
great_dane | 0.80 | 3.45 | 4 | 20
lhasa | 0.85 | 2.35 | 3 | 20
tibetan_mastiff | 0.85 | 2.50 | 3 | 20
english_setter | 0.85 | 2.55 | 3 | 20
australian_shepherd | 0.85 | 2.95 | 3 | 20
norwich_terrier | 0.85 | 3.00 | 3 | 20
soft_coated_wheaten_terrier | 0.85 | 3.70 | 3 | 20
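A table like this one can be produced by disaggregating test-set results by true label. The following is a minimal sketch, assuming prediction sets are available as Python sets of label names; the function name and the toy data are hypothetical and do not come from the run.

```python
from collections import defaultdict

def per_class_coverage(pred_sets, labels):
    """Return {label: (coverage, mean set size, misses, support)} per true label."""
    stats = defaultdict(lambda: [0, 0, 0])   # hits, summed set size, support
    for pred, y in zip(pred_sets, labels):
        stats[y][0] += int(y in pred)
        stats[y][1] += len(pred)
        stats[y][2] += 1
    return {y: (hits / n, size / n, n - hits, n)
            for y, (hits, size, n) in stats.items()}

# Toy example with two breeds and two test images each (illustrative only)
pred_sets = [
    {"great_dane", "mastiff"},
    {"mastiff"},
    {"lhasa"},
    {"lhasa", "shih_tzu"},
]
labels = ["great_dane", "great_dane", "lhasa", "lhasa"]

table = per_class_coverage(pred_sets, labels)
```

In the table above, each miss moves coverage by 0.05 because support is 20 per breed: 4 misses give 0.80 and 3 misses give 0.85.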

Weak-class slice

The next question is targeted coverage repair.

The supplementary run targets breeds that reviewers would inspect first: lhasa, tibetan mastiff, great dane, and the worst Mondrian class. The goal is not higher headline accuracy. The goal is tighter sets without hiding class-specific failures.

Breed | Global RAPS coverage | Mondrian RAPS coverage | Family-pooled RAPS coverage | Top-1 accuracy
lhasa | 0.85 | 0.85 | 0.75 | 0.75
tibetan_mastiff | 0.85 | 0.90 | 0.85 | 0.70
great_dane | 0.80 | 0.70 | 0.75 | 0.55

Paper path

Minimum paper bar: replication plus class-conditional repair.

The current result is strong enough for a technical report and portfolio case study. A workshop draft should wait for the next slice: weak-class analysis, structured pooling, and a clean venue fit.