Technical case study

Calibrated Dog-Breed Vision

A Tsinghua100 subset run with DINOv2-small prototypes, temperature scaling, RAPS conformal prediction sets, and explicit weak-class reporting.

0.968 Global RAPS coverage.

2.59 Mean global RAPS set size.

0.9165 Label-conditional coverage.

0.60 Worst observed class coverage.

Abstract

Accuracy alone is not the result.

The core experiment asks whether a strong pretrained vision backbone can produce calibrated dog-breed prediction sets that stay useful as the label space grows. The strongest aggregate result is global RAPS with the selected parameters: 0.968 coverage at a mean set size of 2.59 on a 100-breed Tsinghua Dogs subset. The main unresolved issue is class-conditional coverage: weak breeds still fall below the headline result, with a worst observed class coverage of 0.60.

Dataset: Tsinghua Dogs subset, 100 breeds, 8,000 images, documented split protocol.
Backbone: frozen DINOv2-small embeddings with nearest-prototype classification.
Calibration: temperature scaling selected on calibration data, then evaluated on held-out test data.
Conformal layer: no-test-tuning RAPS parameter selection with per-class disaggregation.
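
As an illustration of the backbone line above, the following is a minimal sketch of nearest-prototype classification over frozen embeddings. It assumes cosine similarity to L2-normalized class means; the source does not state the distance or normalization used, and the two-dimensional vectors and function names are hypothetical stand-ins for real DINOv2-small embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_prototypes(embeddings, labels):
    """Average the unit-length embeddings of each class, then renormalize."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        unit = l2_normalize(emb)
        if lab not in sums:
            sums[lab] = [0.0] * len(unit)
            counts[lab] = 0
        sums[lab] = [s + u for s, u in zip(sums[lab], unit)]
        counts[lab] += 1
    return {
        lab: l2_normalize([s / counts[lab] for s in vec])
        for lab, vec in sums.items()
    }


def prototype_logits(embedding, prototypes):
    """Cosine similarity to every prototype, used here as the logit z_y."""
    unit = l2_normalize(embedding)
    return {
        lab: sum(u * p for u, p in zip(unit, proto))
        for lab, proto in prototypes.items()
    }


# Toy 2-D embeddings (assumption: real embeddings are much higher-dimensional).
train_x = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
train_y = ["lhasa", "lhasa", "great_dane", "great_dane"]

protos = build_prototypes(train_x, train_y)
logits = prototype_logits([0.8, 0.2], protos)
predicted = max(logits, key=logits.get)
print(predicted)
```

Because the backbone stays frozen, only the prototypes depend on the training split, which keeps the calibration and conformal steps cheap to rerun.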

The live reliability page mirrors the current artifact-backed Tsinghua100 dense metrics, including per-breed coverage, per-class ECE, fixed top-three high-FI (false inclusion) cluster rows, and separate coverage-aware recovery rows used by the paper draft.


External validation boundary

A Stanford Dogs weak-cluster stress test adds 280 images across 14 labels. Mean false inclusion falls from 0.0604 to 0.0125 under the transferred structured-pooling rule, but this is not a generalization claim: Stanford Dogs does not include fila_braziliero, and tibetan_mastiff misses the 0.90 coverage gate at 0.8500.

Research preview

A 96-second narrated summary of the Tsinghua100 result, the RAPS set-size reduction, and the weak-class failure mode.

Theory sketch

The math is simple enough to audit, but strict enough to constrain the claim.

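Conformal prediction wraps calibrated model scores with a finite-sample coverage target. The promise is not that every prediction is correct. The promise is that, under exchangeability and a fixed procedure, the true label appears in the prediction set with at least the chosen probability, averaged over test points. That guarantee is marginal: it does not by itself hold for each breed separately, which is why class-conditional results are reported on their own.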

Temperature scaling

Choose one scalar temperature on calibration data, then evaluate once on test data.

p_T(y | x) = softmax(z / T)_y = exp(z_y / T) / sum_j exp(z_j / T), where z is the logit vector for x
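
A minimal sketch of the temperature step, assuming the scalar is chosen by minimizing negative log-likelihood on calibration data with a simple grid search. Both the objective and the grid are assumptions; the source only says that one temperature is selected on calibration data. The logits and labels are toy values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the grid temperature with the lowest calibration NLL."""
    grid = grid or [0.05 * i for i in range(1, 201)]  # 0.05 .. 10.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy calibration split: confident logits, but the third example is wrong.
cal_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
cal_labels = [0, 1, 1, 2]

best_t = fit_temperature(cal_logits, cal_labels)
nll_before = mean_nll(cal_logits, cal_labels, 1.0)
nll_after = mean_nll(cal_logits, cal_labels, best_t)
print(best_t, nll_before, nll_after)
```

On this toy split the fitted temperature is above 1, which softens the top probability to roughly 0.75, close to the 3-in-4 accuracy. The test split plays no role in the fit.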

RAPS score

Sort class probabilities, keep the cumulative mass term, and add a rank penalty after the regularization cutoff.

s(x,y) = APS(x,y) + lambda · max(rank(y) - k, 0)
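
The score above can be sketched as follows. This is the deterministic variant, where APS(x,y) is the cumulative probability mass down to and including the label's rank; whether the run uses the randomized variant is not stated in the source. The values of lambda and k below are illustrative, not the selected parameters.

```python
def raps_score(probs, label, lam, k_reg):
    """Deterministic RAPS score for one candidate label.

    probs  : list of class probabilities for one image
    label  : index of the candidate label
    lam    : penalty weight (lambda in the formula)
    k_reg  : regularization cutoff (k in the formula)
    """
    # Rank classes from most to least probable; ranks start at 1.
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    rank = order.index(label) + 1
    # APS term: mass of every class ranked at or above the candidate.
    aps = sum(probs[j] for j in order[:rank])
    # Rank penalty applies only beyond the cutoff.
    return aps + lam * max(rank - k_reg, 0)


probs = [0.60, 0.25, 0.10, 0.05]
top_score = raps_score(probs, label=0, lam=0.1, k_reg=1)   # rank 1, no penalty
deep_score = raps_score(probs, label=2, lam=0.1, k_reg=1)  # rank 3, penalty 0.2
print(top_score, deep_score)
```

The penalty makes low-ranked labels more expensive to include, which is what pushes set sizes down relative to plain APS.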

Prediction set

Use the calibration quantile as the threshold and include every label whose score is at or below it.

C(x) = { y : s(x,y) <= q_hat }
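
A minimal sketch of the threshold and set construction, assuming the standard split-conformal quantile with the finite-sample correction, that is, the ceil((n + 1)(1 - alpha))-th smallest calibration score. The calibration scores, the alpha value, and the per-label scores are toy values, not numbers from the run.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Finite-sample quantile of calibration scores for miscoverage alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every label.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(scores_by_label, q_hat):
    """Keep every label whose conformal score is at or below the threshold."""
    return [label for label, s in scores_by_label.items() if s <= q_hat]


# 19 toy calibration scores: 0.05, 0.10, ..., 0.95.
cal_scores = [round(0.05 * i, 2) for i in range(1, 20)]
q_hat = conformal_quantile(cal_scores, alpha=0.1)

# Hypothetical RAPS scores for one test image.
test_scores = {"lhasa": 0.52, "tibetan_mastiff": 0.81, "great_dane": 1.05}
pred_set = prediction_set(test_scores, q_hat)
print(q_hat, pred_set)
```

The threshold is fixed from calibration data alone, so test labels never influence which labels enter the set.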

Tsinghua100 dense result

RAPS improves set tightness, but class-level coverage remains uneven.

The table separates aggregate coverage, mean set size, and worst-class behavior. This is the key framing: a method can look strong globally and still fail a reviewer’s class-conditional check.

Method	Coverage	Mean set size	Worst class	Interpretation
Global RAPS	0.968	2.5885	great_dane, 0.80	Best aggregate result, still not uniform across classes.
Mondrian RAPS	0.9165	3.2055	bluetick, 0.60	Class-conditional evaluation exposes low-coverage breeds.
Family-pooled RAPS	0.9250	2.0835	english_setter, 0.60	Promising tightness, but weak classes still need targeted analysis.
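
For readers unfamiliar with the Mondrian row, the sketch below shows the general idea of label-conditional calibration: each class gets its own threshold from its own calibration scores. It is a generic illustration under that assumption, not the exact procedure behind the table, and the class names and scores are hypothetical.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile; infinite if the group is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return float("inf") if k > n else sorted(scores)[k - 1]


def mondrian_thresholds(cal_scores, cal_labels, alpha):
    """One threshold per class, from that class's true-label scores only."""
    by_class = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_class[label].append(score)
    return {lab: conformal_quantile(v, alpha) for lab, v in by_class.items()}


def mondrian_set(scores_by_label, thresholds):
    """Include a label when its score clears that label's own threshold."""
    return [
        lab for lab, s in scores_by_label.items()
        if s <= thresholds.get(lab, float("inf"))
    ]


# Toy calibration data: "hard_breed" tends to receive larger scores.
easy = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.70]
hard = [0.50, 0.60, 0.70, 0.80, 0.90, 1.00, 1.10, 1.20, 1.30]
cal_scores = easy + hard
cal_labels = ["easy_breed"] * 9 + ["hard_breed"] * 9

thresholds = mondrian_thresholds(cal_scores, cal_labels, alpha=0.1)
pred = mondrian_set({"easy_breed": 0.9, "hard_breed": 0.9}, thresholds)
print(thresholds, pred)
```

Per-class thresholds rest on far fewer calibration points than a global threshold, which is one reason class-conditional sets can be larger and per-class coverage noisier.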

Figures

The charts make the limitation visible.

These public visuals use synthetic dog imagery and aggregate research figures. Dataset-derived photos remain private until license review.

Reliability diagram for the Tsinghua100 dense run, after temperature scaling. The reported ECE is 0.051.

Per-class coverage chart for Tsinghua100. Weak breeds drive the next research slice.

Coverage versus set size chart for RAPS settings. RAPS reduces set size relative to no-penalty conformal rows.

What the figures do not claim

They do not establish a production guarantee, a final public benchmark, or breed-level fairness. They document a research result and the next failure mode.

Reliability bins

Confidence bin	Count	Mean confidence	Accuracy	Gap
0.0-0.1	1	0.058	0.000	0.058
0.1-0.2	3	0.190	0.333	0.143
0.2-0.3	24	0.264	0.208	0.056
0.3-0.4	72	0.351	0.375	0.024
0.4-0.5	156	0.451	0.519	0.068
0.5-0.6	162	0.549	0.685	0.137
0.6-0.7	168	0.650	0.738	0.088
0.7-0.8	197	0.751	0.868	0.117
0.8-0.9	299	0.853	0.920	0.066
0.9-1.0	918	0.968	0.976	0.008
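
The reported ECE can be checked from the table above. The sketch below assumes the usual definition of ECE as the count-weighted mean of the per-bin gaps and uses the rounded Gap column as printed.

```python
# (count, mean confidence, accuracy, gap) rows copied from the table above.
bins = [
    (1, 0.058, 0.000, 0.058),
    (3, 0.190, 0.333, 0.143),
    (24, 0.264, 0.208, 0.056),
    (72, 0.351, 0.375, 0.024),
    (156, 0.451, 0.519, 0.068),
    (162, 0.549, 0.685, 0.137),
    (168, 0.650, 0.738, 0.088),
    (197, 0.751, 0.868, 0.117),
    (299, 0.853, 0.920, 0.066),
    (918, 0.968, 0.976, 0.008),
]

total = sum(count for count, _, _, _ in bins)
# Expected calibration error: per-bin gaps weighted by bin population.
ece = sum(count * gap for count, _, _, gap in bins) / total

print(total, round(ece, 3))  # prints 2000 0.051
```

The bin counts sum to 2,000 test predictions, consistent with 100 breeds at a support of 20 each, and the weighted gap reproduces the reported 0.051.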

Lowest global RAPS class coverage

Breed	Coverage	Mean set size	Misses	Support
great_dane	0.80	3.45	4	20
lhasa	0.85	2.35	3	20
tibetan_mastiff	0.85	2.50	3	20
english_setter	0.85	2.55	3	20
australian_shepherd	0.85	2.95	3	20
norwich_terrier	0.85	3.00	3	20
soft_coated_wheaten_terrier	0.85	3.70	3	20
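
The columns above come from disaggregating coverage by true label. A minimal sketch of that bookkeeping follows; the prediction sets are synthetic and built only to mirror the miss counts of two rows, not taken from the run.

```python
from collections import defaultdict


def per_class_coverage(true_labels, prediction_sets):
    """Coverage, misses, and support for each true label."""
    support = defaultdict(int)
    hits = defaultdict(int)
    for label, pred_set in zip(true_labels, prediction_sets):
        support[label] += 1
        hits[label] += int(label in pred_set)
    return {
        label: {
            "coverage": hits[label] / support[label],
            "misses": support[label] - hits[label],
            "support": support[label],
        }
        for label in support
    }


# Synthetic example: 20 images per breed, with 4 and 3 misses respectively.
true_labels = ["great_dane"] * 20 + ["lhasa"] * 20
prediction_sets = (
    [{"great_dane", "mastiff"}] * 16 + [{"mastiff"}] * 4
    + [{"lhasa"}] * 17 + [{"shih_tzu"}] * 3
)

rows = per_class_coverage(true_labels, prediction_sets)
worst = min(rows, key=lambda lab: rows[lab]["coverage"])
print(worst, rows[worst])
```

With a support of 20, each miss moves a class's coverage by 0.05, so single-class figures such as 0.80 versus 0.85 differ by one image.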

Weak-class slice

The next question is targeted coverage repair.

The supplementary run targets breeds that reviewers would inspect first: lhasa, tibetan_mastiff, great_dane, and the worst Mondrian class (bluetick, 0.60 in the dense result). The goal is not higher headline accuracy. The goal is tighter sets without hiding class-specific failures.

Breed	Global RAPS	Mondrian RAPS	Family pooled	Top-1
lhasa	0.85	0.85	0.75	0.75
tibetan_mastiff	0.85	0.90	0.85	0.70
great_dane	0.80	0.70	0.75	0.55

Paper path

Minimum paper bar: replication plus class-conditional repair.

The current result is strong enough for a technical report and portfolio case study. A workshop draft should wait for the next slice: weak-class analysis, structured pooling, and a clean venue fit.
