Current research result

Aggregate calibration is strong. Per-class coverage is still the hard part.

The Tsinghua100 dense run uses 8,000 images across 100 breeds. DINOv2-small prototype scores are temperature-scaled, then evaluated with RAPS conformal prediction sets.
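As a rough illustration of the scoring step, the sketch below builds one prototype per breed and turns prototype scores into temperature-scaled probabilities. It is a minimal sketch under stated assumptions, not the project's code: mean-of-embeddings prototypes, cosine similarity as the score, and a grid search over negative log-likelihood to choose the temperature are all assumptions, and every function name is hypothetical. The embeddings stand in for DINOv2-small features computed elsewhere.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_prototypes(embeddings, labels):
    """One prototype per class: the renormalized mean of its unit-length embeddings (assumed design)."""
    grouped = {}
    for emb, lab in zip(embeddings, labels):
        grouped.setdefault(lab, []).append(l2_normalize(emb))
    return {
        lab: l2_normalize([sum(col) / len(vecs) for col in zip(*vecs)])
        for lab, vecs in grouped.items()
    }

def prototype_scores(embedding, prototypes, classes):
    """Cosine similarity between one embedding and each class prototype (assumed score)."""
    emb = l2_normalize(embedding)
    return [sum(a * b for a, b in zip(emb, prototypes[c])) for c in classes]

def log_softmax(scores, temperature):
    """Numerically stable log-probabilities of scores divided by the temperature."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]

def softmax(scores, temperature):
    return [math.exp(lp) for lp in log_softmax(scores, temperature)]

def fit_temperature(score_rows, label_indices, grid):
    """Pick the grid temperature with the lowest mean negative log-likelihood on held-out data."""
    def nll(t):
        total = sum(-log_softmax(row, t)[y] for row, y in zip(score_rows, label_indices))
        return total / len(score_rows)
    return min(grid, key=nll)
```

Temperature scaling divides every score by the same positive constant, so it changes confidence but never the ranking of breeds. Top-1 accuracy is therefore the same before and after scaling.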

0.846 Top-1 accuracy from DINOv2-small prototypes.

0.051 Expected calibration error after temperature scaling.

0.968 Selected RAPS aggregate coverage.

2.59 Mean selected RAPS set size.
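For readers unfamiliar with how coverage and set size are measured, here is a minimal sketch of RAPS-style prediction sets. It uses the deterministic (non-randomized) variant, and the penalty weight lam and rank offset k_reg are placeholder values, not the settings selected in this study. All function names are hypothetical.

```python
import math

def raps_scores(probs, lam=0.01, k_reg=5):
    """RAPS score per class: cumulative probability down to that class's rank,
    plus a penalty of lam for every rank beyond k_reg."""
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    scores = [0.0] * len(probs)
    cumulative = 0.0
    for rank, c in enumerate(order, start=1):
        cumulative += probs[c]
        scores[c] = cumulative + lam * max(0, rank - k_reg)
    return scores

def calibrate_threshold(cal_probs, cal_labels, alpha, **kw):
    """Conformal quantile of the true-class scores on a held-out calibration split."""
    scores = sorted(raps_scores(p, **kw)[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every class is included
    return scores[k - 1]

def prediction_set(probs, qhat, **kw):
    """All classes whose RAPS score does not exceed the calibrated threshold."""
    return [c for c, s in enumerate(raps_scores(probs, **kw)) if s <= qhat]

def coverage_and_size(test_probs, test_labels, qhat, **kw):
    """Aggregate coverage (fraction of sets containing the true label) and mean set size."""
    sets = [prediction_set(p, qhat, **kw) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    mean_size = sum(len(s) for s in sets) / len(sets)
    return coverage, mean_size
```

The conformal guarantee behind this construction is marginal: it holds on average over all test images, assuming the calibration and test data are exchangeable. It does not promise the same coverage for each breed. Without randomization, a set can also be empty when the top probability alone exceeds the threshold.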

Method boundary

Measured, not inflated.

SmartBreeds is not presented as a final benchmark or a deployed guarantee. The result is a reproducible calibration study with visible failure modes.

Top-1 accuracy is 0.846 and ECE is 0.051 after temperature scaling.
Selected RAPS reaches 0.968 aggregate coverage with mean set size 2.59.
Per-class disaggregation exposes weak breeds instead of hiding them behind aggregate metrics.
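The per-class disaggregation in the last point can be sketched as follows. This is an illustrative helper rather than the study's own code: the function names are hypothetical, and the inputs are assumed to be one prediction set and one true label per test image.

```python
from collections import defaultdict

def per_class_coverage(pred_sets, labels):
    """Fraction of each class's test images whose prediction set contains the true label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred_set, y in zip(pred_sets, labels):
        totals[y] += 1
        hits[y] += int(y in pred_set)
    return {c: hits[c] / totals[c] for c in totals}

def weakest_classes(coverage_by_class, target):
    """Classes whose coverage falls below the target, worst first."""
    below = [(c, cov) for c, cov in coverage_by_class.items() if cov < target]
    return sorted(below, key=lambda item: item[1])
```

A toy case shows why the breakdown matters. With five images where classes 0 and 1 are always covered and class 2 is covered once in two tries, aggregate coverage is 0.8 while class 2 sits at 0.5.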

Reliability diagram

Temperature scaling brings confidence closer to observed accuracy, while per-class coverage remains uneven.
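A reliability diagram and the ECE figure come from the same binned comparison of confidence and accuracy. The sketch below shows one common way to compute both. The equal-width binning and the default of 15 bins are assumptions, since the page does not state the binning used, and the function names are hypothetical.

```python
def reliability_bins(confidences, correct, n_bins=15):
    """Group predictions into equal-width confidence bins.
    Returns (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    summary = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            summary.append((mean_conf, accuracy, len(members)))
    return summary

def expected_calibration_error(confidences, correct, n_bins=15):
    """Count-weighted average gap between confidence and accuracy across bins."""
    n = len(confidences)
    return sum(
        count / n * abs(mean_conf - accuracy)
        for mean_conf, accuracy, count in reliability_bins(confidences, correct, n_bins)
    )
```

Plotting accuracy against mean confidence for each bin gives the diagram, and a perfectly calibrated model lies on the diagonal. For example, four predictions at confidence 0.9 with three correct give a gap of 0.15 in a single bin, so the ECE is 0.15.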