Current research result

Aggregate calibration is strong. Class coverage is still the hard part.

The Tsinghua100 dense run uses 8,000 images across 100 breeds. DINOv2-small prototype scores are temperature-scaled, then evaluated with RAPS conformal prediction sets.
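
As a rough illustration of the temperature-scaling step, the sketch below fits a single temperature by grid search on held-out negative log-likelihood. The function names, the grid, and the toy logits are assumptions made for illustration; they are not taken from the SmartBreeds code, which may fit the temperature differently (for example with a gradient-based optimizer).

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at one temperature."""
    losses = []
    for logits, y in zip(logit_rows, labels):
        p_true = softmax(logits, temperature)[y]
        losses.append(-math.log(max(p_true, 1e-12)))  # clamp to avoid log(0)
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Return the grid temperature with the lowest held-out NLL."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # hypothetical grid: 0.05 .. 10.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))

# Toy validation data (made up): very confident scores, one of four is wrong.
val_logits = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [8.0, 0.0, 0.0], [0.0, 0.0, 8.0]]
val_labels = [0, 1, 1, 2]

T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], T)  # softer than softmax(val_logits[0], 1.0)
```

A fitted temperature above 1 softens overconfident scores without changing their ranking, so Top-1 accuracy is unaffected while confidence moves toward observed accuracy.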

Top-1 accuracy (DINOv2-small prototypes): 0.846
Expected calibration error (ECE) after temperature scaling: 0.051
Selected RAPS aggregate coverage: 0.968
Selected RAPS mean set size: 2.59
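
The coverage and set-size figures come from RAPS prediction sets. The sketch below shows one way such sets can be built, using a non-randomized RAPS variant: each class is scored by the cumulative probability mass down to its rank plus a penalty for ranks beyond `k_reg`, and a threshold is calibrated on held-out data. The values of `lam`, `k_reg`, and `alpha`, the helper names, and the toy probabilities are all assumptions; the configuration actually selected in the study is not stated here.

```python
import math

def raps_scores(probs, lam=0.01, k_reg=2):
    """Map each class to its RAPS score: cumulative mass plus a rank penalty."""
    order = sorted(range(len(probs)), key=lambda c: -probs[c])
    scores, cumulative = {}, 0.0
    for rank, c in enumerate(order, start=1):
        cumulative += probs[c]
        scores[c] = cumulative + lam * max(0, rank - k_reg)
    return scores  # insertion order is rank order, most likely class first

def calibrate_threshold(cal_probs, cal_labels, alpha=0.25, lam=0.01, k_reg=2):
    """Conformal quantile of the true-label scores on calibration data."""
    n = len(cal_labels)
    true_scores = sorted(
        raps_scores(p, lam, k_reg)[y] for p, y in zip(cal_probs, cal_labels)
    )
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return float("inf") if k > n else true_scores[k - 1]

def prediction_set(probs, qhat, lam=0.01, k_reg=2):
    """Classes whose score is within the threshold; never return an empty set."""
    scores = raps_scores(probs, lam, k_reg)
    chosen = [c for c, s in scores.items() if s <= qhat]
    return chosen or [next(iter(scores))]  # fall back to the top-ranked class

# Toy calibration split (made up), three classes.
cal_probs = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
cal_labels = [0, 1, 0, 1]
qhat = calibrate_threshold(cal_probs, cal_labels)

confident_set = prediction_set([0.85, 0.10, 0.05], qhat)
ambiguous_set = prediction_set([0.40, 0.35, 0.25], qhat)
```

Confident inputs get small sets and ambiguous inputs get larger ones, which is why mean set size is reported alongside coverage.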

Method boundary

Measured, not inflated.

SmartBreeds is not presented as a final benchmark or a deployed guarantee. The result is a reproducible calibration study with visible failure modes.

  • Top-1 accuracy is 0.846 and ECE is 0.051 after temperature scaling.
  • Selected RAPS reaches 0.968 aggregate coverage with mean set size 2.59.
  • Per-class disaggregation exposes weak breeds instead of hiding them behind aggregate metrics.
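
The per-class disaggregation in the last bullet can be sketched as follows: coverage is computed once over all examples and once per true class, so a weak breed stays visible even when the aggregate looks strong. The function name and the toy prediction sets are hypothetical and carry no data from the study.

```python
from collections import defaultdict

def coverage_report(pred_sets, labels):
    """Return (aggregate coverage, mean set size, per-class coverage)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for pred_set, y in zip(pred_sets, labels):
        counts[y] += 1
        hits[y] += int(y in pred_set)  # covered if the true class is in the set
    aggregate = sum(hits.values()) / len(labels)
    mean_size = sum(len(s) for s in pred_sets) / len(pred_sets)
    per_class = {c: hits[c] / counts[c] for c in counts}
    return aggregate, mean_size, per_class

# Toy example (made up): class 2 is covered less often than the others.
pred_sets = [[0], [0, 1], [1], [2, 0], [0], [1, 2]]
labels = [0, 0, 1, 2, 2, 2]

aggregate, mean_size, per_class = coverage_report(pred_sets, labels)
worst_class = min(per_class, key=per_class.get)
```

Sorting `per_class` by value gives a ranked list of the breeds that fall furthest below the target coverage.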

[Figure: Reliability diagram for the Tsinghua100 dense run]

Reliability diagram

Temperature scaling brings confidence closer to observed accuracy, while per-class coverage remains uneven.
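
For readers who want to reproduce this kind of plot, the sketch below computes the per-bin statistics behind a reliability diagram and the expected calibration error derived from them. The ten equal-width bins, the function names, and the toy inputs are assumptions; the binning used for the reported ECE is not specified here.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (count, mean confidence, accuracy) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    rows = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            rows.append((len(members), mean_conf, accuracy))
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted average gap between confidence and accuracy across bins."""
    n = len(confidences)
    rows = reliability_bins(confidences, correct, n_bins)
    return sum(count / n * abs(conf - acc) for count, conf, acc in rows)

# Toy inputs (made up): top-class confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]

rows = reliability_bins(confidences, correct)
ece = expected_calibration_error(confidences, correct)
```

Plotting each bin's accuracy against its mean confidence gives the reliability diagram; points on the diagonal indicate well-calibrated confidence.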