Current research result
Aggregate calibration is strong. Class coverage is still the hard part.
The Tsinghua100 dense run uses 8,000 images across 100 breeds. DINOv2-small prototype scores are temperature-scaled, then used to build RAPS (Regularized Adaptive Prediction Sets) conformal prediction sets, which are evaluated for coverage and set size.
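As an illustration of the temperature-scaling step, here is a minimal sketch that fits a single temperature by grid search over held-out negative log-likelihood. The page does not say how the temperature was fitted, so the grid search, the function names (`log_softmax`, `mean_nll`, `fit_temperature`) and the grid range are all assumptions made for this example.

```python
import math

def log_softmax(scores, temperature):
    """Log-probabilities of softmax(scores / temperature), computed stably."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def mean_nll(score_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at one temperature."""
    total = 0.0
    for scores, y in zip(score_rows, labels):
        total -= log_softmax(scores, temperature)[y]
    return total / len(labels)

def fit_temperature(score_rows, labels, grid=None):
    """Pick the temperature with the lowest held-out NLL (hypothetical grid)."""
    if grid is None:
        grid = [0.05 * i for i in range(1, 201)]  # 0.05 to 10.0, an assumption
    return min(grid, key=lambda t: mean_nll(score_rows, labels, t))

# Toy held-out set: very confident scores, one of four predictions wrong.
rows = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]
T = fit_temperature(rows, labels)
print(T)  # a value above 1 softens the overconfident scores
```

A temperature above 1 flattens the probabilities and a temperature below 1 sharpens them. The argmax is unchanged either way, so Top-1 accuracy is unaffected by this step.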
0.846
Top-1 accuracy from DINOv2-small prototypes.
0.051
Expected calibration error after temperature scaling.
0.968
Selected RAPS aggregate coverage.
2.59
Mean selected RAPS set size.
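To make the coverage and set-size figures concrete, the sketch below shows one way RAPS sets can be built and scored. It uses the non-randomized variant of RAPS for simplicity. The parameter names and defaults (`lam`, `k_reg`, `alpha`) are assumptions for illustration, not the configuration behind the numbers above.

```python
import math

def raps_scores(probs, lam=0.01, k_reg=2):
    """Non-randomized RAPS score per class: cumulative probability down to
    that class's rank, plus a penalty for ranks beyond k_reg."""
    order = sorted(range(len(probs)), key=lambda c: -probs[c])
    scores, cumulative = {}, 0.0
    for rank, c in enumerate(order, start=1):
        cumulative += probs[c]
        scores[c] = cumulative + lam * max(0, rank - k_reg)
    return scores

def calibrate_threshold(cal_probs, cal_labels, alpha=0.05, **kw):
    """Conformal quantile of the true-label scores on a calibration split."""
    true_scores = sorted(raps_scores(p, **kw)[y]
                         for p, y in zip(cal_probs, cal_labels))
    n = len(true_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return float("inf") if k > n else true_scores[k - 1]

def prediction_set(probs, tau, **kw):
    """All classes whose score is at or below the threshold."""
    return sorted(c for c, s in raps_scores(probs, **kw).items() if s <= tau)

def coverage_and_size(test_probs, test_labels, tau, **kw):
    """Aggregate coverage and mean set size over a test split."""
    sets = [prediction_set(p, tau, **kw) for p in test_probs]
    covered = sum(y in s for s, y in zip(sets, test_labels))
    return covered / len(sets), sum(len(s) for s in sets) / len(sets)
```

In this plain form a set can be empty when even the top class scores above the threshold. Implementations often force the top class in or use the randomized variant, and which choice was made here is not stated.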
Method boundary
Measured, not inflated.
SmartBreeds is not presented as a final benchmark or a deployed guarantee. The result is a reproducible calibration study with visible failure modes.
- Top-1 accuracy is 0.846 and ECE is 0.051 after temperature scaling.
- Selected RAPS reaches 0.968 aggregate coverage with mean set size 2.59.
- Per-class disaggregation exposes weak breeds instead of hiding them behind aggregate metrics.
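The per-class disaggregation in the last bullet can be sketched as follows: group test examples by true breed and compute coverage within each group. The helper names and the `target` threshold are hypothetical; the page does not state which cutoff marks a breed as weak.

```python
from collections import defaultdict

def per_class_coverage(pred_sets, labels):
    """Fraction of examples of each true class whose set contains that class."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred_set, y in zip(pred_sets, labels):
        totals[y] += 1
        hits[y] += int(y in pred_set)
    return {c: hits[c] / totals[c] for c in totals}

def weakest_classes(coverage, target=0.95):
    """Classes below the target coverage, worst first (target is an assumption)."""
    below = [c for c in coverage if coverage[c] < target]
    return sorted(below, key=lambda c: coverage[c])

# Toy example: aggregate coverage is 4/5 = 0.8, but class 2 sits at 0.5.
pred_sets = [{0}, {0, 1}, {1}, {2}, {0}]
labels = [0, 1, 1, 2, 2]
coverage = per_class_coverage(pred_sets, labels)
print(weakest_classes(coverage, target=0.9))  # [2]
```

With 8,000 images spread over 100 breeds, each breed has only a few dozen examples at most in any held-out split, so per-class coverage estimates are noisy and are best read alongside their sample counts.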
Reliability diagram
Temperature scaling brings confidence closer to observed accuracy, while per-class coverage remains uneven.
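For reference, a reliability diagram and the ECE reported above both come from the same binned statistics. The sketch below assumes equal-width confidence bins and a hypothetical default of 10 bins; the binning used for the reported 0.051 is not stated on this page.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (count, mean confidence, accuracy) for each non-empty bin."""
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        counts[idx] += 1
        conf_sums[idx] += conf
        hit_sums[idx] += int(ok)
    return [(counts[i], conf_sums[i] / counts[i], hit_sums[i] / counts[i])
            for i in range(n_bins) if counts[i] > 0]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    n = len(confidences)
    return sum(count / n * abs(acc - conf)
               for count, conf, acc in reliability_bins(confidences, correct, n_bins))

# Toy example: two bins, each off by 0.05.
confs = [0.95, 0.95, 0.55, 0.55]
hits = [1, 1, 1, 0]
print(round(expected_calibration_error(confs, hits), 3))  # 0.05
```

Plotting each bin's accuracy against its mean confidence gives the diagram; points on the diagonal indicate well-calibrated bins.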