Reliability ledger

Artifact-backed Tsinghua100 dense results.

Public metrics for the frozen DINOv2-small SmartBreeds research harness: accuracy, calibration, conformal coverage, per-breed coverage, and the weak-class false-inclusion diagnostic.


Headline metrics

Current artifact summary

Weak-class diagnostic

Combined-confuser false-inclusion

These values come from the local target-vs-confuser probe. They are diagnostic rows, not replacements for the 100-way classifier.
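The page does not define false-inclusion, so the following is only a sketch under one plausible reading: for a target breed, the share of examples whose true label is a confuser breed but whose prediction set still contains the target. The function name, the argument names, and the confuser breeds in the usage note are hypothetical and are not taken from the harness.

```python
def false_inclusion_rate(prediction_sets, true_labels, target, confusers):
    """Share of confuser-labelled examples whose set wrongly includes `target`.

    ASSUMPTION: this definition is one plausible reading of "false-inclusion";
    the source page does not spell out the formula.
    """
    confusers = set(confusers)
    # Keep only the examples whose true label is one of the confuser breeds.
    relevant = [
        pred for pred, label in zip(prediction_sets, true_labels)
        if label in confusers
    ]
    if not relevant:
        return None  # No confuser examples, so the rate is undefined.
    wrongly_included = sum(1 for pred in relevant if target in pred)
    return wrongly_included / len(relevant)


# Hypothetical toy data: breed names other than the target are illustrative.
example_sets = [
    {"tibetan_mastiff", "newfoundland"},
    {"newfoundland"},
    {"tibetan_mastiff"},
    {"leonberger", "tibetan_mastiff"},
]
example_labels = ["newfoundland", "newfoundland", "tibetan_mastiff", "leonberger"]
example_rate = false_inclusion_rate(
    example_sets, example_labels, "tibetan_mastiff", ["newfoundland", "leonberger"]
)
```

In this toy data three examples carry a confuser label and two of their sets contain the target, so the rate is two thirds. A lower value means the target breed intrudes less often into the sets of its confusers.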

High-FI cluster experiment

Structured pooling reduces the Tibetan Mastiff false-inclusion row.


Breed | Baseline FI | Structured FI | Structured cov | Hierarchical FI | Hierarchical cov

These rows are diagnostic gates, not replacements for the global 100-way RAPS predictor.

Weak-class recovery

Structured pooling separates headline and recovery rows.


Breed | Baseline FI | Recovery FI | Recovery cov | top-k | quorum | Status

Recovery rows are selected by calibration-only gates.

External validation

Stanford Dogs is a stress test, not the headline.


Breed | Base FI | Structured FI | Coverage | top-k | Missing | Status

External validation rows are claim-limiting stress-test evidence.

Per-class breakdown

Coverage and calibration by breed

Each breed has 20 held-out test examples. Per-class ECE is a diagnostic value computed over 10 top-confidence bins within that breed's subset.
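As a rough illustration of a 10-bin top-confidence ECE, the sketch below bins each example by the confidence of its top prediction and averages the gap between accuracy and mean confidence, weighted by bin size. The function name and the equal-width binning are assumptions; the harness may bin differently. With only 20 examples per breed most bins are empty or nearly so, which is why the value is best read as a diagnostic.

```python
def top_confidence_ece(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width top-confidence bins.

    ASSUMPTION: equal-width bins [0, 0.1), ..., [0.9, 1.0]; the source only
    says "10-bin top-confidence" and does not fix the binning scheme.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one example is required")

    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += int(hit)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # Empty bins contribute nothing.
        mean_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy subset: two confident correct predictions, two uncertain ones (one wrong).
example_ece = top_confidence_ece([0.95, 0.95, 0.55, 0.55], [True, True, True, False])
```

Here each occupied bin has a 0.05 gap between accuracy and mean confidence and holds half the examples, so the toy ECE is 0.05.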


100 breeds

Breed | Coverage | ECE | Set size | Misses

Methodology

What this endpoint proves

Protocol

Frozen DINOv2-small embeddings, nearest-prototype classification, temperature scaling on the calibration split, and selected global RAPS at target coverage 0.90.
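The last two protocol steps can be sketched as follows: a temperature-scaled softmax, then a RAPS threshold fitted on calibration data and applied to build prediction sets. This is a minimal, non-randomized variant for readability, not the harness's code. The function names, the regularization parameters `lam` and `k_reg`, and the toy probabilities are all assumptions. A target coverage of 0.90 corresponds to `alpha = 0.10`.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; larger temperatures flatten the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def raps_scores(probs, lam, k_reg):
    """Non-randomized RAPS score per class: cumulative mass plus rank penalty."""
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    scores = [0.0] * len(probs)
    cumulative = 0.0
    for rank, cls in enumerate(order, start=1):
        cumulative += probs[cls]
        scores[cls] = cumulative + lam * max(0, rank - k_reg)
    return scores


def calibrate_threshold(cal_probs, cal_labels, alpha, lam, k_reg):
    """Conformal quantile of true-label scores on the calibration split."""
    true_scores = sorted(
        raps_scores(p, lam, k_reg)[y] for p, y in zip(cal_probs, cal_labels)
    )
    n = len(true_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points the guarantee needs the trivial full set.
    return math.inf if k > n else true_scores[k - 1]


def prediction_set(probs, qhat, lam, k_reg):
    """All classes whose RAPS score does not exceed the calibrated threshold."""
    scores = raps_scores(probs, lam, k_reg)
    return [cls for cls, s in enumerate(scores) if s <= qhat]


# Hypothetical 3-class calibration data (the real task is 100-way).
cal_probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.4, 0.35, 0.25]]
cal_labels = [0, 1, 0, 2]
qhat = calibrate_threshold(cal_probs, cal_labels, alpha=0.5, lam=0.1, k_reg=1)
example_set = prediction_set([0.5, 0.3, 0.2], qhat, lam=0.1, k_reg=1)
```

The toy call uses `alpha=0.5` only because four calibration points cannot support a 0.90 target: with `alpha=0.10` the threshold becomes infinite and every class is included. The full RAPS procedure also randomizes inclusion of the boundary class, which this sketch leaves out.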

Refresh


Boundary

This is a research-harness result on a Tsinghua Dogs subset. It is not a full benchmark claim, a production classifier guarantee, or permission to publish dataset-derived dog images.