2026-05-21

TabPFN-v2 lost to v3.3 GradientBoosting (stratified F1 0.719 vs 0.729, LOCO 0.419 vs 0.870) — kept v3.3

We tested TabPFN-v2 (Hollmann et al. 2023, arXiv:2207.01848) as a v3.5 classifier candidate on the same dataset that v3.3 GradientBoosting uses: 4,237 samples, 1,116 positives, 131 countries, 16 features. Published TabPFN benchmarks suggested a gain of 5-9 pp F1 on small (<10K) tabular data. On our dataset the result was the opposite: stratified 5-fold F1 was 0.719 ± 0.031 (one point below the v3.3 baseline of 0.729), and LOCO median F1 on the sampled 30 largest countries was 0.419 (less than half the v3.3 median of 0.870). TabPFN-v2 failed both promotion gates (0.78 stratified, 0.85 LOCO). v3.3 stays in production unchanged. Honest negative result.
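For readers who want to see the shape of the gate check, here is a minimal sketch. It assumes LOCO means leave-one-country-out and that a candidate must meet or exceed both thresholds; neither detail is spelled out above. The helper names and the `fit_predict` callback are hypothetical and are not the production pipeline.

```python
from statistics import median

# Gate thresholds as stated in this entry.
STRATIFIED_GATE = 0.78
LOCO_GATE = 0.85


def f1(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def loco_median_f1(countries, y_true, fit_predict, held_out):
    """Median F1 across held-out countries (assumed leave-one-country-out).

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains on
    train_idx and returns 0/1 predictions for the rows in test_idx.
    """
    scores = []
    for country in held_out:
        test_idx = [i for i, c in enumerate(countries) if c == country]
        train_idx = [i for i, c in enumerate(countries) if c != country]
        y_pred = fit_predict(train_idx, test_idx)
        scores.append(f1([y_true[i] for i in test_idx], y_pred))
    return median(scores)


def passes_gates(stratified_f1, loco_f1):
    """Assumption: promotion requires meeting or exceeding both thresholds."""
    return stratified_f1 >= STRATIFIED_GATE and loco_f1 >= LOCO_GATE


# TabPFN-v2 summary numbers reported in this entry.
print(passes_gates(0.719, 0.419))  # prints False
```

The median is used for LOCO because that is the statistic reported here; it keeps one very small or very unusual country from dominating the summary.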

#ml #classifier #negative-result #tabpfn #hollmann-23 #honest #kept-v3.3 #transformer #small-data

Raw data

Production classifier v3.3 info
Hollmann et al. 2023 (TabPFN)
TabPFN 2.2.1 PyPI