voidly

TabPFN-v2 lost to v3.3 GradientBoosting (stratified F1 0.719 vs 0.729, LOCO 0.419 vs 0.870) — kept v3.3

We tested TabPFN-v2 (Hollmann et al. 2023, arXiv:2207.01848) as a v3.5 classifier candidate on the same 4,237-sample / 1,116-positive / 131-country / 16-feature dataset that v3.3 GradientBoosting uses. Published TabPFN benchmarks suggested a gain of 5-9pp F1 on small (under 10K rows) tabular data. On our dataset the result was the opposite: stratified 5-fold F1 was 0.719 +/- 0.031 (one point below the v3.3 baseline of 0.729), and median F1 under leave-country-out (LOCO) on a sample of the 30 largest countries was 0.419 (less than half of v3.3's median of 0.870). TabPFN failed both promotion gates (0.78 stratified, 0.85 LOCO). v3.3 stays in production unchanged. Honest negative result.

#ml #classifier #negative-result #tabpfn #hollmann-23 #honest #kept-v3.3 #transformer #small-data

What we tested

Prior ML research suggested that TabPFN-v2 (a pre-trained transformer for tabular classification, Hollmann et al. 2023) might beat our v3.3 GradientBoosting classifier by 5-9pp F1 on our 4,237-sample censorship-incident dataset. Published benchmarks place TabPFN's sweet spot at under 10K rows with single-digit feature counts, which is close to our shape (4,237 rows x 16 features: the row count fits, and the feature count is somewhat above that range). We ran the comparison apples-to-apples: the same dataset (labeled_incidents_v3.3.json), the same 16-feature input, and the same stratified 5-fold and leave-country-out (LOCO) evaluation methodology that v3.3 was evaluated under.
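The leave-country-out protocol can be sketched as follows. This is a minimal illustration, not the actual evaluation script: the function name loco_splits is hypothetical, and it assumes "largest" means the countries with the most labeled samples.

```python
from collections import Counter


def loco_splits(countries, top_n=30):
    """Yield (held_out_country, train_idx, test_idx), one split per country.

    Assumption: the "largest" countries are those with the most samples.
    """
    counts = Counter(countries)
    held_out = [cc for cc, _ in counts.most_common(top_n)]
    for cc in held_out:
        # Every sample from the held-out country goes to the test fold,
        # so the model never sees that country during fitting.
        test_idx = [i for i, c in enumerate(countries) if c == cc]
        train_idx = [i for i, c in enumerate(countries) if c != cc]
        yield cc, train_idx, test_idx


# Toy example with made-up country labels, one per sample.
toy_countries = ["IR", "IR", "IR", "US", "US", "SG"]
toy_splits = list(loco_splits(toy_countries, top_n=2))
for cc, train_idx, test_idx in toy_splits:
    print(cc, "train:", len(train_idx), "test:", len(test_idx))
```

Each split fits on all other countries and scores on the held-out one; the LOCO median is then the median of the per-country F1 values.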

The result

Stratified 5-fold F1: 0.719 +/- 0.031 (AUC 0.902 +/- 0.011), versus v3.3 baseline F1 0.729. That is a 1pp regression, well within the 3.1pp standard deviation band, so we cannot claim that v3.3 is statistically better here; the two are effectively tied on this test. The promotion floor for v3.5 was stratified F1 0.78 (v3.3 + 5pp), so TabPFN misses it by about 6pp.

Leave-country-out (LOCO, sampled 30 largest countries) F1: median 0.419, mean 0.464; AUC median 0.791. v3.3's baseline LOCO median F1 is 0.870. That is a 45pp regression on the test that matters most for our use case: transferring to unseen countries. The promotion floor was LOCO median F1 0.85; TabPFN missed it by 43pp.
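The promotion decision reduces to a simple threshold check. The sketch below uses the floors and scores quoted above; the names PROMOTION_FLOORS and gate_report are illustrative and not taken from the repo.

```python
# Floors and candidate scores are copied from the text; names are hypothetical.
PROMOTION_FLOORS = {"stratified_f1": 0.78, "loco_median_f1": 0.85}


def gate_report(candidate, floors=PROMOTION_FLOORS):
    """Return {metric: (passed, shortfall_pp)} for each promotion floor."""
    report = {}
    for metric, floor in floors.items():
        score = candidate[metric]
        # Shortfall in percentage points; zero when the floor is cleared.
        shortfall_pp = round(max(0.0, floor - score) * 100, 1)
        report[metric] = (score >= floor, shortfall_pp)
    return report


tabpfn_v35 = {"stratified_f1": 0.719, "loco_median_f1": 0.419}
report = gate_report(tabpfn_v35)

# A candidate is promoted only if it clears every floor.
promote = all(passed for passed, _ in report.values())
print(report, "promote:", promote)
```

Requiring every floor to pass means a candidate that ties on the stratified test but collapses on LOCO, as here, is rejected.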

Per-country results were bimodal. TabPFN nailed countries where the in-context signal was strong (UA F1 0.979, TT 0.948, US 0.933, IR 0.829, VE 0.788) and collapsed on countries where the signal is weaker or the regime is dissimilar to neighbors (SG 0.000, UZ 0.087, AZ 0.108, JO 0.111, PK 0.195, MY 0.222, MM 0.231, BY 0.235, TH 0.250, DZ 0.250, VN 0.250). v3.3's regime-similarity-weighted contagion features keep its LOCO median much higher (0.870) by smoothing out exactly these regional dependencies.
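A quick way to see the split is to bucket the per-country scores listed above. The sketch below covers only the 16 countries named in this post (not all 30 evaluated), and the 0.75 and 0.25 cut-offs are illustrative choices, not thresholds we used in the evaluation.

```python
# Per-country LOCO F1 values exactly as listed in the text (16 of 30 countries).
loco_f1 = {
    "UA": 0.979, "TT": 0.948, "US": 0.933, "IR": 0.829, "VE": 0.788,
    "SG": 0.000, "UZ": 0.087, "AZ": 0.108, "JO": 0.111, "PK": 0.195,
    "MY": 0.222, "MM": 0.231, "BY": 0.235, "TH": 0.250, "DZ": 0.250,
    "VN": 0.250,
}

# Hypothetical cut-offs, chosen only to illustrate the two clusters.
STRONG_MIN = 0.75
COLLAPSED_MAX = 0.25

strong = sorted(cc for cc, f1 in loco_f1.items() if f1 >= STRONG_MIN)
collapsed = sorted(cc for cc, f1 in loco_f1.items() if f1 <= COLLAPSED_MAX)
middle = sorted(set(loco_f1) - set(strong) - set(collapsed))

print("strong:", strong)
print("collapsed:", collapsed)
print("in between:", middle)
```

Among the listed countries, none falls between the two cut-offs, which is the gap that the word "bimodal" refers to.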

What this tells us

TabPFN is genuinely strong at in-distribution stratified classification on small tabular data — the AUC 0.902 confirms it learned a real signal. But it has no mechanism to extrapolate to unseen distributions, which is what LOCO measures and what we need in production. v3.3's hand-engineered regime-correlation neighbor features (the contagion block added in v3.3) directly encode the cross-country dependency that TabPFN's in-context attention cannot recover from raw features alone on a held-out country.

We are not promoting v3.5. v3.3 stays in production as-is. Per Occam's razor on tied stratified results, the simpler model (GradientBoosting + engineered features) wins, and the LOCO collapse confirms TabPFN is the wrong tool for our cross-country-transfer requirement.

What we shipped

  • Build scripts scripts/train-classifier-tabpfn.py and scripts/finalize-classifier-tabpfn.py (reproducible from the same labeled JSON).
  • API patch scripts/patch-classifier-v3.5-tabpfn-endpoint.py left in repo but not deployed — it would add a side-by-side /v1/classifier/v3.5/score/{cc} endpoint if and only if a future TabPFN candidate ever clears the floors.
  • Archived model bundle and metrics JSON on Vultr at /opt/voidly-ai/models/experimental/censorship_classifier_v3.5_tabpfn_*.pkl and ..._metrics_*.json for reproducibility.

Configuration notes for anyone replicating

We pinned tabpfn==2.2.1, the latest Prior Labs open release before they moved to license-gated API tokens (tabpfn==8.x requires login at ux.priorlabs.ai). The LOCO eval used n_estimators=2 to fit the wall-clock budget; the stratified eval and the final fit used the published default of 4. The eval ran on the same Vultr CPU box that hosts the production classifier (no GPU), in a separate .venv-ml so that the production sklearn 1.7.2 was untouched.
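For replication, the settings above could be wired up roughly as follows. This is a sketch under assumptions rather than an excerpt from our scripts: the stage names and helper functions are hypothetical, and it assumes the TabPFNClassifier constructor in the pinned release accepts the n_estimators and device keyword arguments.

```python
# Ensemble sizes from the text: 2 for LOCO, the default 4 elsewhere.
# Stage names are hypothetical labels for this sketch.
N_ESTIMATORS = {"loco": 2, "stratified": 4, "final": 4}


def estimator_kwargs(stage):
    """Constructor arguments for a given evaluation stage (CPU only, no GPU)."""
    return {"n_estimators": N_ESTIMATORS[stage], "device": "cpu"}


def build_classifier(stage):
    """Build a TabPFN classifier for the given stage.

    The import is done lazily so this module loads even where tabpfn is
    not installed. Assumes: pip install tabpfn==2.2.1 in a separate venv.
    """
    from tabpfn import TabPFNClassifier

    return TabPFNClassifier(**estimator_kwargs(stage))


print(estimator_kwargs("loco"))
print(estimator_kwargs("final"))
```

Keeping the per-stage settings in one table makes it explicit that the LOCO numbers were produced with a smaller ensemble than the stratified numbers.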

Raw data