What we tested
ML research suggested TabPFN-v2 (a pre-trained transformer for tabular classification, Hollmann et al. 2023) might beat our v3.3 GradientBoosting classifier by 5-9pp F1 on our 4,237-sample censorship-incident dataset. Published benchmarks place TabPFN's sweet spot at fewer than 10K rows with a small number of features, which matches our shape (4,237 rows x 16 features). We ran the comparison apples-to-apples: same dataset (labeled_incidents_v3.3.json), same 16-feature input, and the same stratified 5-fold and leave-country-out (LOCO) eval methodology that v3.3 was evaluated under.
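The two evaluation protocols named above can be sketched with scikit-learn. This is a minimal illustration, not the project's eval script: the synthetic data, the placeholder country codes, and the helper names stratified_f1 and loco_f1 are assumptions, and the sampling of the 30 largest countries is left out. Any estimator with fit/predict, including a TabPFN classifier, could be passed in as make_model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold


def stratified_f1(make_model, X, y, n_splits=5, seed=0):
    """Return (mean, std) of F1 over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], pred, zero_division=0))
    return float(np.mean(scores)), float(np.std(scores))


def loco_f1(make_model, X, y, countries):
    """Return per-country F1, holding each country out of training entirely."""
    per_country = {}
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(X, y, groups=countries):
        model = make_model().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        held_out = str(countries[test_idx][0])
        per_country[held_out] = f1_score(y[test_idx], pred, zero_division=0)
    return per_country


# Synthetic stand-in data (assumption): 16 features, 5 placeholder countries.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
countries = rng.choice(np.array(["AA", "BB", "CC", "DD", "EE"]), size=300)


def make_gb():
    # Small stand-in model; not the v3.3 hyperparameters.
    return GradientBoostingClassifier(n_estimators=20, random_state=0)


mean_f1, std_f1 = stratified_f1(make_gb, X, y)
per_country = loco_f1(make_gb, X, y, countries)
loco_median = float(np.median(list(per_country.values())))
print(f"stratified F1 {mean_f1:.3f} +/- {std_f1:.3f}, LOCO median F1 {loco_median:.3f}")
```

The key difference is the splitter: stratified folds mix every country into both train and test, while the leave-one-group-out split guarantees the held-out country contributes no rows to training.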
The result
Stratified 5-fold F1: 0.719 +/- 0.031 (AUC 0.902 +/- 0.011), versus the v3.3 baseline F1 of 0.729. That is a 1pp regression, well within the 3.1pp standard deviation band, so we cannot claim that either model is statistically better here; the two are effectively tied on this test. The promote floor for v3.5 was stratified F1 0.78 (v3.3 + 5pp), so TabPFN misses it by ~6pp.
Leave-country-out (sampled 30 largest countries) F1 median: 0.419, mean 0.464, AUC median 0.791. v3.3's baseline LOCO median F1 is 0.870. That is a 45pp regression on the test that matters most for our use case — transferring to unseen countries. The promote floor was LOCO median 0.85; TabPFN missed it by 43pp.
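The gaps quoted in the two paragraphs above follow from simple subtraction. Here is a small sketch of that arithmetic using only the numbers reported in this post; the variable and function names are illustrative and not taken from the project's tooling.

```python
# Reported numbers from this write-up.
BASELINE = {"strat_f1": 0.729, "loco_median_f1": 0.870}   # v3.3
CANDIDATE = {"strat_f1": 0.719, "loco_median_f1": 0.419}  # TabPFN (v3.5 candidate)
FLOORS = {"strat_f1": 0.78, "loco_median_f1": 0.85}       # promote floors


def pp(delta):
    """Convert a difference in F1 to percentage points, rounded to 0.1pp."""
    return round(delta * 100, 1)


gaps = {
    metric: {
        "vs_baseline_pp": pp(CANDIDATE[metric] - BASELINE[metric]),
        "vs_floor_pp": pp(CANDIDATE[metric] - FLOORS[metric]),
    }
    for metric in CANDIDATE
}

# A candidate is promotable only if it clears every floor.
promote = all(CANDIDATE[m] >= FLOORS[m] for m in FLOORS)

for metric, g in gaps.items():
    print(metric, g)
print("promote:", promote)
```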
Per-country results were bimodal. TabPFN nailed countries where the in-context signal was strong (UA F1 0.979, TT 0.948, US 0.933, IR 0.829, VE 0.788) and collapsed on countries where the signal is weaker or the regime is dissimilar to neighbors (SG 0.000, UZ 0.087, AZ 0.108, JO 0.111, PK 0.195, MY 0.222, MM 0.231, BY 0.235, TH 0.250, DZ 0.250, VN 0.250). v3.3's regime-similarity-weighted contagion features keep its LOCO median much higher (0.870) by smoothing out exactly these regional dependencies.
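The bimodal shape is visible even in the subset of countries quoted above. The sketch below splits those 16 values at 0.5, which is an illustrative cut point and not a project threshold. Because only 16 of the 30 evaluated countries are listed here, the summary statistics it prints will not match the full LOCO median and mean.

```python
from statistics import median

# Per-country LOCO F1 values quoted in the text (16 of the 30 evaluated countries).
quoted_f1 = {
    "UA": 0.979, "TT": 0.948, "US": 0.933, "IR": 0.829, "VE": 0.788,
    "SG": 0.000, "UZ": 0.087, "AZ": 0.108, "JO": 0.111, "PK": 0.195,
    "MY": 0.222, "MM": 0.231, "BY": 0.235, "TH": 0.250, "DZ": 0.250,
    "VN": 0.250,
}

SPLIT = 0.5  # illustrative cut point (assumption)

strong = {cc: f1 for cc, f1 in quoted_f1.items() if f1 >= SPLIT}
weak = {cc: f1 for cc, f1 in quoted_f1.items() if f1 < SPLIT}

# Width of the empty band between the two clusters.
empty_band = min(strong.values()) - max(weak.values())

print(f"strong: {len(strong)} countries, median F1 {median(strong.values()):.3f}")
print(f"weak:   {len(weak)} countries, median F1 {median(weak.values()):.3f}")
print(f"no quoted country falls in a band {empty_band:.3f} wide")
```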
What this tells us
TabPFN is genuinely strong at in-distribution stratified classification on small tabular data — the AUC 0.902 confirms it learned a real signal. But it has no mechanism to extrapolate to unseen distributions, which is what LOCO measures and what we need in production. v3.3's hand-engineered regime-correlation neighbor features (the contagion block added in v3.3) directly encode the cross-country dependency that TabPFN's in-context attention cannot recover from raw features alone on a held-out country.
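To make the idea of a regime-similarity-weighted neighbor feature concrete, here is one way such a feature could be computed. This is a hypothetical sketch: the actual v3.3 contagion block is not shown in this write-up, and the function name, the similarity scores, the incident rates, and the placeholder country codes below are all invented for illustration.

```python
def contagion_feature(target, similarity, incident_rate):
    """Similarity-weighted mean of other countries' incident rates.

    similarity: dict mapping (target, neighbor) -> weight in [0, 1]
    incident_rate: dict mapping neighbor -> recent incident rate
    The target's own rate is never used, so the feature can still be
    computed for a country with no labeled rows in the training set.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (src, neighbor), weight in similarity.items():
        if src != target or neighbor == target:
            continue
        if neighbor not in incident_rate:
            continue
        weighted_sum += weight * incident_rate[neighbor]
        weight_total += weight
    # Fall back to 0.0 when the target has no usable neighbors.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Invented example values with placeholder country codes.
similarity = {("XA", "XB"): 0.9, ("XA", "XC"): 0.1}
incident_rate = {"XB": 0.5, "XC": 0.1}

feature_xa = contagion_feature("XA", similarity, incident_rate)
print(f"contagion feature for XA: {feature_xa:.2f}")
```

The point of the sketch is that the feature is built entirely from other countries' data, so a held-out country still receives an informative input at prediction time.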
We are not promoting v3.5. v3.3 stays in production as-is. Per Occam's razor on tied stratified results, the simpler model (GradientBoosting + engineered features) wins, and the LOCO collapse confirms TabPFN is the wrong tool for our cross-country-transfer requirement.
What we shipped
- Build script: scripts/train-classifier-tabpfn.py and scripts/finalize-classifier-tabpfn.py (reproducible from the same labeled JSON).
- API patch: scripts/patch-classifier-v3.5-tabpfn-endpoint.py, left in repo but not deployed. It would add a side-by-side /v1/classifier/v3.5/score/{cc} endpoint if and only if a future TabPFN candidate ever clears the floors.
- Archived model bundle and metrics JSON on Vultr at /opt/voidly-ai/models/experimental/censorship_classifier_v3.5_tabpfn_*.pkl and ..._metrics_*.json for reproducibility.
Configuration notes for anyone replicating
We pinned tabpfn==2.2.1, the latest Prior Labs open release before they moved to license-gated API tokens (tabpfn==8.x requires login at ux.priorlabs.ai). LOCO eval used n_estimators=2 to fit the wall-clock budget; the stratified eval and final fit used the published default of 4. Eval ran on the same Vultr CPU box that hosts the production classifier (no GPU), in a separate .venv-ml so the production sklearn 1.7.2 was untouched.
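For anyone replicating, the settings above could be captured in one place along the following lines. This is a sketch, not the contents of the shipped build scripts: EVAL_CONFIG and make_tabpfn are hypothetical names, and the constructor call assumes the pinned tabpfn release exposes TabPFNClassifier with n_estimators and device keyword arguments, which should be checked against the installed version.

```python
# Settings taken from the notes above; the dict layout is an assumption.
EVAL_CONFIG = {
    "tabpfn_pin": "tabpfn==2.2.1",
    "n_estimators": {"loco": 2, "stratified": 4, "final_fit": 4},
    "device": "cpu",  # the eval box has no GPU
}


def make_tabpfn(stage):
    """Build a classifier for one eval stage ('loco', 'stratified', 'final_fit')."""
    # Imported lazily so this module loads even where tabpfn is not installed,
    # for example in the production environment.
    from tabpfn import TabPFNClassifier

    return TabPFNClassifier(
        n_estimators=EVAL_CONFIG["n_estimators"][stage],
        device=EVAL_CONFIG["device"],
    )


print(EVAL_CONFIG["n_estimators"])
```

Halving n_estimators for LOCO trades some ensemble averaging for wall-clock time, so the LOCO numbers were produced with a smaller ensemble than the stratified ones.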