Overview

ReconEval measures how well a latent representation reconstructs the gene-expression matrix it summarises. The benchmark covers three tasks (Fig 1c) on three datasets, scored with the same metric set.

Tasks

End-to-end reconstruction. A single model (PCA, AE, scVI, nlscVI, or mlscVI) encodes expression into a latent space and decodes it back. The latent-dimension grid is {10, 32, 128, 512, 2048}. Drivers: experiments/01_end_to_end/.
Foundation-model reconstruction. A frozen FM (SE, scGPT, scConcept, SCimilarity) produces per-cell embeddings; a downstream MLP, Transformer, or KNN decoder maps them back to expression. Drivers: experiments/02_foundation_model/.
Latent-shift reconstruction. Given a control cell’s latent state and a perturbation covariate, predict the post-perturbation latent state and decode it. Two methods: CellFlow (JAX flow matching) and STATE (PyTorch transformer over cell sets). Drivers: experiments/03_latent_shift/.
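The encode-then-decode loop behind the end-to-end task can be illustrated with a minimal NumPy sketch. It uses plain PCA via SVD on toy data and is not taken from the drivers in experiments/01_end_to_end/; the function name `pca_reconstruct`, the toy matrix, and the reduced latent grid are assumptions made for illustration only.

```python
import numpy as np

def pca_reconstruct(X, n_latent):
    """Hypothetical helper: encode X (cells x genes) to n_latent PCs, then decode."""
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    # Thin SVD: rows of Vt are principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_latent]          # shape (n_latent, genes)
    Z = Xc @ components.T               # encode: cells x n_latent
    X_hat = Z @ components + mean       # decode: cells x genes
    return Z, X_hat

# Toy log-normalised count matrix (assumption: 200 cells, 50 genes).
rng = np.random.default_rng(0)
X = np.log1p(rng.poisson(2.0, size=(200, 50)).astype(float))

# Toy latent grid; the benchmark grid goes up to 2048, which needs more genes.
errors = {}
for k in (2, 10, 32):
    _, X_hat = pca_reconstruct(X, k)
    errors[k] = float(np.mean((X - X_hat) ** 2))
```

Reconstruction error shrinks as the latent size grows, which is the trade-off a latent grid is meant to expose.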

Datasets

Dataset	Scope	Source
Tahoe-100M	1,137 drugs × 50 cell lines	Arc Institute / Vevo
PBMC-10M	90 cytokines × 12 donors	Parse Bio
LuCA	6 tissues, 4 diseases	Human Lung Cancer Atlas

Out-of-distribution splits

Each dataset has three OOD splits, which hold out a cell type (or cell line), a perturbation, or a condition. The split assignments live under data/reconstruction/<dataset>/split0X/.
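As a rough illustration of what holding out a category means, the sketch below builds train and test masks from a per-cell label vector. The helper `holdout_split` and the labels are hypothetical; the actual assignments are the ones stored in the split directories, whose file format is not described here.

```python
import numpy as np

def holdout_split(labels, held_out):
    """Hypothetical helper: test set = every cell whose label is held out."""
    labels = np.asarray(labels)
    test_mask = np.isin(labels, list(held_out))
    return ~test_mask, test_mask

# Toy per-cell labels (made-up names, not taken from any benchmark dataset).
cell_line = np.array(["line_a", "line_a", "line_b", "line_c", "line_c", "line_b"])
train_mask, test_mask = holdout_split(cell_line, held_out={"line_c"})

train_labels = set(cell_line[train_mask])
test_labels = set(cell_line[test_mask])
```

The held-out category never appears in training, so the test score reflects generalisation to an unseen cell line, perturbation, or condition rather than interpolation.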

Metric families

The sc_reconstruction.metrics API groups metrics into three families (Fig 2 / Fig 3):

Statistical: metric_r2(), metric_mse(), metric_energy_distance().
Biological: metric_cellcycle(), metric_pathway(), metric_coexpression(), metric_deg(), metric_cytokine().
Perturbational: metric_knn_purity().
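To make the statistical family concrete, here is a minimal NumPy sketch of the three quantities it names. These are textbook definitions, not the sc_reconstruction.metrics implementations: the `*_sketch` function names, the choice to compute R² over the flattened matrix, and the biased (V-statistic) energy-distance estimator are all assumptions.

```python
import numpy as np

def mse_sketch(X_true, X_pred):
    """Mean squared error over all cells and genes."""
    return float(np.mean((X_true - X_pred) ** 2))

def r2_sketch(X_true, X_pred):
    """Coefficient of determination over the flattened matrix (assumption)."""
    ss_res = np.sum((X_true - X_pred) ** 2)
    ss_tot = np.sum((X_true - X_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def _mean_pairwise_dist(A, B):
    """Mean Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

def energy_distance_sketch(X, Y):
    """Energy distance 2 E|X-Y| - E|X-X'| - E|Y-Y'| (biased estimator)."""
    return float(2.0 * _mean_pairwise_dist(X, Y)
                 - _mean_pairwise_dist(X, X)
                 - _mean_pairwise_dist(Y, Y))

# Toy data: a "true" matrix and a noisy reconstruction of it.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(100, 20))
X_pred = X_true + 0.1 * rng.normal(size=X_true.shape)

scores = {
    "r2": r2_sketch(X_true, X_pred),
    "mse": mse_sketch(X_true, X_pred),
    "energy_distance": energy_distance_sketch(X_true, X_pred),
}
```

R² and MSE compare matched cells entry by entry, whereas energy distance compares the two cell populations as distributions and does not require a cell-to-cell pairing.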

compute_all_metrics() runs all of them. aggregate_rank_percentile() produces the rank-percentile table, and funky_heatmap() renders it as the Fig 3 summary plot.
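One plausible way to build a rank-percentile table is sketched below. It is not the aggregate_rank_percentile() implementation, whose signature and tie handling are not documented here; the function `rank_percentile_sketch`, the method-by-metric layout, and the toy scores are assumptions.

```python
import numpy as np

def rank_percentile_sketch(scores, higher_is_better):
    """Hypothetical helper: per-metric rank percentile in [0, 1], 1 = best.

    scores: (n_methods, n_metrics) array.
    higher_is_better: (n_metrics,) boolean array.
    Ties are broken by position, which a real implementation may treat differently.
    """
    scores = np.asarray(scores, dtype=float)
    # Flip lower-is-better metrics so that larger always means better.
    oriented = np.where(higher_is_better, scores, -scores)
    # Double argsort turns values into 0-based ranks within each column.
    ranks = oriented.argsort(axis=0).argsort(axis=0)
    return ranks / (scores.shape[0] - 1)

# Toy table: 3 methods scored on R² (higher is better) and MSE (lower is better).
toy_scores = np.array([[0.9, 0.10],
                       [0.7, 0.30],
                       [0.8, 0.20]])
pct = rank_percentile_sketch(toy_scores, np.array([True, False]))
overall = pct.mean(axis=1)   # one summary value per method
```

Converting each metric to a rank percentile puts metrics with different units and directions on a common 0 to 1 scale before they are averaged or plotted side by side.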