TapSampling

Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation

ICML 2026

Sizhe Zhao¹, Shengping Zhang¹,² ✉, Shuo Yang¹, Weiyu Zhao¹, Shuigen Wang³, Xiangyang Ji⁴

¹ Harbin Institute of Technology; ² Harbin Institute of Technology (Weihai) Qingdao Research Institute; ³ Iray Technology Co., Ltd.; ⁴ Tsinghua University

✉ Corresponding author

Teaser figure: TapSampling, an inference-time sampling framework, extends the typical single-shot inference pipeline with multi-sample generation and task-progress-guided verification.

Abstract

Existing embodied control research achieves remarkable performance gains by scaling training data and model size. We instead explore inference-time strategies as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, are widely adopted in embodied control, but the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space. The Action-VAE maps initial actions from a policy into a compressed posterior distribution, from which an arbitrary number of latent samples can be drawn and decoded into candidate actions that approximately follow the true action distribution. Second, we formulate action verification as task-progress outcome prediction and train the verifier by leveraging the intrinsic sequential information of robotic datasets. The predicted scores have clear semantic grounding, which enables interpretable action selection. Furthermore, TapSampling is policy-agnostic. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further fine-tuning the policy models.

Framework

Action Sampling

A small set of actions is sampled from the policy, then encoded and mixed into a single compressed latent distribution by the Action-VAE encoder. Multiple latent samples are then drawn from the learned posterior and decoded, which yields diverse, high-quality action candidates at low cost.
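
The data flow described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_encode` and `toy_decode` are hypothetical linear stand-ins for the learned Action-VAE networks, the 4-D action and 2-D latent shapes are arbitrary, and the mixing rule (moment-matching the per-action Gaussians with one Gaussian) is one plausible choice rather than the rule used by TapSampling.

```python
import random
from statistics import fmean

# Hypothetical stand-ins for the Action-VAE: the real encoder and decoder are
# learned networks; these toy linear maps only illustrate the data flow.
def toy_encode(action):
    """Map a 4-D action to a 2-D latent mean and standard deviation (assumed shapes)."""
    mean = [0.5 * (action[0] + action[1]), 0.5 * (action[2] + action[3])]
    std = [0.1, 0.1]
    return mean, std

def toy_decode(latent):
    """Map a 2-D latent back to a 4-D action."""
    return [latent[0], latent[0], latent[1], latent[1]]

def mix_posteriors(posteriors):
    """Assumed mixing rule: moment-match the per-action Gaussians with one Gaussian."""
    dims = len(posteriors[0][0])
    mixed_mean, mixed_std = [], []
    for d in range(dims):
        means = [m[d] for m, _ in posteriors]
        mu = fmean(means)
        # Total variance = mean within-component variance + variance of the means.
        within = fmean([s[d] ** 2 for _, s in posteriors])
        between = fmean([(x - mu) ** 2 for x in means])
        mixed_mean.append(mu)
        mixed_std.append((within + between) ** 0.5)
    return mixed_mean, mixed_std

def sample_candidates(initial_actions, num_candidates, rng):
    """Encode a few policy actions, mix them, then draw and decode many latents."""
    posteriors = [toy_encode(a) for a in initial_actions]
    mean, std = mix_posteriors(posteriors)
    candidates = []
    for _ in range(num_candidates):
        latent = [rng.gauss(m, s) for m, s in zip(mean, std)]
        candidates.append(toy_decode(latent))
    return candidates

rng = random.Random(0)  # fixed seed so the sketch is reproducible
initial_actions = [[0.1, 0.1, 0.4, 0.4], [0.3, 0.3, 0.2, 0.2]]
candidates = sample_candidates(initial_actions, num_candidates=16, rng=rng)
```

The point of the sketch is the cost asymmetry: the policy is queried only for the two initial actions, while the sixteen candidates come from cheap latent sampling and decoding.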

Action Verification

Positive and negative training examples are constructed automatically from expert trajectories using their intrinsic sequential information. A verifier is then trained to predict task-progress changes, which enables interpretable action selection.
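
A rough sketch of how such examples and the selection step might look is given below. It rests on loudly stated assumptions that are not taken from the paper: progress is assumed to grow linearly along an expert trajectory, a forward jump of `horizon` steps is treated as a positive example and a backward jump as a negative one, and `toy_verifier` is a hypothetical scoring function standing in for the trained verifier.

```python
def build_progress_examples(trajectory_length, horizon):
    """Return (start_step, end_step, progress_change) triples.

    Assumption: progress grows linearly along an expert trajectory, so moving
    forward by `horizon` steps is a positive example and moving backward by
    `horizon` steps is a negative one. The paper's construction may differ.
    """
    if trajectory_length < 2:
        raise ValueError("need at least two steps")
    scale = 1.0 / (trajectory_length - 1)
    examples = []
    for t in range(trajectory_length):
        if t + horizon < trajectory_length:
            examples.append((t, t + horizon, horizon * scale))   # positive
        if t - horizon >= 0:
            examples.append((t, t - horizon, -horizon * scale))  # negative
    return examples

def select_action(candidates, score_fn):
    """Pick the candidate whose predicted progress change is largest."""
    scores = [score_fn(a) for a in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]

# Hypothetical verifier: scores a 1-D action by how close it gets to a goal at 1.0.
def toy_verifier(action):
    return -abs(1.0 - action)

examples = build_progress_examples(trajectory_length=5, horizon=2)
best_action, best_score = select_action([0.2, 0.9, 1.4], toy_verifier)
```

Because the score is a predicted change in task progress, the selected action comes with a number that can be read directly, which is the sense in which the selection is interpretable.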

Performance

Main results on the CALVIN ABC→D benchmark. TapSampling significantly improves the task success rate and the average success length of representative non-deterministic policies (Diffusion Policy, OpenVLA, and VPP) in a plug-and-play manner without further fine-tuning these policies.

Success rate (%) on the i-th consecutive task (columns 1 to 5) and average success length (Avg. Len., higher is better)
Method	1	2	3	4	5	Avg. Len. ↑
Robo-Flamingo	82.4	61.9	46.6	33.1	23.5	2.48
RoboDual	94.4	82.7	72.1	62.4	54.4	3.66
ReconVLA	95.6	87.6	76.9	69.3	64.1	3.95
Seer	96.3	91.6	86.1	80.3	74.0	4.28
DreamVLA	98.2	94.6	89.5	83.4	78.1	4.44
Diffusion Policy	82.1	61.7	45.6	31.4	20.5	2.41
+ TapSampling	83.9 (+1.8)	65.1 (+3.4)	48.8 (+3.2)	35.7 (+4.3)	24.7 (+4.2)	2.58 (+0.17)
OpenVLA	93.4	78.2	64.1	52.2	42.4	3.30
+ TapSampling	94.5 (+1.1)	80.9 (+2.7)	68.6 (+4.5)	57.7 (+5.5)	48.8 (+6.4)	3.51 (+0.21)
VPP	96.4	92.3	88.4	84.0	78.3	4.39
+ TapSampling	96.5 (+0.1)	92.9 (+0.6)	89.4 (+1.0)	86.4 (+2.4)	81.1 (+2.8)	4.46 (+0.07)
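
As a reading aid for the table, the sketch below recomputes the average success length from the five per-task success rates. It assumes the standard CALVIN convention that the i-th rate is the fraction of rollouts completing at least i tasks in a row, so the average length is the sum of the five rates expressed as fractions. Published values can differ slightly because of rounding. The helper name `average_length` is introduced here for illustration only.

```python
def average_length(success_rates_percent):
    """Expected number of consecutive tasks completed, from per-task rates in %."""
    return sum(success_rates_percent) / 100.0

# Rows copied from the table above.
diffusion_policy = [82.1, 61.7, 45.6, 31.4, 20.5]
with_tapsampling = [83.9, 65.1, 48.8, 35.7, 24.7]

base_len = average_length(diffusion_policy)   # about 2.41
tap_len = average_length(with_tapsampling)    # about 2.58
gain = tap_len - base_len                     # about 0.17
```

The later tasks contribute most of the gain in this row, which matches the larger per-task improvements reported for tasks 4 and 5.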

BibTeX

@inproceedings{zhao2026tapsampling,
  title={{T}ap{S}ampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation},
  author={Sizhe Zhao and Shengping Zhang and Shuo Yang and Weiyu Zhao and Shuigen Wang and Xiangyang Ji},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026}
}