Figure 1: Guided search in autoregressive models provides a more efficient path to high-quality image generation than scaling diffusion models. (Left) A 2B autoregressive model with beam search (green) surpasses a 12B FLUX.1-dev model using random search (orange), while requiring less computation. (Right) Examples showing how beam search corrects compositional errors in baseline generations.
While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best.
We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
- Larger performance gains than search strategies applied to diffusion models.
- Fewer function evaluations needed to reach superior results.
- A smaller model (2B) with search beats a 12B parameter model.
- **Random search:** generates N independent images and selects the best one. Maximizes diversity but is computationally inefficient.
- **Greedy token optimization:** at each generation scale, greedily commits to the single best token. Efficient but can get stuck in local optima.
- **Beam search:** maintains multiple parallel generation paths (beams), balancing exploration with computational efficiency. This is our best-performing method (see the sketch below).
Figure 2: Search strategies for autoregressive image generation. (a) Random search generates images independently. (b) Greedy token optimization commits to the single best token at each step. (c) Beam search maintains multiple parallel sequences.
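To make the procedure in (c) concrete, here is a minimal Python sketch of scale-wise beam search guided by a verifier. The helpers `model.sample_next_scale`, `model.decode`, and `verifier.score` are hypothetical stand-ins rather than the paper's actual API; the sketch only illustrates the expand-score-prune loop, not the exact expansion or scoring scheme used in our experiments.

```python
import heapq

def beam_search_generate(model, verifier, prompt, num_scales, beam_width=4, expansions=4):
    """Minimal beam-search sketch for scale-wise autoregressive image generation.

    Hypothetical interfaces (stand-ins, not the paper's actual API):
      - model.sample_next_scale(prompt, seq): samples one candidate token map
        for the next scale and returns the extended sequence.
      - model.decode(seq): renders a (partial) token sequence into an image.
      - verifier.score(prompt, image): returns a scalar text-image alignment score.
    """
    beams = [([], 0.0)]  # each beam is (token sequence so far, verifier score)

    for scale in range(num_scales):
        candidates = []
        for seq, _ in beams:
            # Expand each beam with several stochastic continuations at this scale.
            for _ in range(expansions):
                new_seq = model.sample_next_scale(prompt, seq)
                image = model.decode(new_seq)          # partial decode at the current scale
                score = verifier.score(prompt, image)  # text-image alignment reward
                candidates.append((new_seq, score))
        # Prune: keep only the top-`beam_width` partial generations.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])

    # Return the highest-scoring complete generation.
    best_seq, best_score = max(beams, key=lambda b: b[1])
    return model.decode(best_seq), best_score
```

With `beam_width = expansions = 1` this reduces to greedy token optimization (b), and scoring only fully decoded images with no pruning recovers random search (a).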
On the T2I-CompBench++ benchmark, our 2B autoregressive model with beam search consistently outperforms the 12B FLUX.1-dev diffusion model, demonstrating the power of architecture over raw parameter count for compositional tasks.
| Method | Model | Color | Shape | Texture | Spatial | Numeracy | Complex |
|---|---|---|---|---|---|---|---|
| Ma et al. + Search | FLUX.1-dev (12B) | 0.82 | 0.60 | 0.72 | 0.30 | 0.66 | 0.38 |
| Ours + Beam search | Infinity-2B (2B) | 0.83 | 0.64 | 0.76 | 0.36 | 0.67 | 0.42 |
Beam search demonstrates substantial improvements across all GenEval categories, particularly in multi-object composition (+0.19), counting (+0.25), and spatial positioning (+0.26).
| Method | One Obj. | Two Obj. | Count | Colors | Position | Color Attr. | Overall |
|---|---|---|---|---|---|---|---|
| Baseline | 1.00 | 0.78 | 0.60 | 0.85 | 0.25 | 0.55 | 0.67 |
| Beam Search | 1.00 | 0.97 | 0.85 | 0.90 | 0.51 | 0.74 | 0.83 |
| Improvement | 0.00 | +0.19 | +0.25 | +0.05 | +0.26 | +0.19 | +0.16 |

Figure 4: Qualitative comparison of verifiers on T2I-CompBench++. For attribute binding (e.g., Color), both verifiers perform well. For complex reasoning (e.g., Spatial), LLaVA-OneVision's capabilities are required to guide the model to a semantically correct image, correcting errors that ImageReward misses.
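As a concrete illustration of verifier-guided selection, the snippet below ranks candidate images with the open-source ImageReward package. It is a sketch based on the library's published `RM.load` / `score` interface; the checkpoint name and exact call signature should be checked against the current release, and a LLaVA-OneVision verifier would follow an analogous pattern with a VQA-style scoring prompt.

```python
import ImageReward as RM  # pip install image-reward

def pick_best_candidate(prompt, image_paths):
    """Rank candidate generations with ImageReward and return the best one.

    Sketch only: checkpoint name and call signature follow the public
    ImageReward README and may differ across versions.
    """
    reward_model = RM.load("ImageReward-v1.0")
    scores = [reward_model.score(prompt, path) for path in image_paths]
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return image_paths[best_idx], scores[best_idx]

# Example usage with hypothetical file names:
# best_path, best_score = pick_best_candidate(
#     "a red cube to the left of a blue sphere",
#     ["cand_0.png", "cand_1.png", "cand_2.png", "cand_3.png"],
# )
```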
@article{Riise2025VisualAR,
  title   = {Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling},
  author  = {Erik Riise and Mehmet Onurcan Kaya and Dim P. Papadopoulos},
  journal = {arXiv preprint arXiv:2510.16751},
  year    = {2025},
  url     = {https://erir11.github.io/visual-autoregressive-search/}
}