Figure 1: Guided search in autoregressive models provides a more efficient path to high-quality image generation than scaling diffusion models. (Left) A 2B autoregressive model with beam search (green) surpasses a 12B FLUX.1-dev model using random search (orange), while requiring less computation. (Right) Examples showing how beam search corrects compositional errors in baseline generations.
While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best.
We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.
- Larger performance gains than search strategies applied to diffusion models.
- Fewer function evaluations needed to reach superior results.
- A smaller model (2B) with search beats a 12B parameter model.
- **Random search:** generates N independent images and selects the best one. Maximizes diversity but is computationally inefficient.
- **Greedy token optimization:** at each generation scale, greedily commits to the single best token. Efficient but can get stuck in local optima.
- **Beam search:** maintains multiple parallel generation paths (beams), balancing exploration with computational efficiency. This is our best-performing method (see the sketch below).
Figure 2: Search strategies for autoregressive image generation. (a) Random search generates images independently. (b) Greedy token optimization commits to the single best token at each step. (c) Beam search maintains multiple parallel sequences.
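To make the procedure in (c) concrete, here is a minimal Python sketch of scale-wise beam search guided by a verifier. The helpers `model.sample_next_scale`, `model.decode`, and `verifier.score` are hypothetical stand-ins rather than the paper's actual API; the sketch only illustrates the expand-score-prune loop, not the exact expansion or scoring scheme used in our experiments.

```python
import heapq

def beam_search_generate(model, verifier, prompt, num_scales, beam_width=4, expansions=4):
    """Minimal beam-search sketch for scale-wise autoregressive image generation.

    Hypothetical interfaces (stand-ins, not the paper's actual API):
      - model.sample_next_scale(prompt, seq): samples one candidate token map
        for the next scale and returns the extended sequence.
      - model.decode(seq): renders a (partial) token sequence into an image.
      - verifier.score(prompt, image): returns a scalar text-image alignment score.
    """
    beams = [([], 0.0)]  # each beam is (token sequence so far, verifier score)

    for scale in range(num_scales):
        candidates = []
        for seq, _ in beams:
            # Expand each beam with several stochastic continuations at this scale.
            for _ in range(expansions):
                new_seq = model.sample_next_scale(prompt, seq)
                image = model.decode(new_seq)          # partial decode at the current scale
                score = verifier.score(prompt, image)  # text-image alignment reward
                candidates.append((new_seq, score))
        # Prune: keep only the top-`beam_width` partial generations.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])

    # Return the highest-scoring complete generation.
    best_seq, best_score = max(beams, key=lambda b: b[1])
    return model.decode(best_seq), best_score
```

With `beam_width = expansions = 1` this reduces to greedy token optimization (b), and scoring only fully decoded images with no pruning recovers random search (a).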
On the T2I-CompBench++ benchmark, our 2B autoregressive model with beam search consistently outperforms the 12B FLUX.1-dev diffusion model, demonstrating the power of architecture over raw parameter count for compositional tasks.
| Method | Model | Color | Shape | Texture | Spatial | Numeracy | Complex |
|---|---|---|---|---|---|---|---|
| Ma et al. + Search | FLUX.1-dev (12B) | 0.82 | 0.60 | 0.72 | 0.30 | 0.66 | 0.38 |
| Ours + Beam search | Infinity-2B (2B) | 0.83 | 0.64 | 0.76 | 0.36 | 0.67 | 0.42 |
Beam search demonstrates substantial improvements across all GenEval categories, particularly in multi-object composition (+0.19), counting (+0.25), and spatial positioning (+0.26).
| Method | One Obj. | Two Obj. | Count | Colors | Position | Color Attr. | Overall |
|---|---|---|---|---|---|---|---|
| Baseline | 1.00 | 0.78 | 0.60 | 0.85 | 0.25 | 0.55 | 0.67 |
| Beam Search | 1.00 | 0.97 | 0.85 | 0.90 | 0.51 | 0.74 | 0.83 |
| Improvement | 0.00 | +0.19 | +0.25 | +0.05 | +0.26 | +0.19 | +0.16 |

Figure 4: Qualitative comparison of verifiers on T2I-CompBench++. For attribute binding (e.g., Color), both verifiers perform well. For complex reasoning (e.g., Spatial), LLaVA-OneVision's capabilities are required to guide the model to a semantically correct image, correcting errors that ImageReward misses.
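As a concrete illustration of verifier-guided selection, the snippet below ranks candidate images with the open-source ImageReward package. It is a sketch based on the library's published `RM.load` / `score` interface; the checkpoint name and exact call signature should be checked against the current release, and a LLaVA-OneVision verifier would follow an analogous pattern with a VQA-style scoring prompt.

```python
import ImageReward as RM  # pip install image-reward

def pick_best_candidate(prompt, image_paths):
    """Rank candidate generations with ImageReward and return the best one.

    Sketch only: checkpoint name and call signature follow the public
    ImageReward README and may differ across versions.
    """
    reward_model = RM.load("ImageReward-v1.0")
    scores = [reward_model.score(prompt, path) for path in image_paths]
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return image_paths[best_idx], scores[best_idx]

# Example usage with hypothetical file names:
# best_path, best_score = pick_best_candidate(
#     "a red cube to the left of a blue sphere",
#     ["cand_0.png", "cand_1.png", "cand_2.png", "cand_3.png"],
# )
```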
@article{Riise2025VisualAR,
  title   = {Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling},
  author  = {Erik Riise and Mehmet Onurcan Kaya and Dim P. Papadopoulos},
  journal = {arXiv preprint arXiv:2510.16751},
  year    = {2025},
  url     = {https://erir11.github.io/visual-autoregressive-search/}
}