Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling

¹Technical University of Denmark, ²Pioneer Center for AI


Figure 1: Guided search in autoregressive models provides a more efficient path to high-quality image generation than scaling diffusion models. (Left) A 2B autoregressive model with beam search (green) surpasses a 12B FLUX.1-dev model using random search (orange), while requiring less computation. (Right) Examples showing how beam search corrects compositional errors in baseline generations.

Abstract

While inference-time scaling through search has revolutionized Large Language Models, translating these gains to image generation has proven difficult. Recent attempts to apply search strategies to continuous diffusion models show limited benefits, with simple random sampling often performing best.

We demonstrate that the discrete, sequential nature of visual autoregressive models enables effective search for image generation. We show that beam search substantially improves text-to-image generation, enabling a 2B parameter autoregressive model to outperform a 12B parameter diffusion model across benchmarks. These findings suggest that model architecture, not just scale, is critical for inference-time optimization in visual generation.

2x larger performance gains than diffusion model search strategies.

46% fewer function evaluations needed to achieve superior results.

6x smaller model (2B) with search beats a 12B parameter model.

Method Overview

Random Search

Generates N independent images and selects the best one. Maximizes diversity but is computationally inefficient.
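As a minimal sketch of the best-of-N idea described above, the following toy example draws N independent samples and keeps the highest-scoring one. The functions `generate_image` and `verifier_score` are hypothetical stand-ins (a random token list and a simple sum), not the model or verifier used in the paper.

```python
import random

def generate_image(prompt, rng):
    # Hypothetical stand-in for one full model sample: a short list of token ids.
    return [rng.randrange(16) for _ in range(8)]

def verifier_score(prompt, image):
    # Hypothetical stand-in for a learned verifier; returns a value in [0, 1].
    return sum(image) / (15 * len(image))

def random_search(prompt, n, seed=0):
    """Draw n independent samples and keep the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [generate_image(prompt, rng) for _ in range(n)]
    scores = [verifier_score(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]

image, score = random_search("five candles", n=8)
print(score)
```

Every candidate is a complete generation, so the cost grows linearly with N even when most samples are discarded.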

Greedy Token Optimization

At each generation scale, greedily commits to the single best token. Efficient but can get stuck in local optima.
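The greedy strategy above might be sketched as follows, assuming the model can propose `k` candidate tokens per scale and a scorer can rate a partial sequence. Both `propose` and `score` are hypothetical callables introduced for illustration; the toy versions below only exist to make the loop runnable.

```python
def greedy_search(propose, score, num_scales, k):
    """At each scale, score k candidates and commit to the single best one."""
    tokens = []
    for _ in range(num_scales):
        candidates = propose(tokens, k)
        # Keep only the candidate whose extended prefix scores highest.
        best = max(candidates, key=lambda c: score(tokens + [c]))
        tokens.append(best)
    return tokens

def toy_propose(prefix, k):
    # Hypothetical proposal rule: k consecutive ids starting at the prefix length.
    return [len(prefix) + j for j in range(k)]

def toy_score(tokens):
    # Hypothetical verifier: larger token ids count as "better".
    return sum(tokens)

result = greedy_search(toy_propose, toy_score, num_scales=3, k=3)
print(result)  # [2, 3, 4]
```

Because only one prefix survives each step, an early choice that looks best locally can never be revisited.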

Beam Search

Maintains multiple parallel generation paths (beams), balancing exploration with computational efficiency. This is our best-performing method.
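A minimal beam search sketch, under the assumption that partial generations can be scored by a verifier. The names `propose`, `score`, and the `TOY_SCORES` table are hypothetical and chosen so that a beam of width 1 (equivalent to greedy selection) ends in a local optimum while a width of 2 recovers the better sequence.

```python
def beam_search(propose, score, num_scales, beam_width, k):
    """Keep the beam_width best partial sequences at every scale."""
    beams = [[]]
    for _ in range(num_scales):
        # Expand every surviving beam with k candidate continuations.
        expanded = [beam + [c] for beam in beams for c in propose(beam, k)]
        expanded.sort(key=score, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]

# Hypothetical verifier scores for every sequence of length 1 or 2.
TOY_SCORES = {
    (0,): 0.4, (1,): 0.5,
    (0, 0): 0.3, (0, 1): 0.9, (1, 0): 0.5, (1, 1): 0.6,
}

def toy_propose(prefix, k):
    return list(range(k))

def toy_score(tokens):
    return TOY_SCORES[tuple(tokens)]

narrow = beam_search(toy_propose, toy_score, num_scales=2, beam_width=1, k=2)
wide = beam_search(toy_propose, toy_score, num_scales=2, beam_width=2, k=2)
print(narrow, wide)  # [1, 1] [0, 1]
```

In this toy setup the first token 0 scores lower than 1, so the width-1 search discards it and finishes at 0.6, while the width-2 search keeps it alive and reaches the 0.9 sequence.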


Figure 2: Search strategies for autoregressive image generation. (a) Random search generates images independently. (b) Greedy token optimization commits to the best token at each step. (c) Beam search maintains multiple parallel sequences.

Quantitative Results

T2I-CompBench++ Benchmark

On the T2I-CompBench++ benchmark, our 2B autoregressive model with beam search consistently outperforms the 12B FLUX.1-dev diffusion model, demonstrating the power of architecture over raw parameter count for compositional tasks.

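| Method | Model | Color | Shape | Texture | Spatial | Numeracy | Complex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ma et al. + Search | FLUX.1-dev (12B) | 0.82 | 0.60 | 0.72 | 0.30 | 0.66 | 0.38 |
| Ours + Beam search | Infinity-2B (2B) | 0.83 | 0.64 | 0.76 | 0.36 | 0.67 | 0.42 |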
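The per-category margins can be checked directly from the two rows above. This short script only restates the reported scores; the variable names are illustrative.

```python
categories = ["Color", "Shape", "Texture", "Spatial", "Numeracy", "Complex"]
flux_12b = [0.82, 0.60, 0.72, 0.30, 0.66, 0.38]      # Ma et al. + Search
infinity_2b = [0.83, 0.64, 0.76, 0.36, 0.67, 0.42]   # Ours + Beam search

# Absolute score difference per category, rounded to the table's precision.
margins = {
    name: round(ours - theirs, 2)
    for name, ours, theirs in zip(categories, infinity_2b, flux_12b)
}
wins_everywhere = all(m > 0 for m in margins.values())
largest = max(margins, key=margins.get)

print(margins)
print(wins_everywhere, largest)
```

The margins range from 0.01 (Color, Numeracy) to 0.06 (Spatial), so the 2B model leads in every category, most clearly on spatial prompts.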

GenEval Benchmark

Beam search improves every GenEval category that is not already saturated (single-object accuracy is 1.00 with and without search). The largest absolute gains are in two-object composition (+0.19), counting (+0.25), and spatial positioning (+0.26).

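| Method | One Obj. | Two Obj. | Count | Colors | Position | Color Attr. | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 1.00 | 0.78 | 0.60 | 0.85 | 0.25 | 0.55 | 0.67 |
| Beam Search | 1.00 | 0.97 | 0.85 | 0.90 | 0.51 | 0.74 | 0.83 |
| Improvement | 0.00 | +0.19 | +0.25 | +0.05 | +0.26 | +0.19 | +0.16 |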
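The Improvement row is an absolute difference in score, not a relative change. The sketch below recomputes it from the Baseline and Beam Search rows and, for comparison, derives the relative gain as well; the relative figures are computed here and are not reported in the source.

```python
categories = ["One Obj.", "Two Obj.", "Count", "Colors", "Position", "Color Attr.", "Overall"]
baseline = [1.00, 0.78, 0.60, 0.85, 0.25, 0.55, 0.67]
beam = [1.00, 0.97, 0.85, 0.90, 0.51, 0.74, 0.83]

# Absolute gain in score points, matching the Improvement row.
absolute = [round(s - b, 2) for s, b in zip(beam, baseline)]
# Relative gain in percent of the baseline score (derived, not from the source).
relative = [round(100 * (s - b) / b) for s, b in zip(beam, baseline)]

for name, a, r in zip(categories, absolute, relative):
    print(f"{name:12s} +{a:.2f} absolute, +{r}% relative")
```

For Position, for example, the absolute gain of 0.26 on a baseline of 0.25 corresponds to roughly doubling the score.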

Qualitative Comparison: Verifier Analysis

Image grid (images not reproduced): for each prompt, the columns show Baseline, Beam + ImageReward, and Beam + LLaVA.

| Category | Prompt |
| --- | --- |
| Color | "A yellow frog and a green fly" |
| Shape | "A round cookie and a square tin" |
| Texture | "A metallic car and leather shoes" |
| Spatial | "A man on the right of a lamp" |
| Numeracy | "Five candles" |
| Complex | "Red hat on a brown coatrack" |

Figure 4: Qualitative comparison of verifiers on T2I-CompBench++. For attribute binding (e.g., Color), both verifiers perform well. For complex reasoning (e.g., Spatial), LLaVA-OneVision's capabilities are required to guide the model to a semantically correct image, correcting errors that ImageReward misses.

Additional Qualitative Results

Image gallery (images not reproduced): each prompt below is shown as a Baseline and Search-Guided pair.

"Bee left of key"
"Giraffe right of wallet"
"Green rose, blue tulip"
"Six keys"
"Bird next to refrigerator"
"Six ducks"
"Small button, big zipper"
"Bird on top of balloon"
"Small lion, big horse"
"Fish near car"
"Suitcase right of cow"
"Five kites"
"Tall sunflower, short daisy"
"Four lamps, four dogs"
"Three clocks"
"Four pens"
"Two bowls, three microwaves, two chickens"
"Two sofas, four chickens"
"Two trains"
"Green apple, red kiwi"
"Vase right of man"
"Green frog, yellow fly"
"Woman right of TV"
"Rubber tire, fabric pillow"
"Wooden desk, leather jacket"
"Seven balloons"
"Wooden toy, fabric pants"
"Six bread"
"Wooden fork, glass bowl"

BibTeX

@article{Riise2025VisualAR,
  title  = {Visual Autoregressive Models Beat Diffusion Models on Inference Time Scaling},
  author = {Erik Riise and Mehmet Onurcan Kaya and Dim P. Papadopoulos},
  journal = {arXiv preprint arXiv:2510.16751},
  year   = {2025},
  url={https://erir11.github.io/visual-autoregressive-search/}
}