Preprint / 2026

Visual Persuasion

What Influences Decisions of Vision-Language Models?

Manuel Cherep¹*, Pranav M R²*, Pattie Maes¹, Nikhil Singh³
¹MIT, ²BITS Pilani, ³Dartmouth College. *Equal contribution
Visual prompt optimization pipeline

Abstract
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent’s decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, background, or depicted context). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a new lens on the internal value functions of image-based AI agents, enabling systematic study of what they are visually attracted to and why.

Methods: Visual Prompt Optimization

We study how iterative, naturalistic image edits change model choices in controlled binary decisions while preserving the underlying object or scene semantics. We do this by proposing visual prompt optimization methods that adapt text-based optimization techniques to the visual domain via natural language feedback, using an image generation model to propose and apply edits in composition, lighting, background, or other context attributes.
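To make this loop concrete, here is a minimal sketch of one way such a visual prompt optimization procedure could be implemented. The helper names `propose_edit`, `apply_edit`, and `choice_probability` are hypothetical stand-ins, not the paper's actual API; one possible `choice_probability` estimator is sketched under Results Overview below.

```python
# A minimal sketch of the optimization loop, not the paper's actual API.
# Assumed (hypothetical) helpers:
#   propose_edit(task_prompt, history) -> str     # LLM suggests a naturalistic edit
#   apply_edit(image, instruction)     -> image   # image generation model applies it
#   choice_probability(a, b, prompt)   -> float   # P(VLM picks a over b); see below

def optimize_image(image, task_prompt, steps=10):
    """Iteratively apply naturalistic edits that raise the VLM's
    probability of selecting the image over the unedited original."""
    best_image = image
    best_score = 0.5  # an image compared against itself is a coin flip
    history = []      # natural-language feedback for the edit proposer

    for _ in range(steps):
        instruction = propose_edit(task_prompt, history)
        candidate = apply_edit(best_image, instruction)

        # Revealed preference: controlled binary choice against the original.
        score = choice_probability(candidate, image, task_prompt)
        history.append((instruction, score))

        if score > best_score:  # keep the edit only if it helps
            best_image, best_score = candidate, score

    return best_image, best_score
```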

Optimization examples

Results Overview

Across the evaluated models, final images are generally preferred over both the original and zero-shot edited variants within each strategy. In direct method comparisons on final outputs, CVPO leads overall, with the largest gap over VTG and smaller but frequent gains over VFD. Human choices follow the same direction with respect to image status: optimized variants are selected more often than originals.
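These preference estimates rest on repeated binary queries. Below is a minimal sketch of how such a choice probability could be estimated; `vlm_choose(prompt, first, second)` is a hypothetical call returning 0 if the first image is chosen and 1 if the second is, and the paper's exact querying protocol may differ.

```python
import random

def choice_probability(image_a, image_b, task_prompt, trials=20):
    """Estimate P(VLM chooses image_a over image_b) from repeated binary
    queries, randomizing presentation order on each trial."""
    wins = 0
    for _ in range(trials):
        if random.random() < 0.5:
            # image_a presented first
            wins += 1 - vlm_choose(task_prompt, image_a, image_b)
        else:
            # image_a presented second
            wins += vlm_choose(task_prompt, image_b, image_a)
    return wins / trials
```

Randomizing presentation order matters because VLMs can exhibit position bias in A/B prompts; without it, win rates conflate genuine preference with slot effects.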

Estimated marginal mean probability of choice by task and optimization strategy.
Across all four domains, both zero-shot edits and iterative optimization increase selection probability over original images. Two stable patterns appear: large zero-shot gains first, then additional optimization gains whose size depends on method and domain.
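As a rough illustration of how such per-task, per-strategy estimates could be computed from raw trial data, the following sketch fits a logistic model and reads off cell-level choice probabilities. The file name, column names (`chosen`, `task`, `strategy`), and category values are assumptions, and the paper's actual analysis may use richer (e.g., mixed-effects) models.

```python
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per binary trial, with assumed columns
#   chosen   -- 1 if the candidate image was selected, else 0
#   task     -- task/domain label
#   strategy -- "VTG", "VFD", or "CVPO"
df = pd.read_csv("trials.csv")  # hypothetical file

# Saturated logistic model; per-cell predictions equal cell proportions.
model = smf.logit("chosen ~ C(task) * C(strategy)", data=df).fit()

# Marginal mean probability of choice for each (task, strategy) cell.
grid = df[["task", "strategy"]].drop_duplicates().reset_index(drop=True)
grid["p_choice"] = model.predict(grid)
print(grid.sort_values(["task", "p_choice"]))
```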
Head-to-head final image comparisons across optimization strategies.
In head-to-head comparisons of final outputs, CVPO is most often preferred on average, only slightly ahead of VFD overall, with meaningful heterogeneity across models and tasks.
Head-to-Head Contrasts by Model

Estimated P(choice) by strategy for final outputs. Each model's most effective strategy is shown without a delta; deltas for the other strategies are relative to that model-best. CVPO is most effective for 7/9 models (p<0.05 for 6/9).

VLM               | VTG                  | VFD                  | CVPO
Qwen-VL 235B      | 0.131 (Δ=-0.640****) | 0.601 (Δ=-0.170****) | 0.771
Llama 4 Maverick  | 0.138 (Δ=-0.627****) | 0.586 (Δ=-0.179****) | 0.766
GPT-5 Mini        | 0.190 (Δ=-0.576****) | 0.561 (Δ=-0.205****) | 0.766
Gemini 3 Flash    | 0.140 (Δ=-0.621****) | 0.604 (Δ=-0.157****) | 0.761
GPT-4o            | 0.179 (Δ=-0.570****) | 0.566 (Δ=-0.183****) | 0.749
Gemini 3 Pro      | 0.167 (Δ=-0.559****) | 0.617 (Δ=-0.109****) | 0.726
GPT-5.2           | 0.210 (Δ=-0.462****) | 0.628 (Δ=-0.043)     | 0.672
Claude Sonnet 4.5 | 0.310 (Δ=-0.293****) | 0.603                | 0.594 (Δ=-0.010)
Claude Haiku 4.5  | 0.284 (Δ=-0.392****) | 0.676                | 0.537 (Δ=-0.139****)

**** p<0.0001, *** p<0.001, ** p<0.01, * p<0.05

Human Results

Human choice probabilities by strategy and status.
Human participants are substantially more likely to choose optimized images over originals. Final images are often preferred over zero-shot variants for VFD and CVPO, though the final-vs-zero-shot gap for CVPO is small.
Human head-to-head strategy comparison.
In human head-to-head method comparisons, the ordering is close (CVPO 0.52, VTG 0.51, VFD 0.48), indicating weaker separation between methods in human judgments than in the aggregate model results.

Implications

This paper shows that VLMs' choices can be shifted substantially by naturalistic changes to presentation, even when the underlying object or scene is fixed. The immediate benefit is methodological: our framework provides a controlled way to measure and interpret these sensitivities, which can support auditing, debugging, and evaluation beyond accuracy-oriented benchmarks.

However, the results also point to a concrete risk. The same optimization procedures that reveal VLMs' latent visual preferences could also be used to manipulate them: actors who control images in marketplaces could differentially advantage certain items without changing their substantive qualities, potentially influencing human users as well. This has implications for fairness in high-stakes settings, especially where images function as evidence and where decisions compound over time (e.g., real-estate investing, as in one of our examples).

To reduce misuse, our experiments preserve visual identity and focus on interpretable edits rather than imperceptible adversarial perturbations. The work suggests practical mitigations for deployed agents in certain contexts, including stronger normalization of visual context, explicit checks for irrelevant but decision-shifting cues, and discovery and evaluation protocols that test robustness to plausible presentation changes. Overall, we argue that systematically measuring model-driven visual sensitivities is a prerequisite for governing image-based agents responsibly: it enables targeted red-teaming, robustness checks against plausible presentation shifts, and clearer boundaries for when such agents should not be used.
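As one example of the mitigations above, a deployment-side robustness probe could measure how often an agent's choice flips under semantics-preserving presentation edits. The helpers `agent_choice` and `perturb_presentation` in this sketch are hypothetical, not the paper's protocol.

```python
# Hypothetical helpers assumed here:
#   agent_choice(prompt, a, b)  -> 0 or 1   # index of the chosen image
#   perturb_presentation(image) -> image    # semantics-preserving edit to
#                                           # background, lighting, or framing

def presentation_flip_rate(image_pairs, task_prompt, n_perturbations=10):
    """Fraction of decisions that flip when both images receive
    plausible, content-preserving presentation changes."""
    flips = total = 0
    for img_a, img_b in image_pairs:
        baseline = agent_choice(task_prompt, img_a, img_b)
        for _ in range(n_perturbations):
            choice = agent_choice(task_prompt,
                                  perturb_presentation(img_a),
                                  perturb_presentation(img_b))
            flips += int(choice != baseline)
            total += 1
    return flips / total
```

A high flip rate under such edits flags decision-shifting cues that are irrelevant to the items' substantive qualities.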

Do You Also See Like an Agent?

Compare image pairs from our study and see how your instincts align with agent-style visual preferences.

Duration

3-4 minutes

Data We Collect

  • Binary choice (A/B) per trial
  • Scenario + category metadata
  • Timestamped responses
  • Random anonymous session ID
