Visual Persuasion
What Influences Decisions of Vision-Language Models?
Abstract
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet we know little about the structure of their visual preferences. We introduce a framework for studying this question by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text-based prompt optimization techniques to iteratively propose and apply visually plausible modifications (e.g., to composition, lighting, background, or depicted context) using an image generation model. We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a new lens on the internal value functions of image-based AI agents, enabling systematic study of what they are visually attracted to and why.
Methods: Visual Prompt Optimization
We study how iterative, naturalistic image edits change model choices in controlled binary decisions while preserving the underlying object or scene semantics. We do this by proposing visual prompt optimization methods that adapt text-based optimization techniques to the visual domain via natural-language feedback, using an image generation model to propose and apply edits to composition, lighting, background, and other contextual attributes.
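The specific optimizers studied here (VTG, VFD, CVPO) are not reproduced below; as a minimal sketch of the general loop this section describes, the code greedily accepts natural-language edit proposals that raise a candidate image's win rate in binary choices. `propose_edit`, `apply_edit`, and `choice_prob` are hypothetical callables standing in for the feedback model, the image-editing model, and the VLM choice evaluator, not APIs from this work.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: in practice these would wrap a feedback VLM, an
# image-editing model, and a binary-choice evaluator. They are injected as
# plain callables so the optimization loop itself stays model-agnostic.
ProposeEdit = Callable[[bytes, List[str]], str]  # (image, feedback so far) -> edit instruction
ApplyEdit = Callable[[bytes, str], bytes]        # (image, edit instruction) -> edited image
ChoiceProb = Callable[[bytes, bytes], float]     # P(candidate chosen over reference)

@dataclass
class OptimizationStep:
    instruction: str
    image: bytes
    win_rate: float

def optimize_image(original: bytes,
                   propose_edit: ProposeEdit,
                   apply_edit: ApplyEdit,
                   choice_prob: ChoiceProb,
                   n_rounds: int = 5) -> List[OptimizationStep]:
    """Greedy visual prompt optimization: keep an edit only if it raises the
    estimated probability of being chosen over the original image."""
    history: List[OptimizationStep] = []
    best_image, best_score = original, 0.5  # 0.5 ~ indifference baseline
    feedback: List[str] = []
    for _ in range(n_rounds):
        instruction = propose_edit(best_image, feedback)  # natural-language edit proposal
        candidate = apply_edit(best_image, instruction)   # realized by the image generator
        score = choice_prob(candidate, original)          # evaluated in a binary choice task
        history.append(OptimizationStep(instruction, candidate, score))
        feedback.append(f"'{instruction}' -> win rate {score:.2f}")
        if score > best_score:
            best_image, best_score = candidate, score
    return history
```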
Results Overview
Across evaluated models, final images are generally preferred over both the original and the zero-shot edited variants within each strategy. In direct method comparisons on final outputs, CVPO leads overall, with the largest gap versus VTG and smaller but frequent gains over VFD. Human choices follow the same direction with respect to image status: optimized variants are selected more often than originals.
Across all four domains, both zero-shot edits and iterative optimization increase selection probability over original images. Two stable patterns appear: a large gain from the zero-shot edit first, followed by additional gains from iterative optimization whose size depends on method and domain.
In head-to-head comparisons of final outputs, CVPO is most often preferred on average, only slightly ahead of VFD overall, with meaningful heterogeneity across models and tasks.
| VLM | VTG | VFD | CVPO |
|---|---|---|---|
| Qwen-VL 235B | 0.131 (Δ=-0.640****) | 0.601 (Δ=-0.170****) | **0.771** |
| Llama 4 Maverick | 0.138 (Δ=-0.627****) | 0.586 (Δ=-0.179****) | **0.766** |
| GPT-5 Mini | 0.190 (Δ=-0.576****) | 0.561 (Δ=-0.205****) | **0.766** |
| Gemini 3 Flash | 0.140 (Δ=-0.621****) | 0.604 (Δ=-0.157****) | **0.761** |
| GPT-4o | 0.179 (Δ=-0.570****) | 0.566 (Δ=-0.183****) | **0.749** |
| Gemini 3 Pro | 0.167 (Δ=-0.559****) | 0.617 (Δ=-0.109****) | **0.726** |
| GPT-5.2 | 0.210 (Δ=-0.462****) | 0.628 (Δ=-0.043) | **0.672** |
| Claude Sonnet 4.5 | 0.310 (Δ=-0.293****) | **0.603** | 0.594 (Δ=-0.010) |
| Claude Haiku 4.5 | 0.284 (Δ=-0.392****) | **0.676** | 0.537 (Δ=-0.139****) |
Head-to-Head Contrasts by Model
Estimated P(choice) by strategy for final outputs. The most effective strategy per model is shown in bold; deltas are relative to the model-best strategy. CVPO is the most effective strategy for 7/9 models (significant at p<0.05 for 6/9).
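As an illustration of how entries like these can be produced from raw binary choices, the sketch below estimates a choice probability and attaches the significance markers using an unpaired two-proportion z-test implemented with the standard library. This is our own simplified assumption; the study's actual statistical procedure (e.g., a paired or mixed-effects analysis) may differ.

```python
from math import erf, sqrt

def p_choice(wins: int, trials: int) -> float:
    """Estimated probability that a strategy's final image is chosen."""
    return wins / trials

def two_proportion_p_value(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided z-test for the difference between two choice probabilities.
    A simple unpaired stand-in; not necessarily the test used in the paper."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    # two-sided tail probability under the standard normal distribution
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def stars(p: float) -> str:
    """Map a p-value to the markers used in the table (**** down to *)."""
    for threshold, mark in [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]:
        if p < threshold:
            return mark
    return ""
```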
Human Results
Human participants are substantially more likely to choose optimized images over originals. Final images are often preferred over zero-shot edits for VFD and CVPO, though the final-vs-zero-shot gap for CVPO is small.
In the human head-to-head method comparison, the ordering is close (CVPO 0.52, VTG 0.51, VFD 0.48), indicating weaker separation between methods in human judgments than in the aggregate model results.
**** p<0.0001, *** p<0.001, ** p<0.01, * p<0.05
Implications
This paper shows that VLMs' choices can be shifted substantially by naturalistic changes to presentation, even when the underlying object or scene is fixed. The immediate benefit is methodological: our framework provides a controlled way to measure and interpret VLMs' vulnerabilities, which can support auditing, debugging, and evaluation beyond accuracy-oriented benchmarks.

However, the results also point to a concrete risk. The same optimization procedures that reveal VLMs' latent visual preferences in this setup could be used to manipulate them: actors who control images in marketplaces could differentially advantage certain items without changing their substantive qualities, and such manipulation may carry over to human users as well. This has implications for fairness in high-stakes settings, especially where images function as evidence and where decisions compound over time (e.g., real-estate investing, as in one of our examples). To reduce the potential for misuse, our experiments preserve visual identity and focus on interpretable edits rather than imperceptible, adversarial perturbations.

The work suggests practical mitigations for deployed agents in certain contexts, including stronger normalization of visual context, explicit checks for irrelevant but decision-shifting cues, and discovery-and-evaluation protocols like ours that test robustness to plausible presentation changes. Overall, we argue that systematically measuring model-driven visual sensitivities is a prerequisite for governing image-based agents responsibly: it enables targeted red-teaming, robustness checks against plausible presentation shifts, and clearer boundaries for when such agents should not be used.
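As one concrete form the "checks for irrelevant but decision-shifting cues" could take, the sketch below flags image pairs whose outcome flips once presentation is normalized. It assumes the deployed agent exposes a binary-choice interface and that some presentation-normalizing transform (e.g., re-rendering items on a neutral background) is available; neither interface comes from this work.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed interfaces (not from the paper): `choose` returns 0 or 1 for which
# image the agent picks, and `normalize` applies a presentation-standardizing
# transform such as a neutral background and uniform lighting.
Choose = Callable[[bytes, bytes], int]
Normalize = Callable[[bytes], bytes]

def find_presentation_flips(pairs: Iterable[Tuple[bytes, bytes]],
                            choose: Choose,
                            normalize: Normalize) -> List[int]:
    """Return indices of pairs where the agent's choice changes after
    presentation cues are normalized away, i.e. decisions that appear to
    rest on presentation rather than substance."""
    flips: List[int] = []
    for i, (img_a, img_b) in enumerate(pairs):
        raw_choice = choose(img_a, img_b)
        normalized_choice = choose(normalize(img_a), normalize(img_b))
        if raw_choice != normalized_choice:
            flips.append(i)
    return flips
```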
Do You Also See Like an Agent?
Compare image pairs from our study and see how your instincts align with agent-style visual preferences.
Duration
3-4 minutes
Data We Collect
- Binary choice (A/B) per trial
- Scenario + category metadata
- Timestamped responses
- Random anonymous session ID
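For concreteness, a per-trial record containing exactly the fields listed above could look roughly like the following; the field and function names are illustrative, not the study's actual logging schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4

@dataclass
class TrialRecord:
    """One A/B trial, mirroring the fields listed under 'Data We Collect'."""
    session_id: str   # random anonymous session ID
    scenario: str     # decision scenario shown to the participant
    category: str     # image category metadata
    choice: str       # "A" or "B"
    timestamp: str    # ISO-8601 response time

def new_record(scenario: str, category: str, choice: str,
               session_id: Optional[str] = None) -> TrialRecord:
    return TrialRecord(
        session_id=session_id or uuid4().hex,
        scenario=scenario,
        category=category,
        choice=choice,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
```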