A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

VILA Lab, Mohamed bin Zayed University of AI

Illustration of our proposed framework for generating highly transferable adversarial samples.

Abstract

Despite strong performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly.

To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is randomly cropped with a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples, crafted with local-aggregated perturbations concentrated on crucial regions, exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models such as o1, Claude-3.7-thinking, and Gemini-2.0-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, GPT-4o, and o1, significantly outperforming all prior state-of-the-art attack methods.
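For concreteness, below is a minimal PyTorch-style sketch of this crop-resize-align loop. It is an illustration under assumptions, not the released implementation: `encoder` stands for any CLIP-style surrogate image encoder, the perturbation is optimized with a signed-gradient (PGD-style) update under an L-infinity budget, and torchvision's RandomResizedCrop realizes the random crop with controlled aspect ratio and scale.

# Minimal PGD-style sketch of the crop-resize-align loop described above.
# Illustrative only: `encoder` is any CLIP-style surrogate image encoder.
import torch
import torch.nn.functional as F
from torchvision import transforms

def craft_adv(encoder, src_img, tgt_img, eps=16/255, alpha=1/255, steps=300,
              scale=(0.5, 1.0), ratio=(3/4, 4/3)):
    """src_img, tgt_img: tensors of shape (1, 3, H, W) with values in [0, 1]."""
    crop = transforms.RandomResizedCrop(tuple(src_img.shape[-2:]),
                                        scale=scale, ratio=ratio)
    delta = torch.zeros_like(src_img, requires_grad=True)

    with torch.no_grad():
        tgt_emb = F.normalize(encoder(tgt_img), dim=-1)   # fixed target embedding

    for _ in range(steps):
        # Randomly crop the current adversarial image and resize it back, so the
        # perturbation must encode the target semantics within local regions.
        adv_local = crop((src_img + delta).clamp(0, 1))
        adv_emb = F.normalize(encoder(adv_local), dim=-1)

        # Align the local adversarial embedding with the target embedding
        # (cosine distance, since both embeddings are L2-normalized).
        loss = 1 - (adv_emb * tgt_emb).sum()
        loss.backward()

        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # signed-gradient descent step
            delta.clamp_(-eps, eps)              # project back into the L_inf ball
            delta.grad.zero_()

    return (src_img + delta).detach().clamp(0, 1)

The full method also applies local crops to the target image (LCT in the ablation below) and averages the alignment loss over an ensemble of surrogate encoders (ENS); both are omitted here for brevity.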

Visualizations

Adversarial samples generated by different methods when ε = 16.

Adversarial samples generated by different methods when ε = 4 and ε = 8.

Insights from Failed Attacks

Empirical cumulative distribution function of perturbations from failed adversarial samples vs. a uniform distribution. Shading shows the standard deviation.

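As a rough illustration of this analysis (an assumed recipe, not the paper's exact procedure), one can compare the empirical distribution of a learned perturbation against a uniform reference on [-ε, ε], for example with a two-sided Kolmogorov-Smirnov test:

# Illustrative uniformity check for a learned perturbation (assumed recipe).
import numpy as np
from scipy import stats

def perturbation_uniformity(adv_img: np.ndarray, src_img: np.ndarray, eps: float):
    """adv_img, src_img: arrays in [0, 1]; eps: L_inf budget (e.g., 16/255)."""
    delta = (adv_img - src_img).ravel()
    uniform_cdf = stats.uniform(loc=-eps, scale=2 * eps).cdf
    # A small KS statistic (large p-value) means the perturbation is hard to
    # distinguish from structureless uniform noise.
    return stats.kstest(delta, uniform_cdf)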

Global similarity is less expressive

Comparison of global similarity and ASR across different matching methods, including Global-to-Global, Local-to-Global, and Local-to-Local.

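To make this comparison concrete, here is a hedged sketch of how the three matching granularities can be measured with a CLIP-style surrogate `encoder` (function and parameter names are illustrative): Global-to-Global compares whole-image embeddings, while the local variants average cosine similarity over random crops of the adversarial and/or target image.

# Sketch of the three matching granularities (illustrative, not released code).
import torch
import torch.nn.functional as F
from torchvision import transforms

def _embed(encoder, img):
    return F.normalize(encoder(img), dim=-1)

@torch.no_grad()
def similarity(encoder, adv, tgt, mode="local_to_local", n_crops=10, scale=(0.5, 1.0)):
    """adv, tgt: tensors of shape (1, 3, H, W); mode selects the matching granularity."""
    crop = transforms.RandomResizedCrop(tuple(adv.shape[-2:]), scale=scale)
    n = 1 if mode == "global_to_global" else n_crops
    sims = []
    for _ in range(n):
        a = adv if mode == "global_to_global" else crop(adv)   # local crop of source
        t = crop(tgt) if mode == "local_to_local" else tgt     # local crop of target
        sims.append((_embed(encoder, a) * _embed(encoder, t)).sum())
    return torch.stack(sims).mean()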

Experiment Results


Comparison with the state-of-the-art approaches.

Subplots (left to right): KMRA%, KMRB%, KMRC%, and ASR%.

Ablation on our two proposed strategies, local-level matching and ensemble, conducted by separately removing the local crop of the target image (LCT), the local crop of the source image (LCS), and the ensemble (ENS). Removing LCT has only a marginal impact.
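For reference, a minimal sketch (assumed form, not the released code) of the ensemble component (ENS): the embedding-alignment loss is averaged over several surrogate encoders instead of being computed with a single one.

# Ensemble alignment loss over multiple surrogate encoders (assumed form).
import torch
import torch.nn.functional as F

def ensemble_alignment_loss(encoders, adv_local, tgt_img):
    """encoders: iterable of surrogate image encoders (e.g., several CLIP backbones)."""
    losses = []
    for enc in encoders:
        a = F.normalize(enc(adv_local), dim=-1)
        with torch.no_grad():
            t = F.normalize(enc(tgt_img), dim=-1)
        losses.append(1 - (a * t).sum())   # cosine distance per encoder
    return torch.stack(losses).mean()      # average over the ensemble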

Real-world Scenario Screenshots

Below are two groups of example responses from commercial LVLMs to targeted attacks generated by our method.

Target images (i) and (ii) for the following two groups of example responses.

Group (i) responses: (a) GPT-4o, (b) Gemini-2.0-Flash, (c) Claude-3.5-Sonnet, (d) GPT-4o, (e) Gemini-2.0-Flash-Thinking, (f) Claude-3.7-Thinking.

Group (ii) responses: (a) GPT-4o, (b) Gemini-2.0-Flash, (c) Claude-3.5-Sonnet, (d) GPT-4o, (e) Gemini-2.0-Flash-Thinking, (f) Claude-3.7-Thinking.

Example responses from the latest commercial LVLMs to targeted attacks generated by our method.

(a) GPT-4.5, (b) Claude-3.7-Sonnet.

BibTeX

@article{mattack2025,
  author    = {Zhaoyi Li and Xiaohan Zhao and Dong-Dong Wu and Jiacheng Cui and Zhiqiang Shen},
  title     = {A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1},
  journal   = {arXiv preprint arXiv:2503.10635},
  year      = {2025},
}