Local-level matching methods exhibit near-zero cosine similarity between the gradients of consecutive iterations, even when the sampled crops overlap substantially. This stems from ViTs' translation sensitivity and an overlooked asymmetry: source crops reshape the pixel-space gradient landscape, while target crops merely shift the feature-space reference.
(a) Gradient similarity vs. IoU between two crops. (b) Cosine similarity of consecutive source gradients across iterations.
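For concreteness, the panel-(b) diagnostic can be computed with a few lines of PyTorch (a minimal sketch; `f`, `loss_fn`, `random_crop`, and the target embedding `y` are illustrative stand-ins, not names from the paper):

```python
import torch
import torch.nn.functional as F

def grad_cosine(g_prev: torch.Tensor, g_curr: torch.Tensor) -> float:
    """Cosine similarity between two pixel-space gradients, flattened to vectors."""
    return F.cosine_similarity(g_prev.flatten(), g_curr.flatten(), dim=0).item()

# Illustrative usage: x_adv is the adversarial source with requires_grad=True.
# g1, = torch.autograd.grad(loss_fn(f(random_crop(x_adv)), y), x_adv)
# g2, = torch.autograd.grad(loss_fn(f(random_crop(x_adv)), y), x_adv)
# grad_cosine(g1, g2)  # near zero for single-crop local matching
```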
We reformulate the objective as an expectation over local transformations within an asymmetric framework:
$$\min_{\lVert \mathbf{X}_\text{sou} - \mathbf{X} \rVert_p \le \epsilon} \mathbb{E}_{\mathcal{T} \sim \mathcal{D},\, y \sim \mathcal{Y}} \left[ \mathcal{L}\!\left(f\!\left(\mathcal{T}(\mathbf{X}_{\text{sou}})\right),\, y\right) \right]$$
where $\mathbf{X}$ is the clean source image, $\mathcal{D}$ is the distribution of local transformations, and $\mathcal{Y}$ is the target semantic distribution; the $\ell_p$ constraint keeps the adversarial source $\mathbf{X}_\text{sou}$ within the perturbation budget $\epsilon$. The formulation makes the intrinsic asymmetry explicit: the task is to embed target content $y$ into a locally transformed source $\mathcal{T}(\mathbf{X}_{\text{sou}})$. Our two enhancements, Multi-Crop Alignment (MCA) and Auxiliary Target Alignment (ATA), improve the estimation of this expectation and the sampling quality of $\mathcal{Y}$, respectively.
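Read operationally, each attack step draws one $\mathcal{T}$ and evaluates the loss on the transformed source. A minimal one-sample estimate of the expectation (a sketch, assuming a CLIP-style image encoder `f` and a feature-matching loss `loss_fn`; random crop-and-resize plays the role of $\mathcal{T} \sim \mathcal{D}$):

```python
import torch
from torchvision import transforms

# T ~ D: random crop-and-resize is the canonical local transformation.
sample_T = transforms.RandomResizedCrop(size=224, scale=(0.5, 1.0))

def objective_estimate(x_sou: torch.Tensor, y: torch.Tensor, f, loss_fn) -> torch.Tensor:
    """Single-sample Monte Carlo estimate of E_{T~D}[ L(f(T(x_sou)), y) ]."""
    return loss_fn(f(sample_T(x_sou)), y)
```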
MCA averages gradients from $K$ independent crops per iteration, yielding a low-variance estimate of the expected gradient. This produces smoother gradient patterns and accelerates convergence compared to single-crop alignment.
(a) Optimization trajectories with different K. (b) Gradient patterns: single-crop (M-Attack) vs. multi-crop (M-Attack-V2).
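A minimal sketch of the MCA gradient estimate, reusing the `sample_T`, `f`, and `loss_fn` stand-ins from the sketch above:

```python
import torch

def mca_gradient(x_adv: torch.Tensor, y0: torch.Tensor, f, loss_fn, K: int = 4) -> torch.Tensor:
    """Average the pixel-space gradients of K independently cropped views,
    giving a low-variance estimate of the expected gradient."""
    x_adv = x_adv.detach().requires_grad_(True)
    g_sum = torch.zeros_like(x_adv)
    for _ in range(K):
        loss = loss_fn(f(sample_T(x_adv)), y0)
        g, = torch.autograd.grad(loss, x_adv)
        g_sum += g
    return g_sum / K
```

Averaging the $K$ losses and taking a single backward pass is equivalent and usually cheaper; either way, the averaged direction drives the usual constrained update (e.g., PGD-style) under the $\epsilon$ budget.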
Selecting a representative target embedding $y \in \mathcal{Y}$ is challenging because $\mathcal{Y}$ is unobservable. M-Attack explores it through transformed views of the target, but this creates a dilemma: aggressive crops drift too far from the target semantics, while conservative ones yield too little exploration signal.
ATA introduces $P$ auxiliary images $\{\mathbf{X}_\text{aux}^{(p)}\}_{p=1}^P$ as additional semantic anchors. With mild transformations $\tilde{\mathcal{T}} \sim \tilde{\mathcal{D}}$ applied to each anchor, the combined objective becomes:
$$\hat{\mathcal{L}} = \frac{1}{K} \sum_{k=1}^{K} \Big[ \mathcal{L}(f(\mathcal{T}_k(\mathbf{X}_\text{sou})), y_0) + \frac{\lambda}{P} \sum_{p=1}^{P} \mathcal{L}(f(\mathcal{T}_k(\mathbf{X}_{\text{sou}})), \tilde{y}_p) \Big]$$
where $y_0 = f(\hat{\mathcal{T}}_0(\mathbf{X}_\text{tar}))$, $\tilde{y}_p = f(\tilde{\mathcal{T}}_p(\mathbf{X}_\text{aux}^{(p)}))$, and $\lambda \in [0,1]$ interpolates between target fidelity and auxiliary diversity. ATA thus strikes a better exploration-exploitation balance: rather than forcing exploration through ever more aggressive target transformations, it spends the semantic-shift budget on exploration anchored to the auxiliary set.
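A sketch of $\hat{\mathcal{L}}$ in the same vein (hypothetical names again; `y0` is the target embedding and `y_aux` holds the $P$ auxiliary embeddings $\tilde{y}_p$, each refreshed per iteration by encoding a mildly transformed anchor and detached so gradients flow only through the source):

```python
import torch

def ata_loss(x_adv: torch.Tensor, y0: torch.Tensor, y_aux: list, f, loss_fn,
             K: int = 4, lam: float = 0.5) -> torch.Tensor:
    """Combined objective: each of K source crops is matched to the target
    embedding y0 and, with weight lam, to the P auxiliary embeddings."""
    total = x_adv.new_zeros(())
    for _ in range(K):
        z = f(sample_T(x_adv))                              # one source-crop embedding
        total = total + loss_fn(z, y0)                      # target-fidelity term
        total = total + (lam / len(y_aux)) * sum(loss_fn(z, yp) for yp in y_aux)
    return total / K

# Per iteration, the anchors would be refreshed along the lines of
# (mild_T is a hypothetical stand-in for a draw from D-tilde):
# y_aux = [f(mild_T(x_p)).detach() for x_p in x_aux]
```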
M-Attack-V2 consistently outperforms all existing methods across GPT-5, Claude 4.0-thinking, and Gemini 2.5-Pro, achieving the highest attack success rates with strong imperceptibility.
Comparison with state-of-the-art approaches on commercial black-box LVLMs.
Visual comparison across methods. M-Attack-V2 produces more effective yet more imperceptible perturbations.
@article{zhao2026pushingfrontierblackboxlvlm,
  title={Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting},
  author={Zhao, Xiaohan and Li, Zhaoyi and Luo, Yaxin and Cui, Jiacheng and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2602.17645},
  year={2026}
}