AAAI 2026
Jun-Hyun Bae, Wonyong Jo, Jaehyup Lee, Heechul Jung
Kyungpook National University
Translated by Claude Opus 4.7


Abstract

Text-to-image diffusion models utilize cross-attention to integrate textual information into the visual latent space, yet the transformation from text embeddings to latent features remains largely unexplored. We provide a mechanistic analysis of the output-value (OV) circuits within cross-attention layers through spectral analysis via singular value decomposition. Our analysis demonstrates that semantic concepts are encoded in low-dimensional subspaces spanned by singular vectors in OV circuits across cross-attention heads. To verify this, we intervene on concept-related components during the diffusion process and show that intervening on the identified spectral components produces the corresponding conceptual changes. We further validate these findings by examining visual outputs of isolated subspaces and their alignment with text embedding space. Through this mechanistic understanding, we demonstrate that simply nullifying these spectral components can achieve targeted concept removal with performance comparable to existing methods while providing interpretability.


Overview

We reveal how OV circuits in cross-attention transform text into visual features, and propose a retraining-free method for targeted concept removal.

  1. Spectral Decomposition — Decompose \(\mathbf{W}_{\text{OV}}\) via SVD to extract independent text-to-visual transformation pathways.
  2. Concept Localization — Discover that semantic concepts such as “Van Gogh style” or “nudity” concentrate in a small subset of spectral components.
  3. Spectral Nullification — Remove only these components to achieve targeted concept removal without retraining, matching existing methods.
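The third step can be illustrated on a toy matrix. The following is a minimal NumPy sketch rather than the authors' implementation: the matrix `w_ov` is random, the convention that it maps a text vector `x` to `w_ov @ x` is an assumption, and `concept_idx` is a hypothetical selection of components.

```python
import numpy as np

def nullify_components(w_ov, indices):
    """Return a copy of w_ov whose chosen singular components are zeroed out."""
    u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
    s_edit = s.copy()
    s_edit[list(indices)] = 0.0          # drop only the selected pathways
    return (u * s_edit) @ vt             # rebuild the weight matrix, no training

rng = np.random.default_rng(0)
w_ov = rng.standard_normal((8, 6))       # toy stand-in: 6-dim text -> 8-dim visual
u, s, vt = np.linalg.svd(w_ov, full_matrices=False)

concept_idx = [1, 4]                     # hypothetical concept-related components
w_edit = nullify_components(w_ov, concept_idx)

# Text input along a removed right singular vector no longer reaches the output,
# while input along a kept vector passes through with its original gain.
removed_response = np.linalg.norm(w_edit @ vt[1])
kept_response = np.linalg.norm(w_edit @ vt[0])
print(removed_response, kept_response, s[0])
```

Because the edit is a one-time change to the weights, it needs no gradient steps and adds no cost at inference time.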

Spectral Isolation

Spectral isolation of each concept. Style concepts (Van Gogh, Picasso) decompose into visual patterns such as texture and color, while content concepts (Nudity) retain holistic human forms. This suggests that the model employs qualitatively different encoding strategies depending on concept type.


Method

The key mechanism that transforms text into visual features in cross-attention is the \(\mathbf{W}_{\text{OV}}\) matrix. Text embeddings organize semantic information along intrinsic axes, and \(\mathbf{W}_{\text{OV}}\) learns low-dimensional subspaces aligned with these axes to perform concept-specific transformations. Decomposing this matrix via SVD yields spectral components that each serve as independent text-to-visual pathways, and semantic concepts like “Van Gogh style” or “nudity” concentrate in a small subset of these components.
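To make the decomposition concrete, here is a small NumPy sketch under stated assumptions: a per-head OV matrix is formed as the product of a value projection `w_v` and that head's output projection `w_o` (a common convention that this page does not spell out), and all sizes and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_head, d_model = 16, 4, 12      # toy sizes, not the real model's

w_v = rng.standard_normal((d_head, d_text))    # text embedding -> head value space
w_o = rng.standard_normal((d_model, d_head))   # head value space -> visual latent
w_ov = w_o @ w_v                               # (d_model, d_text), rank <= d_head

u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
rank = int(np.sum(s > 1e-10 * s[0]))           # number of non-trivial components

# Component i is a rank-1 pathway: it reads the text along vt[i] and writes
# along u[:, i], scaled by s[i]. Summing the pathways recovers w_ov.
pathways = [s[i] * np.outer(u[:, i], vt[i]) for i in range(rank)]
reconstruction = np.sum(pathways, axis=0)

# For any text vector x, each pathway contributes independently to the output.
x = rng.standard_normal(d_text)
per_component = np.array([s[i] * (vt[i] @ x) for i in range(rank)])
print(rank, np.allclose(reconstruction, w_ov))
```

In this construction each head yields at most `d_head` non-trivial spectral components, so the components can be read, scaled, or removed one at a time.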

Only about 10% of all heads contribute highly to any given concept, and scaling their outputs modulates the concept’s intensity.

Head Modulation

Scaling the output of high-contribution heads (~10% of all heads) for the “Van Gogh” concept by a factor \(\alpha\).

However, head-level manipulation has limited precision. Since individual attention heads encode multiple concepts simultaneously (polysemanticity), scaling an entire head alters unintended concepts as well. Operating at the spectral component level resolves this issue, enabling disentangled control over distinct concept dimensions such as style and content.

Spectral vs Head

Spectral modulation (top) vs head-level modulation (bottom). Head-level manipulation changes attributes beyond the target concept, while spectral-level manipulation provides precise control over only the intended concept.
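The difference between the two levels of control can be sketched on a toy matrix. This is an illustrative NumPy example, not the paper's code: scaling a head's output is modeled as scaling its whole OV matrix, and treating component 2 as the target concept and component 0 as an unrelated one is a hypothetical labeling.

```python
import numpy as np

def scale_head(w_ov, alpha):
    """Head-level modulation: every pathway in the head is scaled."""
    return alpha * w_ov

def scale_components(w_ov, indices, alpha):
    """Spectral modulation: only the chosen singular components are scaled."""
    u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
    gain = np.ones_like(s)
    gain[list(indices)] = alpha
    return (u * (s * gain)) @ vt

rng = np.random.default_rng(1)
w_ov = rng.standard_normal((10, 6))          # toy OV matrix for one head
u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
target, other = vt[2], vt[0]                 # hypothetical concept directions
alpha = 2.0

w_head = scale_head(w_ov, alpha)
w_spec = scale_components(w_ov, [2], alpha)

def gain_along(w_new, direction):
    """Ratio of output norms after vs. before the edit for one text direction."""
    return np.linalg.norm(w_new @ direction) / np.linalg.norm(w_ov @ direction)

print(gain_along(w_head, target), gain_along(w_head, other))   # both scaled
print(gain_along(w_spec, target), gain_along(w_spec, other))   # only target scaled
```

Head-level scaling amplifies both directions by \(\alpha\), whereas the spectral edit changes only the targeted direction and leaves the other at its original gain.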

The figure below shows the distribution of concept contributions across all heads. Most heads contribute minimally to any given concept, while a small number of heads carry the majority of the signal. Furthermore, within the same head, Van Gogh, Monet, and Picasso activate different singular vectors, and the high-contribution singular vectors do not necessarily correspond to the largest singular values (lowest indices). Each concept exhibits a unique activation pattern.

Head Distribution

Distribution of concept contributions across heads. Concept information is concentrated in a small number of high-contribution heads.
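One way such a contribution map could be computed is sketched below. The scoring rule is an assumption made for illustration (singular value times the absolute projection of a text difference vector onto each right singular vector), and the heads and the difference vector `delta` are random toy data, so the output is not expected to reproduce the concentration seen in the real model.

```python
import numpy as np

rng = np.random.default_rng(2)
n_heads, d_text, d_model, d_head = 20, 16, 12, 4   # toy sizes

# Toy per-head OV matrices, each of rank d_head.
heads = [
    rng.standard_normal((d_model, d_head)) @ rng.standard_normal((d_head, d_text))
    for _ in range(n_heads)
]

# Stand-in for embedding(concept prompt) - embedding(base prompt).
delta = rng.standard_normal(d_text)

def component_scores(w_ov, delta, rank):
    """Assumed score: how strongly each spectral pathway transmits delta."""
    u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
    return s[:rank] * np.abs(vt[:rank] @ delta)

scores = np.stack([component_scores(w, delta, d_head) for w in heads])  # (heads, comps)
head_scores = scores.sum(axis=1)

k = max(1, int(round(0.10 * n_heads)))             # keep roughly the top 10% of heads
top_heads = np.argsort(head_scores)[::-1][:k]

# The strongest component within a head need not be index 0 (largest singular value).
top_component = scores.argmax(axis=1)
print(top_heads, top_component[top_heads])
```

Ranking at the component level rather than only at the head level is what allows different concepts to pick out different singular vectors inside the same head.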


Results

Concept Removal Benchmark

We evaluate Spectral Nullification (SN) for NSFW concept removal across five adversarial prompt benchmarks — Ring-A-Bell (K16, K38, K77), I2P, MMA, P4D, and UnLearnDiffAtk. The metric is Attack Success Rate (ASR; lower is better).

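Attack Success Rate (%, ↓) across adversarial benchmarks and generation quality (FID ↓) on 1,000 COCO captions. The Type column gives each method's category: training-based, closed-form, inference-time, or spectral.

| Method | Type | Ring-A-Bell K16 | Ring-A-Bell K38 | Ring-A-Bell K77 | I2P | MMA | P4D | UnLearn | FID ↓ |
|---|---|---|---|---|---|---|---|---|---|
| SD v1.4 | - | 97.89 | 94.74 | 87.37 | 25.03 | 68.10 | 69.76 | 50.70 | - |
| ESD | training-based | 76.84 | 78.95 | 74.74 | 13.04 | 24.80 | 50.24 | 26.06 | 38.95 |
| CA | training-based | 88.42 | 88.42 | 84.21 | 19.30 | 58.50 | 63.41 | 44.37 | 26.02 |
| MACE | training-based | 89.47 | 95.79 | 93.68 | 25.56 | 66.00 | 68.29 | 50.70 | 33.38 |
| SDID | training-based | 95.79 | 91.58 | 84.21 | 23.12 | 62.00 | 66.83 | 48.59 | 39.74 |
| UCE | closed-form | 22.11 | 18.95 | 21.05 | 8.06 | 41.00 | 38.05 | 21.13 | 34.43 |
| RECE | closed-form | 10.53 | 9.47 | 7.37 | 4.24 | 25.00 | 21.46 | 9.15 | 40.00 |
| SLD-Medium | inference-time | 68.42 | 60.00 | 50.53 | 8.38 | 48.70 | 43.90 | 23.94 | 32.09 |
| SLD-Strong | inference-time | 18.95 | 10.53 | 6.32 | 2.33 | 7.70 | 11.71 | 7.04 | 41.34 |
| SAFREE | inference-time | 65.26 | 55.79 | 45.26 | 6.26 | 29.90 | 38.54 | 14.79 | 40.71 |
| SN (Ours) | spectral | 41.05 | 35.79 | 30.53 | 4.24 | 17.60 | 18.54 | 8.45 | 40.67 |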

SN ties with RECE for second place on I2P (4.24%) and ranks second on MMA, P4D, and UnLearnDiffAtk. SLD-Strong, which has the lowest ASR on most benchmarks, is an inference-time guidance method that intervenes throughout the generation process, whereas SN only removes spectral components from the weight matrices, a fundamentally different approach. Notably, without any additional training, SN achieves a lower ASR than every training-based baseline in the table (ESD, CA, MACE, SDID) on every benchmark. SN's FID of 40.67 is comparable to that of RECE (40.00), SAFREE (40.71), and SLD-Strong (41.34), so it remains competitive on the quality–removal trade-off.

Quality Tradeoff

Concept removal performance (P4D ASR) vs generation quality (CLIP score). SN achieves a competitive trade-off without retraining. SLD-Strong achieves the lowest ASR but at the cost of reduced generation quality.

Verifying the Semantics of Spectral Subspaces

We verify that the identified spectral subspaces genuinely capture the semantics of their target concepts through two approaches.

Text space alignment: We reconstruct text difference vectors using concept-specific spectral components and compute cosine similarity against all 49,408 CLIP vocabulary tokens. For the nudity concept, tokens such as “nude”, “naked”, “topless”, “erotica”, and “nsfw” rank at the top, confirming that the spectral subspace accurately captures the intended semantics.

Token Alignment

Cosine similarity between reconstructed vectors from concept spectral components and the CLIP vocabulary. Concept-related tokens rank at the top.
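The alignment check can be sketched with a toy vocabulary. Everything below is illustrative: the five-token vocabulary and its embeddings are random stand-ins for the CLIP vocabulary, the concept direction is planted by construction as right singular vector 5 of a toy OV matrix, and selecting components by `s * |v · delta|` is an assumed scoring rule.

```python
import numpy as np

rng = np.random.default_rng(3)
d_text, d_model = 16, 12
vocab = ["nude", "tree", "car", "river", "house"]      # toy vocabulary
emb = rng.standard_normal((len(vocab), d_text))        # toy token embeddings
delta = emb[0]                                         # toy text difference vector

# Build a toy W_OV whose right singular vector 5 is the concept direction.
basis, _ = np.linalg.qr(
    np.column_stack([delta, rng.standard_normal((d_text, d_model - 1))])
)
v = basis.T.copy()                    # orthonormal rows; row 0 is +/- delta/|delta|
v[[0, 5]] = v[[5, 0]]                 # move the concept direction to index 5
u, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
s = np.linspace(3.0, 1.0, d_model)    # distinct, descending singular values
w_ov = (u * s) @ v

# Analysis: pick concept components, reconstruct delta, compare with the vocabulary.
_, s_hat, vt = np.linalg.svd(w_ov, full_matrices=False)
scores = s_hat * np.abs(vt @ delta)
selected = np.argsort(scores)[::-1][:1]               # top-1 component
recon = vt[selected].T @ (vt[selected] @ delta)       # projection onto the subspace

cos = emb @ recon / (np.linalg.norm(emb, axis=1) * np.linalg.norm(recon))
ranking = [vocab[i] for i in np.argsort(cos)[::-1]]
print(selected, ranking)
```

In this toy setup the selected component is not the one with the largest singular value, and the reconstructed vector still ranks the planted token first.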

Causal verification (t-SNE) and inter-concept structure (Jaccard): Removing concept-related spectral components causes the base/concept prompt clusters to merge in the t-SNE of head outputs. This provides causal evidence that these components are indeed responsible for encoding the concept. Jaccard similarity analysis reveals that semantically similar concepts (Van Gogh↔Monet) share more spectral components, while each concept retains a unique spectral signature.

t-SNE Jaccard

(a) t-SNE before and after spectral component removal. Clusters merge after removal. (b) Jaccard similarity between concepts. Similar concepts overlap but each maintains a unique signature.
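For reference, the Jaccard similarity between two concepts is the size of the intersection of their selected component sets divided by the size of the union. The sketch below uses made-up `(head, component)` index pairs purely to show the computation; they are not values from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical (head, component) selections for three concepts.
components = {
    "Van Gogh": {(3, 12), (3, 40), (7, 5), (9, 21)},
    "Monet": {(3, 12), (7, 5), (9, 30), (11, 2)},
    "Nudity": {(1, 8), (9, 21), (14, 3), (15, 17)},
}

style_pair = jaccard(components["Van Gogh"], components["Monet"])
mixed_pair = jaccard(components["Van Gogh"], components["Nudity"])
print(round(style_pair, 3), round(mixed_pair, 3))
```

A higher value for a pair of concepts means more shared spectral components, while a value below 1 means each concept still has components of its own.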

Qualitative

Applying SN to adversarial prompts from the I2P benchmark effectively removes inappropriate content.

I2P Comparison

I2P benchmark with adversarial prompts. Left: SD v1.4 (before SN). Right: after SN.

Scalability & Practical Notes

  • SD v2.1: 195 cross-attention heads across 16 layers, 12,480 singular vectors in total. Concept removal nullifies the top 20% of components.
  • SDXL: 70 cross-attention layers, 83,200 singular vectors (a 6.67× increase). Concepts are distributed more broadly and require 20–30% removal, but subspace localization is preserved.
  • Computational cost: the full SVD for 4 prompt pairs takes 33 seconds on a single A100 GPU, and the cached decomposition occupies 2.17 GB. No additional training is required.
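
As a rough consistency check on these counts, assume each SD v2.1 head contributes 64 singular vectors (a head dimension of 64, which is an assumption not stated above) and that the 20% is taken over all vectors:

\[
195 \times 64 = 12{,}480, \qquad 0.20 \times 12{,}480 = 2{,}496, \qquad \frac{83{,}200}{12{,}480} \approx 6.67 .
\]

Under those assumptions, roughly 2,500 components would be nullified in SD v2.1, and the SDXL ratio matches the 6.67× figure quoted above.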