Mechanistic Dissection of Cross-Attention Subspaces in T2I Diffusion Models

AAAI 2026
Jun-Hyun Bae, Wonyong Jo, Jaehyup Lee, Heechul Jung
Kyungpook National University

Presentation

Abstract

Text-to-image diffusion models utilize cross-attention to integrate textual information into the visual latent space, yet the transformation from text embeddings to latent features remains largely unexplored. We provide a mechanistic analysis of the output-value (OV) circuits within cross-attention layers through spectral analysis via singular value decomposition. Our analysis demonstrates that semantic concepts are encoded in low-dimensional subspaces spanned by singular vectors in OV circuits across cross-attention heads. To verify this, we intervene on concept-related components in the diffusion process, demonstrating that intervention on identified spectral components affects conceptual changes. We further validate these findings by examining visual outputs of isolated subspaces and their alignment with text embedding space. Through this mechanistic understanding, we demonstrate that simply nullifying these spectral components can achieve targeted concept removal with performance comparable to existing methods while providing interpretability.

Overview

We show how the OV circuit in cross-attention transforms text into visual features, and use this to propose a method that removes concepts without retraining.

Spectral Decomposition: decompose $\mathbf{W}_{\text{OV}}$ with SVD to extract independent text-to-visual transformation pathways.
Concept Localization: semantic concepts such as "Van Gogh style" and "nudity" turn out to be concentrated in a small number of spectral components out of the full spectrum.
Spectral Nullification: removing only those components achieves targeted concept removal comparable to existing methods, without retraining.
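
The three steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes a row-vector convention in which a head's OV matrix is the product of its value projection `w_v` and its slice of the output projection `w_o`, and it takes the indices of the concept-related components as given, since this page does not spell out how they are selected. The function names and toy shapes are hypothetical.

```python
import numpy as np

def ov_matrix(w_v, w_o):
    """Compose one head's OV matrix (assumed shapes: text_dim x head_dim, head_dim x model_dim)."""
    return w_v @ w_o

def nullify_components(w_ov, component_idx):
    """Zero the chosen singular values and rebuild the matrix."""
    u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
    s_edit = s.copy()
    s_edit[np.asarray(component_idx)] = 0.0
    return (u * s_edit) @ vt

rng = np.random.default_rng(0)
w_v = rng.standard_normal((16, 4))   # toy sizes: text_dim=16, head_dim=4
w_o = rng.standard_normal((4, 12))   # toy size: model_dim=12
w_ov = ov_matrix(w_v, w_o)           # rank is at most head_dim
w_edit = nullify_components(w_ov, [0, 2])  # hypothetical concept components
```

Because the edit only changes the weights, it can be applied once before sampling, and no gradient step is involved.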

Spectral Isolation

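Results of activating only each concept's spectral components. Style leaves only texture, lighting leaves only the fluorescent glow, and content retains even the human figure.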

Method

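The key to transforming text into visual features in cross-attention is the matrix $\mathbf{W}_{\text{OV}}$. Text embeddings organize semantic information along distinct semantic axes, and $\mathbf{W}_{\text{OV}}$ performs concept-specific transformations through low-dimensional subspaces aligned with these axes. Decomposing this matrix with SVD turns each spectral component into an independent text-to-visual transformation pathway, and semantic concepts such as "Van Gogh style" and "nudity" are concentrated in a small number of components out of the full spectrum.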
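
Written out, the decomposition looks as follows. This is a sketch under an assumed row-vector convention, with $\mathbf{e}$ a text embedding, $r$ the rank of $\mathbf{W}_{\text{OV}}$, and $\mathcal{S}_c$ a hypothetical name for the index set of components tied to concept $c$; none of these symbols is defined on this page.

$$
\mathbf{W}_{\text{OV}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top} = \sum_{i=1}^{r} \sigma_i\, \mathbf{u}_i \mathbf{v}_i^{\top},
\qquad
\mathbf{e}\,\mathbf{W}_{\text{OV}} = \sum_{i=1}^{r} \sigma_i\, (\mathbf{e}\,\mathbf{u}_i)\, \mathbf{v}_i^{\top}
$$

Each term reads the text embedding along one direction $\mathbf{u}_i$ and writes along one visual direction $\mathbf{v}_i$, scaled by $\sigma_i$. Under the same assumptions, nullifying a concept amounts to dropping the terms with $i \in \mathcal{S}_c$:

$$
\tilde{\mathbf{W}}_{\text{OV}} = \sum_{i \notin \mathcal{S}_c} \sigma_i\, \mathbf{u}_i \mathbf{v}_i^{\top}
$$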

For a given concept, high-contribution heads make up only about 10% of all heads, and scaling the outputs of those heads controls the strength of the concept.

Head Modulation

Outputs of the high-contribution heads for the "Van Gogh" concept (~10% of all heads), scaled by $\alpha$.
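
A minimal sketch of head-level modulation, assuming per-head outputs are summed into the layer output and that a per-head contribution score for the concept is already available. The score, the function names, and the toy shapes are hypothetical; the page states only that roughly 10% of heads are high-contribution and that their outputs are scaled by $\alpha$.

```python
import numpy as np

def top_heads(scores, fraction=0.10):
    """Indices of the highest-scoring heads (at least one)."""
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(scores)[::-1][:k]

def modulate_heads(head_outputs, head_idx, alpha):
    """Scale the outputs of the chosen heads by alpha, then sum over heads."""
    weights = np.ones(head_outputs.shape[0])
    weights[head_idx] = alpha
    return np.tensordot(weights, head_outputs, axes=1)

rng = np.random.default_rng(0)
head_outputs = rng.standard_normal((20, 5, 8))  # toy: 20 heads, 5 positions, 8 channels
scores = rng.random(20)                         # placeholder contribution scores
selected = top_heads(scores)                    # 2 of the 20 heads
suppressed = modulate_heads(head_outputs, selected, alpha=0.0)
amplified = modulate_heads(head_outputs, selected, alpha=2.0)
```

With `alpha=1.0` the layer output is unchanged, so $\alpha$ acts as a single knob around the unmodified model.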

Manipulating at the level of spectral components rather than heads makes it possible to separate and control different concept dimensions, such as style and content.

Spectral vs Head

Spectral modulation (top) vs head-level modulation (bottom). Manipulating at the level of spectral components allows style alone to be isolated and controlled.
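
The difference between the two levels of control can be illustrated with a toy example. It assumes a row-vector OV convention and a hypothetical component index: scaling one singular value changes the output only along the matching right singular vector, whereas scaling the whole head rescales every direction at once.

```python
import numpy as np

def scale_components(w_ov, component_idx, alpha):
    """Scale the chosen singular values by alpha; also return the right singular vectors."""
    u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
    s_edit = s.copy()
    s_edit[np.asarray(component_idx)] *= alpha
    return (u * s_edit) @ vt, vt

rng = np.random.default_rng(1)
w_ov = rng.standard_normal((16, 12))   # toy OV matrix for one head
text = rng.standard_normal((1, 16))    # toy text embedding (row vector)

alpha = 0.0
out = text @ w_ov
w_spec, vt = scale_components(w_ov, [0], alpha)
delta_spec = text @ w_spec - out   # effect of scaling component 0 only
delta_head = alpha * out - out     # effect of scaling the whole head output

# Express both changes in the basis of right singular vectors.
coords_spec = delta_spec @ vt.T
coords_head = delta_head @ vt.T
```

In this toy setting `coords_spec` is nonzero only in its first entry, while `coords_head` spreads over all directions.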

The figure below shows the distribution of concept contribution across all heads. Most heads contribute almost nothing to a given concept, and only a few show high contribution.

Head Distribution

Distribution of concept contribution per head. Concept information is concentrated in a small number of high-contribution heads.

Results

Quantitative

Compared with retraining-based methods, Spectral Nullification (SN) achieves a similar level of concept removal performance while preserving generation quality (CLIP score).

Quality Tradeoff

Concept removal performance vs generation quality (CLIP score). Spectral Nullification (SN) achieves performance comparable to existing methods without retraining.

To verify that concept-specific spectral components actually capture the meaning of the concept, we reconstruct the text difference with the spectral components and then check its alignment with the CLIP vocabulary.

Token Alignment

The text difference is reconstructed with concept-specific spectral components and compared against the 49,408 tokens of the CLIP vocabulary. Nudity-related tokens rank at the top.
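
One way to sketch this check, under assumptions the page does not confirm: the text difference is the embedding of a concept prompt minus that of a neutral prompt, reconstruction means projecting that difference onto the selected left singular vectors, and alignment is measured by cosine similarity against a token embedding table. The toy table below has 50 rows, whereas the real vocabulary has 49,408 entries. All names and component indices are hypothetical.

```python
import numpy as np

def reconstruct(delta, u, component_idx):
    """Project the text difference onto the span of the chosen left singular vectors."""
    basis = u[:, np.asarray(component_idx)]
    return basis @ (basis.T @ delta)

def rank_tokens(vector, token_table, top_k=5):
    """Token ids sorted by cosine similarity to the vector, highest first."""
    table = token_table / np.linalg.norm(token_table, axis=1, keepdims=True)
    sims = table @ (vector / np.linalg.norm(vector))
    return np.argsort(sims)[::-1][:top_k]

rng = np.random.default_rng(0)
w_ov = rng.standard_normal((16, 12))           # toy OV matrix
u, s, vt = np.linalg.svd(w_ov, full_matrices=False)
token_table = rng.standard_normal((50, 16))    # stand-in for the token embedding table
delta = rng.standard_normal(16)                # stand-in for concept minus neutral embedding
recon = reconstruct(delta, u, [0, 1, 2])       # hypothetical concept components
top = rank_tokens(recon, token_table)
```

In practice the returned ids would be decoded back to token strings to inspect which words the subspace aligns with.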

We use t-SNE and Jaccard similarity to analyze whether the concept subspaces of different heads share a consistent structure.

t-SNE Jaccard

t-SNE visualization and Jaccard similarity of concept subspaces across heads. Subspaces for the same concept show similar structure across heads as well.
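
The page does not state which sets the Jaccard similarity is computed over. The sketch below assumes, purely for illustration, that each head's concept subspace is summarized by a set of top-aligned token ids, and compares heads pairwise. The ids are made up.

```python
import numpy as np

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pairwise_jaccard(sets):
    """Symmetric matrix of Jaccard similarities between all pairs of sets."""
    n = len(sets)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = jaccard(sets[i], sets[j])
    return sim

# Hypothetical top-aligned token ids for three heads.
head_tokens = [
    [3, 8, 15, 21],
    [3, 8, 15, 40],
    [5, 9, 33, 47],
]
sim = pairwise_jaccard(head_tokens)
```

Here the first two heads share three of five distinct ids, giving a similarity of 0.6, while the third head shares none with the first.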

Qualitative

Applying SN to adversarial prompts from the I2P benchmark effectively removes inappropriate content.

I2P Comparison

Results on adversarial prompts from the I2P benchmark. Left: SD v1.4 (before applying SN); right: after applying SN.

BibTeX

@inproceedings{bae2026mechanistic,
  title={Mechanistic Dissection of Cross-Attention Subspaces
         in Text-to-Image Diffusion Models},
  author={Bae, Jun-Hyun and Jo, Wonyong and Lee, Jaehyup
          and Jung, Heechul},
  booktitle={Proceedings of the AAAI Conference
             on Artificial Intelligence},
  year={2026}
}
