AAAI 2026
Jun-Hyun Bae, Wonyong Jo, Jaehyup Lee, Heechul Jung
Kyungpook National University



Abstract

Text-to-image diffusion models utilize cross-attention to integrate textual information into the visual latent space, yet the transformation from text embeddings to latent features remains largely unexplored. We provide a mechanistic analysis of the output-value (OV) circuits within cross-attention layers through spectral analysis via singular value decomposition. Our analysis demonstrates that semantic concepts are encoded in low-dimensional subspaces spanned by singular vectors in OV circuits across cross-attention heads. To verify this, we intervene on concept-related components in the diffusion process, demonstrating that intervention on identified spectral components affects conceptual changes. We further validate these findings by examining visual outputs of isolated subspaces and their alignment with text embedding space. Through this mechanistic understanding, we demonstrate that simply nullifying these spectral components can achieve targeted concept removal with performance comparable to existing methods while providing interpretability.


Overview

We show how the OV circuit in cross-attention transforms text into visual features, and use this to propose a method that removes concepts without retraining.

  1. Spectral Decomposition: decompose \(\mathbf{W}_{\text{OV}}\) with SVD to extract independent text-to-visual transformation pathways.
  2. Concept Localization: semantic concepts such as “Van Gogh style” and “nudity” turn out to be concentrated in a small number of spectral components out of the full spectrum.
  3. Spectral Nullification: removing only those components enables targeted concept removal at a level comparable to existing methods, without retraining.

Spectral Isolation

Results of activating only the spectral components of each concept. Style concepts (Van Gogh, Picasso) decompose into visual patterns such as texture and color, whereas the content concept (Nudity) yields a holistic representation that includes human body shapes. This suggests that the model uses qualitatively different encoding schemes depending on the type of concept.


Method

The key to transforming text into visual features in cross-attention is the \(\mathbf{W}_{\text{OV}}\) matrix. Text embeddings organize semantic information along distinct semantic axes, and \(\mathbf{W}_{\text{OV}}\) performs concept-specific transformations through low-dimensional subspaces aligned with those axes. Decomposing this matrix with SVD makes each spectral component an independent text-to-visual transformation pathway, and semantic concepts such as “Van Gogh style” or “nudity” are concentrated in a small number of components out of the full spectrum.
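The decomposition described above, \(\mathbf{W}_{\text{OV}} = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top\), can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sizes (`d_text`, `d_head`, `d_latent`), the random stand-in weights, and the row-vector convention `W_OV = W_V @ W_O` are all chosen here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_head, d_latent = 768, 64, 320   # assumed sizes, for illustration only

# Hypothetical per-head value and output projections (random stand-ins).
W_V = rng.standard_normal((d_text, d_head)) / np.sqrt(d_text)
W_O = rng.standard_normal((d_head, d_latent)) / np.sqrt(d_head)

# OV circuit of one head: maps a text embedding directly to a latent update.
W_OV = W_V @ W_O                          # shape (d_text, d_latent), rank <= d_head

# Thin SVD, W_OV = U @ diag(S) @ Vt, truncated to the d_head non-zero components.
U, S, Vt = np.linalg.svd(W_OV, full_matrices=False)
U, S, Vt = U[:, :d_head], S[:d_head], Vt[:d_head]

# Component i is a rank-1 pathway: read the text direction U[:, i],
# scale by S[i], and write the visual direction Vt[i].
x = rng.standard_normal(d_text)           # stand-in text token embedding
per_component = (x @ U)[:, None] * S[:, None] * Vt   # shape (d_head, d_latent)

# The individual pathways sum back to the full OV output.
assert np.allclose(per_component.sum(axis=0), x @ W_OV)
```

Each row of `per_component` is the part of the head's output carried by one spectral component, which is what makes per-component intervention possible.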

For a given concept, the high-contribution heads make up only about 10% of all heads, and scaling the outputs of those heads adjusts the strength of the concept.
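Head-level modulation of this kind can be sketched as follows. The function name `modulate_heads`, the per-head `scores`, and the toy tensor sizes are hypothetical; the page does not specify how the contribution score is computed, so a random score stands in for it here.

```python
import numpy as np

def modulate_heads(head_outputs, scores, alpha, top_frac=0.10):
    """Scale the top `top_frac` of heads (ranked by `scores`) by `alpha`.

    head_outputs: (n_heads, n_tokens, d) per-head cross-attention outputs.
    scores:       (n_heads,) hypothetical concept-contribution score per head.
    """
    n_heads = head_outputs.shape[0]
    k = max(1, round(top_frac * n_heads))
    top = np.argsort(scores)[::-1][:k]        # highest-scoring heads first
    scale = np.ones(n_heads)
    scale[top] = alpha
    return head_outputs * scale[:, None, None], top

rng = np.random.default_rng(0)
head_outputs = rng.standard_normal((20, 4, 8))   # toy sizes
scores = rng.random(20)                          # stand-in contribution scores

suppressed, top = modulate_heads(head_outputs, scores, alpha=0.0)
amplified, _ = modulate_heads(head_outputs, scores, alpha=2.0)
```

With `alpha=0.0` the selected heads are silenced, and with `alpha > 1` they are amplified; all other heads pass through unchanged.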

Head Modulation

Outputs of the high-contribution heads (~10% of all heads) for the "Van Gogh" concept, scaled by $\alpha$.

Head-level manipulation, however, is limited in precision. A single attention head encodes several concepts at once (polysemanticity), so scaling the whole head also changes concepts that were not targeted. Manipulating individual spectral components resolves this, and it allows different concept dimensions, such as style and content, to be separated and adjusted independently.

Spectral vs Head

Spectral modulation (top) vs head-level modulation (bottom). Manipulating a whole head also changes things other than the target concept, whereas manipulating individual spectral components controls only that concept precisely.
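A minimal sketch of component-level modulation, assuming it amounts to rescaling selected singular values of a head's OV matrix. The helper `modulate_components`, the toy matrix, and the chosen indices are illustrative and not taken from the paper's code.

```python
import numpy as np

def modulate_components(W_OV, idx, alpha):
    """Rescale the singular values at positions `idx` by `alpha` (alpha=0 nullifies them)."""
    U, S, Vt = np.linalg.svd(W_OV, full_matrices=False)
    scale = np.ones_like(S)
    scale[idx] = alpha
    return (U * (S * scale)) @ Vt, U

rng = np.random.default_rng(0)
W_OV = rng.standard_normal((16, 12))          # toy stand-in for one head's OV matrix
W_new, U = modulate_components(W_OV, idx=[1, 3], alpha=0.0)

# Text direction of a nullified component: it no longer reaches the latent.
print(np.abs(U[:, 1] @ W_new).max())
# Text direction of an untouched component: its output is unchanged.
print(np.allclose(U[:, 0] @ W_new, U[:, 0] @ W_OV))
```

Because the singular vectors are orthonormal, editing one component leaves the pathways of all other components in the same head intact, which is the precision advantage over scaling the whole head.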

The figure below shows the distribution of concept contribution across all heads. Most heads contribute almost nothing to a given concept, and only a few heads show a high contribution. Even within the same head, Van Gogh, Monet, and Picasso activate different singular vectors, and the singular vectors with high contribution are not necessarily those with large singular values (low indices). Each concept shows its own activation pattern.

Head Distribution

Distribution of concept contribution per head. Concept information is concentrated in a small number of high-contribution heads.
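The observation that a high-contribution singular vector need not have a large singular value can be illustrated with a toy score. The scoring rule used here (magnitude of a text difference vector routed through each component) is an assumption made for this sketch, as are the matrix sizes and the constructed `delta`.

```python
import numpy as np

rng = np.random.default_rng(0)
W_OV = rng.standard_normal((32, 16))          # toy OV matrix for one head
U, S, Vt = np.linalg.svd(W_OV, full_matrices=False)

# Hypothetical text difference vector (concept prompt minus base prompt),
# built so that it lies mostly along the text direction of component 5.
delta = U[:, 5] + 0.1 * U[:, 0]

# Assumed contribution score: |<delta, u_i>| * sigma_i for each component i.
contribution = np.abs(delta @ U) * S
ranking = np.argsort(contribution)[::-1]
print(ranking[:3])
```

Component 0 has the largest singular value, yet component 5 ranks first because the concept direction aligns with it; the score depends on alignment as well as on magnitude.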


Results

Concept Removal Benchmark

We evaluate the NSFW concept-removal performance of Spectral Nullification (SN) on five adversarial prompt benchmarks: Ring-A-Bell (K16, K38, K77), I2P, MMA, P4D, and UnLearnDiffAtk. The evaluation metric is Attack Success Rate (ASR, lower is better).

Attack Success Rate (%, ↓) across adversarial benchmarks and generation quality (FID ↓, 1,000 COCO captions). In each ASR column the lowest value is bold and the second lowest is underlined. Method types are distinguished by background color: gray for training-based, blue for closed-form, green for inference-time, dark blue for spectral.

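| Method | Ring-A-Bell K16 | Ring-A-Bell K38 | Ring-A-Bell K77 | I2P | MMA | P4D | UnLearn | FID ↓ |
|---|---|---|---|---|---|---|---|---|
| SD v1.4 | 97.89 | 94.74 | 87.37 | 25.03 | 68.10 | 69.76 | 50.70 | - |
| ESD | 76.84 | 78.95 | 74.74 | 13.04 | 24.80 | 50.24 | 26.06 | 38.95 |
| CA | 88.42 | 88.42 | 84.21 | 19.30 | 58.50 | 63.41 | 44.37 | 26.02 |
| MACE | 89.47 | 95.79 | 93.68 | 25.56 | 66.00 | 68.29 | 50.70 | 33.38 |
| SDID | 95.79 | 91.58 | 84.21 | 23.12 | 62.00 | 66.83 | 48.59 | 39.74 |
| UCE | 22.11 | 18.95 | 21.05 | 8.06 | 41.00 | 38.05 | 21.13 | 34.43 |
| RECE | 10.53 | 9.47 | 7.37 | 4.24 | 25.00 | 21.46 | 9.15 | 40.00 |
| SLD-Medium | 68.42 | 60.00 | 50.53 | 8.38 | 48.70 | 43.90 | 23.94 | 32.09 |
| SLD-Strong | 18.95 | 10.53 | 6.32 | 2.33 | 7.70 | 11.71 | 7.04 | 41.34 |
| SAFREE | 65.26 | 55.79 | 45.26 | 6.26 | 29.90 | 38.54 | 14.79 | 40.71 |
| SN (Ours) | 41.05 | 35.79 | 30.53 | 4.24 | 17.60 | 18.54 | 8.45 | 40.67 |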

SN ties with RECE on I2P (4.24%), behind only SLD-Strong (2.33%), and ranks second on each of MMA, P4D, and UnLearnDiffAtk. SLD-Strong, which ranks first overall, is an inference-time guidance method that intervenes throughout the generation process, whereas SN only removes spectral components of the matrix without any additional training, so the two approaches are fundamentally different. Notably, without additional training SN outperforms all of the retraining-based methods listed in the table (ESD, CA, MACE, SDID). In terms of generation quality, SN reaches an FID of 40.67, in the same range as RECE (40.00), SAFREE (40.71), and SLD-Strong (41.34), so it remains competitive in the quality-removal trade-off.
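The rankings quoted above can be checked directly against the table. The snippet below only re-enters the ASR values from the table and counts, for each benchmark, how many methods score strictly lower than SN; the helper name `rank_of` is introduced here for illustration.

```python
# ASR (%) per method, in the order of the table columns.
benchmarks = ["K16", "K38", "K77", "I2P", "MMA", "P4D", "UnLearnDiffAtk"]
asr = {
    "SD v1.4":    [97.89, 94.74, 87.37, 25.03, 68.10, 69.76, 50.70],
    "ESD":        [76.84, 78.95, 74.74, 13.04, 24.80, 50.24, 26.06],
    "CA":         [88.42, 88.42, 84.21, 19.30, 58.50, 63.41, 44.37],
    "MACE":       [89.47, 95.79, 93.68, 25.56, 66.00, 68.29, 50.70],
    "SDID":       [95.79, 91.58, 84.21, 23.12, 62.00, 66.83, 48.59],
    "UCE":        [22.11, 18.95, 21.05, 8.06, 41.00, 38.05, 21.13],
    "RECE":       [10.53, 9.47, 7.37, 4.24, 25.00, 21.46, 9.15],
    "SLD-Medium": [68.42, 60.00, 50.53, 8.38, 48.70, 43.90, 23.94],
    "SLD-Strong": [18.95, 10.53, 6.32, 2.33, 7.70, 11.71, 7.04],
    "SAFREE":     [65.26, 55.79, 45.26, 6.26, 29.90, 38.54, 14.79],
    "SN (Ours)":  [41.05, 35.79, 30.53, 4.24, 17.60, 18.54, 8.45],
}

def rank_of(method, column):
    """Competition rank (1 = lowest ASR); tied methods share a rank."""
    j = benchmarks.index(column)
    return 1 + sum(row[j] < asr[method][j] for row in asr.values())

sn_ranks = {b: rank_of("SN (Ours)", b) for b in benchmarks}
print(sn_ranks)

# SN against the training-based methods on every benchmark.
training_based = ["ESD", "CA", "MACE", "SDID"]
sn_beats_training = all(
    asr[m][j] > asr["SN (Ours)"][j]
    for m in training_based
    for j in range(len(benchmarks))
)
print(sn_beats_training)
```

On the Ring-A-Bell columns SN ranks fourth, behind UCE, RECE, and SLD-Strong; on the remaining four benchmarks it ranks second.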

Quality Tradeoff

Concept removal performance (P4D ASR) vs generation quality (CLIP score). SN achieves a trade-off similar to existing methods without retraining. SLD-Strong has the lowest ASR, but its generation quality drops as well.

Semantic Validation of the Spectral Subspace

We verify in two ways whether the identified spectral subspace actually captures the meaning of the corresponding concept.

Text-space alignment: we reconstruct the text difference vector from the concept-specific spectral components and then compare its cosine similarity against all 49,408 tokens in the CLIP vocabulary. For the nudity concept, tokens such as “nude”, “naked”, “topless”, “erotica”, and “nsfw” rank at the top, confirming that the spectral subspace accurately captures the intended meaning.

Token Alignment

Cosine similarity between the vector reconstructed from the concept spectral components and the CLIP vocabulary. Tokens belonging to the concept rank at the top.
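The text-space alignment check can be sketched on toy data. Everything below is a stand-in: the vocabulary is random rather than CLIP's, `concept_idx` is a hypothetical set of concept components, and token 42 is deliberately planted inside the concept subspace so that the expected outcome is known. Reconstruction is taken to mean orthogonal projection onto the text-side singular vectors, which is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, vocab_size = 128, 1000     # toy sizes; the real CLIP vocabulary has 49,408 tokens

W_OV = rng.standard_normal((d_text, 64))                # toy OV matrix for one head
U, _, _ = np.linalg.svd(W_OV, full_matrices=False)      # columns of U: text-side directions

concept_idx = [3, 9, 20]           # hypothetical concept-specific components
U_c = U[:, concept_idx]

# Stand-in vocabulary; token 42 is planted inside the concept subspace.
E = rng.standard_normal((vocab_size, d_text)) / np.sqrt(d_text)
E[42] = U_c @ np.array([1.0, -0.5, 0.8])

# Text difference vector: the concept direction mixed with unrelated content.
delta = E[42] + 0.5 * E[7]

# Reconstruct delta from the concept components only (orthogonal projection).
recon = U_c @ (U_c.T @ delta)

# Rank vocabulary tokens by cosine similarity to the reconstruction.
cos = (E @ recon) / (np.linalg.norm(E, axis=1) * np.linalg.norm(recon))
top_tokens = np.argsort(cos)[::-1][:5]
print(top_tokens)
```

In the real setting the ranked list would be decoded back to token strings, and the check is whether concept-related words appear at the top.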

Causal validation (t-SNE) and inter-concept structure (Jaccard): when the concept-related spectral components are removed, the base and concept prompt clusters merge in a t-SNE of the head outputs. This is causal evidence that these components actually encode the concept. The Jaccard similarity analysis shows that semantically similar concepts (Van Gogh ↔ Monet) share more spectral components, while each concept still retains its own spectral signature.

t-SNE Jaccard

(a) t-SNE before and after removing the spectral components. After removal, the base and concept clusters merge. (b) Jaccard similarity between concepts. Similar concepts overlap, but each concept retains its own signature.
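Jaccard similarity between two concepts is the size of the intersection of their component sets divided by the size of their union. The index sets below are made-up values chosen only to illustrate the computation; they are not results from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two collections of indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical sets of concept-related component indices (illustrative only).
components = {
    "Van Gogh": {3, 9, 20, 31, 44},
    "Monet":    {3, 9, 27, 31, 50},
    "Picasso":  {5, 9, 12, 38, 61},
}

vg_monet = jaccard(components["Van Gogh"], components["Monet"])      # 3 shared of 7
vg_picasso = jaccard(components["Van Gogh"], components["Picasso"])  # 1 shared of 9
print(vg_monet, vg_picasso)
```

A higher value for one pair than another, with neither reaching 1.0, corresponds to the pattern described above: partial overlap between similar concepts and a distinct signature for each.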

Qualitative

Applying SN to adversarial prompts from the I2P benchmark effectively removes inappropriate content.

I2P Comparison

Results on adversarial prompts from the I2P benchmark. Left: SD v1.4 (before applying SN). Right: after applying SN.

Scalability & Practical Notes

  • SD v2.1: 195 cross-attention heads, 16 layers, 12,480 singular vectors. Concepts are removed by nullifying the top 20% of components.
  • SDXL: 70 cross-attention layers, 83,200 singular vectors (a 6.67× increase). Concepts are distributed more widely, so 20–30% of components must be removed, but subspace localization is preserved.
  • Compute cost: the full SVD decomposition for 4 prompt pairs takes 33 seconds on an A100 GPU. The cache of decomposed matrices is 2.17 GB. No additional training is required.
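The counts in the list above are consistent with simple arithmetic. The sketch assumes every head contributes the same number of singular vectors and that the percentages apply to the total number of singular vectors; neither is stated explicitly on this page.

```python
heads_sd21 = 195
vectors_sd21 = 12_480
vectors_sdxl = 83_200

# Singular vectors per head, assuming all heads have equal rank.
rank_per_head = vectors_sd21 // heads_sd21      # 64

# Growth in the number of singular vectors from SD v2.1 to SDXL.
ratio = vectors_sdxl / vectors_sd21

# Components nullified if the percentages apply to the total count (assumption).
nullified_sd21 = round(0.20 * vectors_sd21)
nullified_sdxl = (round(0.20 * vectors_sdxl), round(0.30 * vectors_sdxl))

print(rank_per_head, round(ratio, 2), nullified_sd21, nullified_sdxl)
```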