AV // SEC

Research Archive

Security Advisories & Core Publications

AI Security // Published

Mechanistic Alignment Abliteration in Qwen3.5: Steering Activation States

A deep dive into bypassing alignment guardrails in Qwen3.5 models at the weight level. By identifying safety representation subspaces and orthogonalizing steering weights, we study representation drift and guardrail activations without full model fine-tuning.

#mechanistic-interpretability #alignment-bypass #activation-patching #weight-steering
2026-05-18 // 12 min read