AI Security · Published
Mechanistic Alignment Abliteration in Qwen3.5: Steering Activation States
A deep dive into bypassing alignment guardrails in Qwen3.5 models at the weight level. By identifying safety representation subspaces and orthogonalizing steering weights, we study representation drift and guardrail activations without full model fine-tuning.
#mechanistic-interpretability #alignment-bypass #activation-patching #weight-steering