ASSET Seminar: “Alignment and Control with Representation Engineering”
April 9, 2025 at 12:00 PM - 1:15 PM
Details
Abstract:
Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass common safeguards intended to prevent these models from generating harmful output. Notably, these attacks can transfer to other models—even proprietary ones—potentially compromising a wide range of AI systems with a single exploit. This surprising fragility underscores a critical weakness in current AI safeguards.
In this talk, we illustrate how these attacks are discovered and describe several recent advances that take advantage of models’ internal representations to thwart them. Unlike much prior work that relies on adversarial training methods, this approach directly controls the neural representations responsible for harmful and unwanted behaviors while remaining agnostic to particular attacks. Notably, in stark contrast with prior work, we show that these methods can remain effective while preserving the model’s performance on non-adversarial inputs. Our findings suggest that achieving robust safety in generative models may be an attainable goal.
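For intuition, the sketch below illustrates one common style of representation-level intervention (removing an estimated "unwanted behavior" direction from hidden states) using synthetic activations in place of a real LLM. The names d_model, refusal_dir, and ablate are illustrative assumptions for this sketch, not the speaker's actual method.

    # Minimal sketch of representation-level control via directional ablation,
    # on synthetic activations rather than a real model's hidden states.
    import torch

    torch.manual_seed(0)
    d_model = 64

    # Pretend these are hidden states collected at one layer for two prompt sets.
    bias = 2.0 * torch.nn.functional.normalize(torch.ones(d_model), dim=0)
    harmful_acts  = torch.randn(100, d_model) + bias
    harmless_acts = torch.randn(100, d_model)

    # 1. Estimate the direction associated with the unwanted behavior
    #    as the (normalized) difference of the two mean activations.
    refusal_dir = harmful_acts.mean(0) - harmless_acts.mean(0)
    refusal_dir = refusal_dir / refusal_dir.norm()

    # 2. At inference time, subtract each hidden state's component along that
    #    direction, leaving the rest of the representation untouched.
    def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
        return hidden - (hidden @ direction).unsqueeze(-1) * direction

    steered = ablate(harmful_acts, refusal_dir)
    print("mean component along direction before:", (harmful_acts @ refusal_dir).mean().item())
    print("mean component along direction after: ", (steered @ refusal_dir).mean().item())

Because the intervention edits representations directly rather than retraining against specific attack strings, it does not depend on knowing any particular attack in advance.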
Zoom Link: https://upenn.zoom.us/j/95869536469

