ASSET Seminar: “Alignment and Control with Representation Engineering”
April 9, 2025 at 12:00 PM - 1:15 PM
Details
Abstract:
Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass common safeguards intended to prevent these models from generating harmful output. Notably, these attacks can transfer to other models—even proprietary ones—potentially compromising a wide range of AI systems with a single exploit. This surprising fragility underscores a critical weakness in current AI safeguards.
In this talk, we illustrate how these attacks are discovered and describe several recent advances that take advantage of models’ internal representations to thwart them. Unlike much prior work that relies on adversarial training methods, this approach directly controls the neural representations responsible for harmful and unwanted behaviors while remaining agnostic to particular attacks. Notably, in stark contrast with prior work, we show that these methods can remain effective while preserving the model’s performance on non-adversarial inputs. Our findings suggest that achieving robust safety in generative models may be an attainable goal.
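For intuition, the sketch below illustrates one common style of representation-level intervention (removing an estimated "unwanted behavior" direction from hidden states) using synthetic activations in place of a real LLM. The names d_model, refusal_dir, and ablate are illustrative assumptions for this sketch, not the speaker's actual method.

    # Minimal sketch of representation-level control via directional ablation,
    # on synthetic activations rather than a real model's hidden states.
    import torch

    torch.manual_seed(0)
    d_model = 64

    # Pretend these are hidden states collected at one layer for two prompt sets.
    bias = 2.0 * torch.nn.functional.normalize(torch.ones(d_model), dim=0)
    harmful_acts  = torch.randn(100, d_model) + bias
    harmless_acts = torch.randn(100, d_model)

    # 1. Estimate the direction associated with the unwanted behavior
    #    as the (normalized) difference of the two mean activations.
    refusal_dir = harmful_acts.mean(0) - harmless_acts.mean(0)
    refusal_dir = refusal_dir / refusal_dir.norm()

    # 2. At inference time, subtract each hidden state's component along that
    #    direction, leaving the rest of the representation untouched.
    def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
        return hidden - (hidden @ direction).unsqueeze(-1) * direction

    steered = ablate(harmful_acts, refusal_dir)
    print("mean component along direction before:", (harmful_acts @ refusal_dir).mean().item())
    print("mean component along direction after: ", (steered @ refusal_dir).mean().item())

Because the intervention edits representations directly rather than retraining against specific attack strings, it does not depend on knowing any particular attack in advance.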
Zoom Link: https://upenn.zoom.us/j/95869536469

