ASSET Seminar: “Alignment and Control with Representation Engineering”
Amy Gutmann Hall, Room 414
3333 Chestnut Street, Philadelphia, United States
Abstract: Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass the safeguards put in place to prevent these models from generating harmful output. Notably, these attacks can transfer to other models, even proprietary ones, potentially compromising a wide range of AI systems with a single exploit. This surprising fragility underscores a critical weakness in […]