Spring 2025 GRASP Seminar: Mike Shou, National University of Singapore, “Video intelligence in the era of multimodal”

May 8, 2025 at 10:00 AM - 11:00 AM
Details
Date: May 8, 2025
Time: 10:00 AM - 11:00 AM
Event Category: Seminar
Organizer
General Robotics, Automation, Sensing and Perception (GRASP) Lab
Venue
Levine 307, 3330 Walnut Street
Philadelphia, PA 19104
This will be a hybrid event with a VIRTUAL speaker. The GRASP seminar will be streamed for in-person attendance in Levine 307 and virtual attendance on Zoom.

ABSTRACT

The past few years have witnessed great success in video intelligence, supercharged by multimodal models. In this talk, I will start with a brief overview of our efforts in building video-language models for understanding and diffusion models for video generation. Yet video understanding and generation have long been two separate research pillars, despite their strong synergy. This motivates us to develop Show-o, a single unified transformer that can do both multimodal understanding and generation. Show-o is the first to unify autoregressive and discrete diffusion modeling, flexibly supporting a wide range of vision-language tasks of any input/output format, including visual question answering, text-to-image/video generation, and generation of video keyframes with captions, all within a single 1.3B-parameter transformer. Show-o sheds light on building the next-generation multimodal video foundation model and has already sparked many follow-up works.