Name: Spring 2025 GRASP Seminar: Mike Shou, National University of Singapore, “Video intelligence in the era of multimodal”
Start: 2025-05-08T10:00:00-04:00
End: 2025-05-08T11:00:00-04:00
Location: Levine 307

Spring 2025 GRASP Seminar: Mike Shou, National University of Singapore, “Video intelligence in the era of multimodal”

May 8, 2025 at 10:00 AM - 11:00 AM

Details

Date: May 8, 2025

Time: 10:00 AM - 11:00 AM

Event Category: Seminar

Organizer

General Robotics, Automation, Sensing and Perception (GRASP) Lab

Email: grasplab@seas.upenn.edu

Venue

Levine 307, 3330 Walnut Street
Philadelphia, PA 19104

This will be a hybrid event with a VIRTUAL speaker. The GRASP seminar will be streamed for in-person attendance in Levine 307 and virtual attendance on Zoom.

ABSTRACT

The past few years have witnessed great success in video intelligence, supercharged by multimodal models. In this talk, I will start with a brief overview of our efforts in building video-language models for understanding and diffusion models for video generation. Yet video understanding and generation have always been two separate research pillars, despite their strong synergy. This motivates us to develop Show-o, a single unified transformer that can do both multimodal understanding and generation. Show-o is the first to unify autoregressive and discrete diffusion modeling, flexibly supporting a wide range of vision-language tasks of any input/output format, including visual question answering, text-to-image/video generation, and generation of video keyframes with captions, all within a single 1.3B-parameter transformer. Show-o sheds light on building the next-generation multimodal video foundation model and has already sparked many follow-up works.