
ESE Spring Seminar – “Catch M(oor)e If You Can: Agile Hardware/Software Co-Design for Hyperscale Cloud Systems”

March 25, 2024 at 11:00 AM - 12:00 PM
Details
Event Category: Colloquium
Organizer: Electrical and Systems Engineering
Phone: 215-898-6823
Venue: Glandt Forum, Singh Center for Nanotechnology, 3205 Walnut Street, Philadelphia, PA 19104

    Global reliance on cloud services, powered by transformative technologies like generative AI, machine learning, and big-data analytics, is driving exponential growth in demand for hyperscale cloud compute infrastructure. Meanwhile, the breakdown of classical hardware scaling (e.g., Moore’s Law) is hampering growth in compute supply. Building domain-specific hardware can address this supply-demand gap, but catching up with exponential demand requires developing new hardware rapidly and with confidence that performance/efficiency gains will compound in the context of a complete system. These are challenging tasks given the status quo in hardware design, even before accounting for the immense scale of cloud systems.

    This talk will focus on two themes of my work: (1) Developing radical new agile, end-to-end hardware/software co-design tools that challenge the status quo in hardware design for systems of all scales and unlock the ability to innovate on new hardware at datacenter scale. (2) Leveraging these tools and insights from hyperscale datacenter fleet profiling to architect and implement state-of-the-art domain-specific hardware that addresses key efficiency challenges in hyperscale cloud systems.

    I will first cover my work creating the award-winning and widely used FireSim FPGA-accelerated hardware simulation platform, which provides unprecedented hardware/software co-design capabilities. FireSim automatically constructs high-performance, cycle-exact, scale-out simulations of novel hardware designs derived from the tapeout-friendly RTL code that describes them, empowering hardware designers and domain experts alike to directly iterate on new hardware designs in hours rather than years. FireSim also unlocks innovation in datacenter hardware with the unparalleled ability to scale to massive, distributed simulations of thousand-node networked datacenter clusters with specialized server designs and complete control over the datacenter architecture. I will then briefly cover my work co-creating the also widely used Chipyard platform for agile construction, simulation (including FireSim), and tape-out of specialized RISC-V System-on-Chip (SoC) designs using a novel, RTL-generator-driven approach.

    Next, I will discuss my work in collaboration with Google on Hyperscale SoC, a cloud-optimized server chip built, evaluated, and taped out with FireSim and Chipyard. Hyperscale SoC includes my work on several novel domain-specific accelerators (DSAs) for expensive but foundational operations in hyperscale servers, including (de)serialization, (de)compression, and more. Hyperscale SoC demonstrates a new paradigm of data-driven, end-to-end hardware/software co-design, combining key insights from profiling Google’s worldwide datacenter fleet with the ability to rapidly build and evaluate novel hardware designs in FireSim/Chipyard. This instance of Hyperscale SoC is just the beginning; I will conclude by covering the wide-ranging opportunities that can now be explored for radically redesigning next-generation hyperscale cloud datacenters.