FOLDS seminar: Theory and practice of LLM quantization
October 2, 2025 at 12:00 PM - 1:00 PM
Zoom link: https://upenn.zoom.us/j/98220304722
Modern LLMs process information by repeatedly applying a single basic primitive: matrix multiplication. Estimates suggest that roughly 60-84% of the energy consumed by LLMs goes into memory load/store operations. How can we reduce this power consumption? Tokens start out as roughly 16-bit integer IDs but are mapped to float vectors with thousands of dimensions, suggesting very low information density per dimension. It is thus unsurprising that there has been much success in reducing the precision of both weights and activations without much loss in LLM performance. In this talk we will present an information-theoretic analysis of quantized representations and show how it led us to create NestQuant, a new state-of-the-art algorithm for quantizing weights, the KV cache, and activations (ICML 2025).
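To make the idea of reducing precision concrete, here is a minimal sketch of round-to-nearest uniform quantization of a weight matrix with one scale per output row. This is a baseline illustration only, not the NestQuant algorithm discussed in the talk; the function names and the 4-bit setting are illustrative choices.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Round-to-nearest uniform quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 1024)).astype(np.float32)   # toy weight matrix
q, s = quantize_rtn(w, bits=4)
w_hat = dequantize(q, s)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error at 4 bits: {rel_err:.3f}")
```

Even this naive scheme stores each weight in 4 bits instead of 16 while keeping the relative reconstruction error modest, which is why more sophisticated quantizers can push precision much lower with little loss.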

