Multimodal LLM and Generative AI Workloads - Workload Characterization and Implications to Software Stack, Compilers, Computer Architectures and Communications

Sunday, September 15

   
14:00 Invited Talks (25 min each)
  Learning Discrete Diffusion on Finite Symmetric Groups
Speaker: Prof. Renjie Liao, University of British Columbia
  KV Cache Reduction through Key-Token Identification for Efficient Generative Inference
Speaker: Prof. Prashant Nair, University of British Columbia
  On-Device Multimodal LLM Workloads and Inference Acceleration
Speaker: Prof. Di Niu, University of Alberta
  Every Bit Matters: A Hardware/Software Approach for Enabling More Powerful Machine Learning Models
Speaker: Prof. Andreas Moshovos, University of Toronto and Vector Institute
15:40 Tea Break (10 min)
15:50 Keynote Speech: Continuous Optimization and Adaptation of Large Foundation Models on Edge Devices (30 min)
Speaker: Chao Gao, Huawei Canada
16:20 Panel Discussion and Q&A (60 min)
Moderator: Prof. Zhenman Fang, Simon Fraser University

Keynote Speech: Continuous Optimization and Adaptation of Large Foundation Models on Edge Devices

Speaker: Chao Gao, Huawei Canada

Abstract: AI algorithms have achieved great success in numerous application scenarios, but at the expense of large computation costs. In particular, training large language models (LLMs) containing billions of parameters has taken thousands of GPUs and CPUs. This computational cost hinders the deployment of these successes to real-world scenarios where changes occur constantly over time. It is cumbersome, if not impossible, to re-optimize from scratch for each new requirement or new data stream. Therefore, continual adaptation of large foundation models is urgently needed in the following two scenarios: 1) the continual update of these models (e.g., with new user data), and 2) the continual deployment of these models in resource-constrained environments, in particular on edge devices. Continual learning has emerged as one of the newest frontiers of today’s general AI and machine learning research. The main outcomes of this research are resource-efficient algorithms that let models adapt to new tasks or new changes without losing current knowledge. We expect this research to further advance foundation models by equipping them with continual adaptation, facilitating both the development of large foundation models and their deployment to resource-constrained platforms such as edge devices. In this talk, I will discuss relevant research directions centering on these needs.

Invited Talks

Learning Discrete Diffusion on Finite Symmetric Groups

Speaker: Prof. Renjie Liao, University of British Columbia

Abstract: In this talk, I will introduce SymmetricDiffusers, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over finite symmetric groups S_n by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded “denoising schedule” to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performance on tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems.

At the end, I will also mention some of our recent work on LLMs for math reasoning and multi-modal LLMs for visual reasoning.
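
As an illustration of the forward transition named in the abstract above, the following is a minimal sketch of a riffle shuffle applied repeatedly to a permutation. It assumes the standard Gilbert-Shannon-Reeds model (binomial cut, then interleaving with probability proportional to pile size); the abstract does not say which riffle model the talk uses, and the function names `riffle_shuffle` and `forward_diffusion` are hypothetical.

```python
import random


def riffle_shuffle(perm, rng=random):
    """Apply one Gilbert-Shannon-Reeds riffle shuffle (assumed model)."""
    n = len(perm)
    # Cut the deck at a Binomial(n, 1/2) position.
    cut = sum(rng.random() < 0.5 for _ in range(n))
    left, right = list(perm[:cut]), list(perm[cut:])
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        remaining_left = len(left) - i
        remaining_right = len(right) - j
        # Drop the next card from a pile with probability
        # proportional to the number of cards left in that pile.
        if rng.random() * (remaining_left + remaining_right) < remaining_left:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out


def forward_diffusion(perm, steps, seed=0):
    """Run `steps` riffle shuffles; more steps move the result toward uniform."""
    rng = random.Random(seed)
    for _ in range(steps):
        perm = riffle_shuffle(perm, rng)
    return perm


if __name__ == "__main__":
    print(forward_diffusion(list(range(10)), steps=7))
```

Each shuffle preserves the relative order of cards within each pile, so a single step changes the permutation only mildly; the number of steps plays the role of the diffusion length mentioned in the abstract.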

Bio: Renjie Liao is an Assistant Professor (since Jan. 2022) in ECE and CS (associated member) Departments at UBC. He is also a Faculty Member at Vector Institute for AI and a Canada CIFAR AI Chair. He was a Visiting Faculty Researcher at Google Brain, working with Geoffrey Hinton and David Fleet. He received his PhD from UofT in 2021, advised by Richard Zemel and Raquel Urtasun. During his PhD, he also worked as a Senior Research Scientist at Uber ATG. He obtained his MPhil from CUHK, advised by Jiaya Jia, and his BEng from Beihang. His research focuses on geometric deep learning, deep generative models, and their intersections with computer vision and self-driving.

KV Cache Reduction through Key-Token Identification for Efficient Generative Inference

Speaker: Prof. Prashant Nair, University of British Columbia

Abstract: Transformers have emerged as the standard architecture for Large Language Models (LLMs). In generative language models, the inference process involves two main phases: prompt processing and token generation. Token generation, which constitutes most of the computational load, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) cache. This phase is memory-bandwidth-bound due to the overhead of transferring weights and KV cache values from memory to the computing units, which involves relatively low compute intensity. This memory bottleneck becomes particularly prominent in applications that demand long-context and extensive text generation, both of which are increasingly crucial for LLMs. This talk will showcase a new approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization, termed “Keyformer”. Keyformer capitalizes on the observation that during generative inference, approximately 90% of the attention weight is concentrated on a select subset of tokens, which act as “key” tokens. Keyformer’s key-token identification takes into account the discarded tokens by utilizing a novel score function. By retaining only these “key” tokens in the KV cache, both the KV cache size and memory bandwidth usage are significantly reduced while maintaining the model’s accuracy. We evaluate Keyformer’s effectiveness using three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment covers a range of tasks, with a primary focus on summarization and conversation tasks that involve extended contexts. Keyformer’s KV cache reduction reduces inference latency by 2.1x and boosts token generation throughput by 2.4x, all while preserving the model’s accuracy.
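
The idea of keeping only heavily attended tokens in the KV cache can be sketched as follows. This is a simplified illustration that ranks cached tokens by accumulated softmax attention weight; it is not Keyformer's score function, which the abstract describes only as novel. The names `select_key_tokens` and `prune_kv_cache` and the `budget` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_tokens(attn_logits_per_step, budget):
    """Pick the `budget` cached tokens with the largest accumulated attention.

    attn_logits_per_step: one list of attention logits per decoding step,
    each covering the same cached tokens (an assumption of this sketch).
    Returns the kept token indices in positional order, plus the scores.
    """
    num_tokens = len(attn_logits_per_step[0])
    scores = [0.0] * num_tokens
    for logits in attn_logits_per_step:
        for idx, weight in enumerate(softmax(logits)):
            scores[idx] += weight
    ranked = sorted(range(num_tokens), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget]), scores


def prune_kv_cache(keys, values, keep):
    """Drop every cache entry whose token index is not in `keep`."""
    return [keys[i] for i in keep], [values[i] for i in keep]


if __name__ == "__main__":
    logits = [[0.0, 5.0, 0.0, 4.0], [0.0, 4.0, 0.0, 5.0]]
    keep, scores = select_key_tokens(logits, budget=2)
    kept_share = sum(scores[i] for i in keep) / sum(scores)
    print(keep, round(kept_share, 3))
```

Shrinking the cache this way reduces both its memory footprint and the bytes read per generated token, which is where the bandwidth saving described in the abstract comes from.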

Bio: Prashant Nair is an Assistant Professor at the University of British Columbia (UBC) where he leads the Systems and Architectures (STAR) Lab. He is also an Affiliate Fellow at the Quantum Algorithms Institute. His primary interests are in the areas of Computer Architecture and Systems, Quantum Computing Systems, AI/ML Systems, Memory Systems, Security, and Reliability. Dr. Nair has published over 30 papers in top-tier venues such as ISCA, MICRO, HPCA, ASPLOS, DSN, SC, NeurIPS, and VLDB. He has received several recognitions, including the 2024 TCCA Young Architect Award, induction into the MICRO Hall of Fame, the Best Paper Award at HPCA 2023, two Honorable Mentions in IEEE MICRO Top-Picks, and the ECE Graduate Research Assistant Excellence Award at Georgia Tech.

On-Device Multimodal LLM Workloads and Inference Acceleration

Speaker: Prof. Di Niu, University of Alberta

Abstract: This talk will survey emerging Multimodal LLM workloads and how they impact on-device AI acceleration, especially in edge computing. I will first introduce the mainstream neural architectures of current generative AI models, including Multimodal LLMs and Diffusion models, and techniques for model- and graph-level structured pruning, quantization, fine-tuning, and distillation. I will then introduce recent work on software-hardware co-optimization for LLM workloads conducted by industry and academia for major edge AI accelerators, as well as our recent efforts on fused software kernel development for Transformer acceleration at the edge. Finally, I will discuss the implications of recent state space models such as Mamba and Mamba 2 for AI computation optimization.

Bio: Dr. Di Niu is currently a Professor in the Department of Electrical and Computer Engineering at the University of Alberta, Edmonton, Canada, specializing in AI and systems research, including deep learning, distributed systems, and software-hardware co-optimization, with applications to computer vision, NLP, and data science. He received the B.Eng. from Sun Yat-sen University in 2005 and the MSc and PhD degrees from the University of Toronto in 2009 and 2013, respectively. He has coauthored more than 100 research publications in top conferences and journals in computing sciences and engineering. His innovations have contributed widely to cloud and edge computing, on-device AI acceleration, and federated learning in production environments.

Every Bit Matters: A Hardware/Software Approach for Enabling More Powerful Machine Learning Models

Speaker: Prof. Andreas Moshovos, University of Toronto and Vector Institute

Abstract: Neural network training and inference are often limited by the time and energy required for tensor memory transfers. To enhance efficiency, both academia and industry have focused on developing more efficient data types, such as narrow fixed-point or floating-point formats. Our work aims to automate the selection of optimal data representations for the numerous tensors in modern neural networks, reducing reliance on manual trial-and-error methods.

We will give an overview of hardware and software solutions that automatically select efficient data types for tensors, improving both training and inference. During training, these methods leverage either the training process itself or an automatic try-and-adjust approach to optimize data types, reducing memory transfer bit widths and thus enhancing energy efficiency and processing speed. Additionally, an optional hardware unit can further compress data by adjusting bit storage based on actual tensor values. Our approach also benefits inference by applying the learned efficient data types, with optional hardware compression providing additional gains. Our findings indicate significant variation in the bit lengths required across layers, with some needing as few as 2 bits per value and others requiring 7 bits or more. We will conclude with a brief comment on how dissemination and funding practices impact academic innovation and their implications for future growth.
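
To make the notion of per-tensor bit-width selection concrete, here is a minimal sketch of a try-and-adjust search. It is an assumption-laden illustration, not the speaker's method: it uses uniform symmetric fixed-point quantization and a maximum-absolute-error tolerance, and the names `quantize`, `min_bits`, `tolerance`, and `max_bits` are hypothetical.

```python
def quantize(values, bits):
    """Round values onto a signed, symmetric `bits`-bit grid and map them back."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    levels = 2 ** (bits - 1) - 1  # largest positive integer code
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def max_error(values, bits):
    """Largest absolute difference introduced by quantizing to `bits` bits."""
    quantized = quantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, quantized))


def min_bits(values, tolerance, max_bits=16):
    """Try widths from narrow to wide; return the first that meets `tolerance`."""
    for bits in range(2, max_bits + 1):
        if max_error(values, bits) <= tolerance:
            return bits
    return max_bits


if __name__ == "__main__":
    layers = {
        "ternary_like": [1.0, -1.0, 0.0],
        "fine_grained": [i / 10 for i in range(11)],
    }
    for name, tensor in layers.items():
        print(name, min_bits(tensor, tolerance=0.01))
```

Running the search separately per tensor yields different widths for different tensors, which mirrors the per-layer variation described in the abstract.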

Bio: Andreas Moshovos, along with his students, has been answering the question, “what is the best possible digital computation structure to solve problem X or to run application Y?”, where “best” is a characteristic (or combination thereof) such as power, cost, complexity, etc. Much of his work has been on high-performance processor and memory system design, and it has influenced commercial designs. Andreas Moshovos received the Ptyhio and a Master’s in Computer Science from the University of Crete in 1990 and 1992 and the PhD degree in Computer Sciences from the University of Wisconsin-Madison in 1998. He has taught Computer Design at Northwestern University, USA (Assistant Professor, 1998-2000), at the Ecole Polytechnique Fédérale de Lausanne, Switzerland (Invited Professor, 2011), and since 2000 at the Electrical and Computer Engineering Department of the University of Toronto, where he is now a professor. Andreas Moshovos served as the Program Chair for the ACM/IEEE International Symposium on Microarchitecture in 2011 and has served on numerous technical program committees in the area of Computer Architecture. He is an Associate Editor for the IEEE Computer Architecture Letters and the Elsevier Journal on Parallel and Distributed Computing.

Panel Discussion and Q&A

Moderator: Prof. Zhenman Fang, Simon Fraser University

Bio: Dr. Zhenman Fang is a Tenure-Track Assistant Professor in the School of Engineering Science (Computer Engineering Option) and an Associate Member in the School of Computing Science, Simon Fraser University, Canada. Zhenman founded and directs the HiAccel lab. Zhenman’s recent research focuses on customizable computing with software-defined hardware acceleration, which aims to sustain the ever-increasing performance, energy-efficiency, and reliability demands of important application domains in the post-Moore’s law era. It spans the entire computing stack, including emerging application characterization and acceleration (including machine learning, big data analytics, computational genomics, and high-performance computing), novel accelerator-rich and near-data computing architecture designs, and corresponding programming, runtime, and tool support. Zhenman has published over 60 papers in top conferences and journals and holds two US patents. His publications include three best paper awards (FPL 2024 Stamatis Vassiliadis Best Paper, TCAD 2019 Donald O. Pederson Best Paper, and MEMSYS 2017 Best Paper), two best paper nominations (HPCA 2017 and ISPASS 2018), a top paper that received the highest review score among all FPGA 2024 submissions, a top paper highlighted in the FPGA 2021 Special Issue in ACM TRETS, and an invited paper in Proceedings of the IEEE 2019. His research has also been recognized with an NSERC (Natural Sciences and Engineering Research Council of Canada) Alliance Award (2020), a CFI JELF (Canada Foundation for Innovation John R. Evans Leaders Fund) Award (2019), a Xilinx University Program Award (2019), a Team Award from Xilinx Software and IP Group (2018), and a Postdoc Fellowship from UCLA Institute for Digital Research and Education (2016-2017). Zhenman has also actively served on the organizing and program committees of top-tier conferences in the areas of FPGA and reconfigurable computing, design automation, and computer architecture. Recently, he has served as General Chair of ASAP 2025, Program Chair of RAW 2024 and RAW 2025, and Program Co-Chair of PacRim 2024.