tinyAI Forum on Generative AI on the Edge

Join the GenAI Revolution & Amplify its Impact on the Edge!

March 27-28, 2024

About tinyAI Forum on Generative AI on the Edge

tinyML Foundation Presents:
A tinyAI Virtual Forum on Generative AI and Foundation Models on the Edge

Dates: March 27-28, 2024
Time: 7-10am US Pacific Time
Format: LIVE and interactive virtual event (no pre-recorded sessions)

Foundation models and generative AI have enabled a new generation of cloud AI applications. Meanwhile, embedded machine learning and tinyML are bringing AI to the edge of the network. As hardware and algorithms become more efficient, these fields are starting to collide.

For the first time ever, this Virtual Forum brings together experts on edge AI, foundation models, and generative AI, bridging connections that will help us chart a new frontier for AI. How can we deploy foundation models to the edge? And what becomes possible when we do?

Foundation models are general purpose deep learning models, trained on huge, unlabelled datasets, that can be applied to many tasks. Generative AI models are those designed to produce realistic data: from writing and images to speech, audio, and signals. Edge AI is the deployment of AI to physical devices, where digital hardware meets the real world.

This Forum aims to review the state-of-the-art at the intersection of these fast-evolving fields. It will showcase the potentially transformative impacts that foundation models can bring to the edge—including in hardware, software, tooling, applications, and AI design methodologies.





Two broad themes will be discussed:

  1. The current status and future outlook for foundation models and generative AI on resource-constrained embedded devices, and their role within the edge-to-cloud continuum.

  2. The foundation model and generative AI tools, methodologies and techniques that can assist tinyML and edge AI developers and designers, including EDA tools, assisted labeling, and synthetic data.

A key goal for the Forum is to be generative itself, inspiring the AI community towards collaborative work!



7:00 am to 7:05 am


Session Moderator: Davis SAWYER, Co-founder, Deeplite

7:05 am to 7:40 am

Visual Language Models for Edge AI 2.0

Song HAN, Associate Professor, MIT EECS

Abstract (English)
This talk presents edge AI innovations across the full stack. I’ll first present VILA (CVPR’24), a visual language model with multi-image reasoning and in-context learning capabilities; with strong zero-shot learning capabilities, VILA 2.7B is deployable on a Jetson Orin Nano. I’ll then cover AWQ (MLSys’24), a 4-bit LLM quantization algorithm that boosts model efficiency, and TinyChat, an inference library that powers visual language model inference. Together, VILA, AWQ, and TinyChat enable advanced visual reasoning on the edge and bring new opportunities for edge AI applications.
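Since AWQ centers on 4-bit weight quantization, a minimal sketch of plain group-wise 4-bit quantization may help fix intuitions. This is not the actual AWQ algorithm (AWQ additionally rescales salient weight channels using activation statistics before quantizing), and the names and group size below are illustrative:

```python
# Illustrative group-wise 4-bit quantization (NOT the actual AWQ
# algorithm: AWQ additionally rescales salient weight channels using
# activation statistics before quantizing).

def quantize_4bit(weights, group_size=4):
    """Map floats to signed 4-bit ints (-8..7), one scale per group."""
    q, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid div-by-zero
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_4bit(q, scales, group_size=4):
    """Reconstruct approximate floats from 4-bit ints and group scales."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]

w = [0.12, -0.5, 0.33, 0.07, 1.4, -0.9, 0.2, 0.05]
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)  # per-element error bounded by scale/2
```

Smaller groups cost more scale storage but tighten the per-group error bound (half the scale), which is the basic trade-off 4-bit schemes tune.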

7:40 am to 8:00 am

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Zechun LIU, Research Scientist, Meta Reality Labs

Abstract (English)
Increasing cloud costs and latency concerns have sparked a growing need to deploy efficient large language models (LLMs) on mobile devices. We design top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to the prevailing belief that data and parameter quantity are pivotal in determining model quality, our investigation underscores the significance of model architecture for sub-billion-scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models on commonsense zero-shot reasoning tasks. Moreover, the MobileLLM model family shows significant improvements over previous sub-billion models on chat benchmarks and demonstrates correctness close to LLaMA-v2 7B in API-calling tasks, highlighting the capability of small models for common on-device use cases.
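One of the architecture levers named above, grouped-query attention, can be motivated with back-of-envelope arithmetic: query heads share a smaller set of key/value heads, shrinking the KV cache that dominates on-device memory at long sequence lengths. The head counts and sequence length below are illustrative, not MobileLLM's actual configuration:

```python
# Back-of-envelope KV-cache sizing for multi-head vs grouped-query
# attention. All figures are illustrative, not MobileLLM's.

def kv_cache_bytes(n_kv_heads, head_dim, seq_len, n_layers, bytes_per_val=2):
    # Keys + values (factor 2), per layer, fp16 values (2 bytes each).
    return 2 * n_kv_heads * head_dim * seq_len * n_layers * bytes_per_val

mha = kv_cache_bytes(n_kv_heads=16, head_dim=64, seq_len=1024, n_layers=24)
gqa = kv_cache_bytes(n_kv_heads=4,  head_dim=64, seq_len=1024, n_layers=24)
ratio = mha / gqa  # 4x fewer KV heads -> 4x smaller KV cache
```

The attention quality cost of sharing KV heads is small in practice, which is why GQA is attractive at sub-billion scale.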

8:00 am to 8:20 am

GenAI and the Transformation of Edge Computing: Leveraging Heterogeneous Circuits for Innovation

Jose Angel Miranda CALERO, Post-doc research assistant, Embedded Systems Laboratory, EPFL

Abstract (English)

The integration of Generative AI (GenAI) into edge devices is pushing the boundaries of technology, from enhancing smartphone photography in real time to enabling immediate voice interactions in smart home systems. This presentation will explore the forefront of GenAI applications and their deployment in embedded systems. The challenge of incorporating GenAI into these systems is significant, as it involves local processing and generation of complex data within the constraints of limited computational resources and energy efficiency. To overcome these challenges, the development of specialized accelerators and the adoption of open-source frameworks are critical. A notable solution in this space involves the use of advanced, heterogeneous integrated circuits, which facilitate the rapid development of flexible and reconfigurable solutions. Such systems, characterized by their versatile processing units, offer the necessary balance between adaptability and computational power essential for GenAI tasks. Illustrative examples, including wearable health monitors providing instant diagnostic feedback, demonstrate the practicality of embedding GenAI in edge devices with appropriate strategies. This approach, focusing on the optimization of these sophisticated integrated circuits, outlines a method to navigate beyond the limitations of existing systems, thereby expanding the potential of GenAI in our daily lives.

8:20 am to 8:40 am

Generative AI on the Edge for Connected Vehicles and Mobility

Alok RANJAN, Research Architect, Bosch

Abstract (English)


Generative Artificial Intelligence (GenAI) is disrupting the technology landscape of several industries, and many research groups have picked up this fast-evolving trend, demonstrating GenAI's capability to unlock new value. From text to images, audio, video, new content and design generation, and synthetic data, it has been extensively explored by early adopters. Although solutions like ChatGPT have become household names, GenAI adoption in industry beyond custom chatbots and automation is still in progress, owing to open questions around security and privacy.

Most recently, the automotive industry has been undergoing a transformation in the direction of PACE (Personalized, Autonomous, Connected, and Electrified), and AI-based services are further fueling these technical advancements. From passenger vehicles to commercial vehicles, more advanced sensors now generate high-volume data that is leveraged to offer connected services. In-vehicle connected features and personalization continue to advance, and new feature integrations are in high demand. With the advent of GenAI and its capabilities, it is now feasible to offer hyper-personalization features that take the customer experience to the next level.

Thrust areas:

Although recent advancements in the GenAI domain have been widely discussed by the community, it is worth noting that the majority of applications, from training and retraining to fine-tuning, remain largely cloud native. Traditional cloud-based architectures are limited by network bandwidth, data privacy, latency, and similar constraints, which could be addressed using technologies like Edge Artificial Intelligence (Edge AI) and TinyML. Edge AI has been realized in connected-vehicle use cases such as object detection and classification, prognosis, and gesture recognition with control, among others. GenAI on the edge, within vehicular systems or connected vehicles, has yet to become mainstream in the automotive industry, because current architectures, optimization strategies, and fine-tuning on custom business data still need research advancements from the edge-ecosystem perspective.

In this presentation, we will discuss best practices and some of the most recent hybrid edge-cloud architectures that have been realized to bring GenAI to the edge for connected vehicles and mobility. In particular, we will first present specific use cases and how GenAI helps offer hyper-personalization services, then discuss the advantages and benefits from an edge-ecosystem viewpoint. We will also present future research directions where the community could help advance the vehicular ecosystem, covering topics such as specialized hardware architectures, optimization strategies, and privacy-preserving techniques like edge federated learning in hybrid architectures.

ViT@Edge: Distilled Vision Transformer-based Foundation Model for Efficient Edge Deployment

Hasib-Al RASHID, Ph.D. Student, University of Maryland

Abstract (English)


The rise of large-scale foundational models built on transformer architectures has revolutionized AI capabilities across image recognition (Vision Transformers, ViTs [1]) and natural language processing (e.g., ChatGPT [2]). While these models demonstrate remarkable performance, their massive size and computational requirements present a fundamental obstacle to deployment on resource-constrained edge devices. For instance, ViT-Base [1] contains 86 million parameters, resulting in a 344 MB model, far too large for embedded systems. Our goal is to develop innovative compression techniques that drastically reduce the footprint of foundational transformer models, enabling their widespread adoption in edge and tinyML applications without compromising their breakthrough capabilities.


ViT@Edge proposes a novel solution that leverages the strengths of Transformer models within edge computing environments. Transformers, known for their foundational role in domains such as NLP, computer vision, and multimodal learning, offer superior capabilities in modeling long-range dependencies. However, their complexity and the quadratic computational demand of their self-attention mechanism present significant challenges for real-world, industrial deployment, particularly in resource-constrained settings. On the other hand, while Convolutional Neural Networks (CNNs) are celebrated for their efficiency and practicality, especially in industrial applications due to their translation-equivariance bias, they fall short in capturing global information. This is where ViT@Edge comes into play, merging the global information-processing power of Transformers with the efficiency and practical deployment capabilities of CNNs. Given the distinct advantages and limitations of Vision Transformer models and CNNs, exploring Knowledge Distillation (KD) between these two diverse architectures emerges as a fascinating area of study [3]–[5]. KD provides a pathway for distilling knowledge from large CNN models to ViT (Vision Transformer) models, obviating the necessity for extensive labeled datasets to supplement the inductive bias of the latter. This solution aims to address the computational hurdles of traditional Transformer models, making them viable for edge computing applications without sacrificing the comprehensive data understanding that Transformers provide. This approach not only optimizes computational resources but also maintains the adaptability and performance excellence of foundational models across applications.


We have shown in [6] that, with vanilla knowledge distillation combined with uniform 8-bit quantization, we achieved a 296× memory reduction for a CNN-based multimodal pose classification task. Our real-time deployment on a Raspberry Pi 4B achieves a power efficiency of 303.93 GOP/s/W. With the proposed ViT@Edge, we expect similar or greater memory compression and power efficiency when deploying to real-time edge processors and devices.


Our approach enhances CNNs by infusing them with global insights from vision transformers through representation-level distillation. This method surpasses traditional logit-based distillation, tapping into the deeper, interdependent knowledge within transformer representations. It is particularly effective for models trained via self-supervised methods, offering a versatile and task-agnostic solution. This task aims to:

• Conduct a comprehensive study of KD between Transformers and CNNs to leverage the strengths of both architectures.

• Develop methodologies for effective knowledge transfer from vision-transformer models to CNNs, focusing on both logits and representation levels.

• Explore and evaluate the impact of transferring global information from vision transformers to CNNs on various applications.

The exploration of Knowledge Distillation between Transformer models and Convolutional Neural Networks serves as a bridge to amalgamate the strengths of both architectures, paving the way for innovative solutions in various domains.
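As a generic illustration of the two distillation signals discussed (logit-level and representation-level), the sketch below uses plain Python lists; the function names and temperature are illustrative, not the authors' implementation, and apply regardless of which architecture is teacher or student:

```python
# Two standard knowledge-distillation losses, in plain Python:
# a logit-level loss (KL divergence on temperature-softened
# distributions) and a representation-level loss (MSE between
# intermediate feature vectors). Toy sketch, not the ViT@Edge code.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T

def representation_distillation_loss(teacher_feats, student_feats):
    """Mean squared error between teacher and student feature vectors."""
    return sum((t - s) ** 2
               for t, s in zip(teacher_feats, student_feats)) / len(teacher_feats)

# A training step would typically minimize a weighted sum of both.
loss = (logit_distillation_loss([2.0, 0.5, -1.0], [1.5, 0.8, -0.5])
        + representation_distillation_loss([0.3, -0.2], [0.1, 0.0]))
```

Representation-level matching is what lets the student absorb structure beyond the final class probabilities, which is the point the abstract makes against purely logit-based distillation.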


8:40 am to 9:00 am

Solve edge AI problems with foundation models

Daniel SITUNAYAKE, Founding tinyML Engineer, Edge Impulse

Abstract (English)

Generative AI has created a “moment” in technology: suddenly, everyone is aware of what machine learning can do. This talk goes beyond the hype to reveal the four key capabilities of foundation models, and how we can use them to solve real engineering problems at the edge—even when the models seem too big to fit.

9:00 am to 9:20 am

Running an LLM on a Raspberry Pi

Pete WARDEN, CEO, Useful Sensors

Abstract (English)

Large language models like ChatGPT are a great fit for edge devices, but it can be hard to figure out how to get started running them outside of the cloud. In this talk Pete will cover the basics of deploying an open source LLM on a Raspberry Pi 5, explaining some of the options and tradeoffs involved, and discuss the use cases that this sort of system can support. He’ll also discuss what customer problems these models can help with, and how to ensure the inevitable hallucinations don’t pose a safety risk!

9:20 am to 9:40 am

Using Generative Models to Improve Generative Models

Yubei CHEN, Assistant Professor, UC Davis

Abstract (English)
Training data plays a pivotal role in defining a model’s knowledge base. To effectively scale AI, we require a versatile “data simulator” capable of producing knowledge specific to any task. As we progress towards creating an all-encompassing world model, an intermediate step involves the strategic combination of various generative models. Each of these models replicates different aspects of reality, and together they can generate versatile data in concert. In this presentation, we will showcase a method for integrating multiple generative models, utilizing their basic strengths to generate sophisticated data. This technique not only refines the generative models themselves but also offers a solution to the issue of data scarcity in numerous edge AI applications. This marks an important step towards the future of data-centric efficient AI.

9:40 am to 10:00 am

Optimizing Large Language Model (LLM) Inference for Arm CPUs

Dibakar GOPE, Principal Engineer, Machine Learning & AI, Arm

Abstract (English)
Large language models (LLMs) have transformed how we think about language understanding and generation, enthralling researchers and developers. Facilitating the efficient execution of LLMs on commodity Arm CPUs will expand their reach to billions of compact devices, such as smartphones and edge devices. In this talk, we will present a set of optimizations to accelerate LLM inference on Arm CPUs. These optimizations span matrix multiplication with low numerical precision and compression techniques to reduce memory traffic and improve overall model performance. In particular, we will discuss how developers can employ the SDOT and SMMLA instructions available on Arm CPUs in conjunction with 4-bit quantization schemas to bring efficient LLMs everywhere, from smartphones to edge devices.
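To make the SDOT/4-bit pairing concrete, here is a Python emulation of the arithmetic involved: signed 4-bit weights are unpacked to 8-bit, then 8-bit products are summed into a wide accumulator, which is what SDOT does per 32-bit lane in hardware. The packing convention (low nibble first) is an assumption for illustration:

```python
# Emulation of the arithmetic behind Arm's SDOT instruction paired
# with 4-bit weight storage. Packing convention (low nibble first)
# is an illustrative assumption, not a specification.

def unpack_int4(packed):
    """Unpack bytes holding two signed 4-bit values each (low nibble first)."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            out.append(nib - 16 if nib >= 8 else nib)  # sign-extend 4 bits
    return out

def sdot_like(acc, a_int8, b_int8):
    """Sum of 8-bit products accumulated into a wide lane, as SDOT does."""
    return acc + sum(x * y for x, y in zip(a_int8, b_int8))

packed_weights = bytes([0x21, 0xF7])   # nibbles: 1, 2, 7, -1
activations = [10, -3, 5, 2]           # int8 activations
w = unpack_int4(packed_weights)        # -> [1, 2, 7, -1]
acc = sdot_like(0, activations, w)     # 10 - 6 + 35 - 2 = 37
```

The hardware instruction does this for many lanes per cycle; the win of 4-bit storage is that twice as many weights fit in each vector register load before widening.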

10:00 am to 10:20 am

Toward a Foundation Model for Efficient Damage Assessment Following Natural Disasters

Maryam RAHNEMOONFAR, Associate Professor of Computer Science and Engineering, Lehigh University

Abstract (English)

Natural disasters caused by climate change are becoming more frequent and severe. These disasters pose a threat to human health, infrastructure, and natural systems. In order to respond and recover quickly and effectively after a natural disaster such as a hurricane, wildfire, or flooding, access to aerial images is crucial for the response team. Small Unmanned Aerial Vehicles (UAVs) with cost-effective sensors are a great solution for collecting thousands of images with high flexibility and easy maneuverability for rapid response and recovery. Furthermore, UAVs can access hard-to-reach areas and perform data-gathering tasks that are unsafe or impossible for humans. Combining multiple data modalities such as vision, language, and radar data is a promising technique for damage assessment. However, applying deep learning methods to radar time-series images is challenging due to the lack of sufficient training data compared to optical images. Moreover, optical data can be difficult to use due to its limitations across weather conditions. Traditional analyses provide some insights into the data, but the complexity, scale, and multimodal nature of the data require advanced, intelligent solutions. In this presentation, I will discuss some of our current innovative solutions, such as generative models for multimodal imagery and explainable, interactive models for multimodal vision and language perception. Our goal is to provide an accurate damage assessment with multimodal data after a natural disaster and facilitate rapid response and recovery. I will also discuss our current efforts toward developing a foundation model that can be transferred to any robot-based multimodal downstream task with very little labeled data.

10:20 am to 10:30 am

The Voice of AI: Harnessing Voice Metadata for GenAI Solutions

Rick RADIA, Head of Product, MyVoice AI

Tom BARKER, Senior Machine Learning Engineer, MyVoice AI

Abstract (English)
The human voice remains at the forefront of communications and is now leading the way in GenAI and conversational AI inputs, especially for the training of Large Language Models and, increasingly, Small Language Models. The human voice carries a rich complement of biometric, unclonable identity markers, emotive states, age, and other human attributes. Foundational and GenAI models are leveraging this aspect of humanity and are increasingly being applied in GenAI use cases.
During this session, we delve into the process of extracting metadata from one’s voice and leveraging it to enhance user experiences, all on-device through the advancement of tinyML technology. We will discuss various techniques for compressing large models to fit resource-constrained devices, enabling efficient inference at the edge. By employing methodologies such as structured pruning and quantization, we will showcase substantial neural-network size reductions, scaling down from trillions of parameters (as exploited by the new Blackwell processor) to hundreds of thousands, all while maintaining optimal model performance.
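Structured pruning, one of the compression methods mentioned, can be sketched in a few lines: remove whole rows (neurons/channels) of a weight matrix with the smallest L1 norm, so the pruned model is genuinely smaller rather than merely sparse. The keep ratio and matrix below are toy values, not MyVoice AI's method:

```python
# Toy structured pruning: drop entire rows of a weight matrix whose
# L1 norm is smallest. Unlike unstructured (per-weight) pruning, the
# result is a physically smaller matrix that needs no sparse kernels.

def prune_rows(matrix, keep_ratio=0.5):
    """Keep the keep_ratio fraction of rows with the largest L1 norm."""
    norms = [(sum(abs(w) for w in row), i) for i, row in enumerate(matrix)]
    keep = max(1, int(len(matrix) * keep_ratio))
    kept = sorted(i for _, i in sorted(norms, reverse=True)[:keep])
    return [matrix[i] for i in kept]

W = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [0.03, -0.01]]
W_pruned = prune_rows(W, keep_ratio=0.5)  # keeps the two strongest rows
```

In practice pruning is followed by fine-tuning to recover accuracy, and is then stacked with quantization for the multiplicative size reductions the session describes.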


7:00 am to 7:05 am

Welcome + Recap of Day 1

Session Moderator: Davis SAWYER, Co-founder, Deeplite

7:05 am to 7:50 am

Panel: "GenAI is accelerating the Edge"


Pete BERNARD, EDGECELSIOR

Max PETRENKO, AWS

Rajeev MURALIDHAR, AWS

Sally WARD-FOXTON, EE Times

7:50 am to 8:10 am

On-Device Generative AI

Fatih PORIKLI, Sr. Director, Technology, Qualcomm Technologies, Inc.

Abstract (English)

Generative AI emerges as a transformative force, capable of creating new content such as text, code, images, video, audio, or other data, while handling complex dialogues and reasoning about problems. This disruptive technology is reshaping traditional approaches across various domains, from search algorithms and content creation to automation and problem solving, and redefining the user interface to computing devices. Its impact transcends industries, promising substantial advancements in utility, productivity, and efficiency. As the adoption of generative AI accelerates at an unprecedented pace, driving its computational demands to surge, on-device processing becomes more important than ever.

In this presentation, you’ll learn about:

  • The pivotal role of on-device AI deployment
  • Full-stack AI optimizations that make on-device AI possible and efficient
  • Advanced techniques like distillation, quantization, and speculative decoding
  • How we deploy generative AI models on device, with examples including LVMs, LLMs, and LMMs
  • Qualcomm Technologies’ role in scaling on-device generative AI
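Speculative decoding, listed among the techniques above, can be illustrated with a greedy toy version: a cheap draft model proposes a few tokens and the large target model verifies them in one pass. Everything here is a simplified stand-in (real implementations accept or reject probabilistically against the target's distribution rather than by exact match):

```python
# Greedy toy version of speculative decoding. Both "models" are
# stand-in functions mapping a context to the next token; real
# schemes verify against the target's probabilities, not equality.

def speculative_step(prompt, draft_next, target_next, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prompt)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. Target model checks each proposed token; keep the agreeing
    #    prefix, then emit the target's own token at the first mismatch.
    accepted, ctx = [], list(prompt)
    for t in proposed:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))
            break
    return accepted

# Toy character models: draft guesses "abab...", target wants "abc...".
draft = lambda ctx: "ab"[len(ctx) % 2]
target = lambda ctx: "abc"[len(ctx) % 3]
out = speculative_step([], draft, target, k=4)  # ['a', 'b', 'c']
```

The payoff on-device: several tokens can be committed per expensive target-model pass, so latency drops whenever the draft model is usually right.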

8:10 am to 8:30 am

Architecture 2.0: Prompt Engineering and Foundation Models for Edge AI Hardware Design

Vijay Janapa REDDI, Associate Professor, Harvard University

Abstract (English)
TinyML enables AI capabilities on resource-constrained edge devices, but designing specialized hardware for TinyML remains challenging due to high costs, long development cycles, and error risks. The small margins in the TinyML market demand a visionary approach to hardware design. This forward-looking talk introduces Architecture 2.0, a shift toward leveraging foundation models and generative AI to rethink edge AI hardware design. Central to this vision is prompt engineering, adapting foundation models to edge devices’ unique requirements and constraints. Prompt engineering techniques can assist designers in navigating performance, power, and area trade-offs, ultimately leading to accelerated hardware design, reduced costs, and minimized error risks. The talk explores a few examples of prompt engineering applied to generate TinyML hardware designs. But this is only the beginning; thus, the talk emphasizes the critical role of collaboration between TinyML and foundation model communities. Fostering cross-disciplinary partnerships will drive the development of novel prompting techniques, model architectures, and generative AI tools tailored for hardware design, unlocking a new era of efficient and accessible edge AI solutions that promise to drive innovation. This talk aims to inspire the community to collaborate and craft a shared vision, exploring the transformative potential of foundation models and generative AI in TinyML hardware design, and work together towards realizing Architecture 2.0.

8:30 am to 8:50 am

Empowering Collaborative Chip Design: Leveraging Generative AI for Custom Accelerators and Edge AI Innovation

Mohamed KASSEM, CTO and Co-Founder, Efabless

Abstract (English)

In this presentation, we delve into how an open-source chip design SoC platform stimulates collaborative chip design, leverages generative AI to optimize custom chip designs and develop tailored accelerators for small machine learning models.

Through collaborative community efforts, we explore the transformative impact of generative AI and custom accelerators, while addressing challenges and future directions in deploying these technologies at the edge.

Join us to discover the convergence of generative AI, custom chip design, and edge AI innovation, and explore new horizons in AI-driven hardware design.

8:50 am to 9:10 am

A technology game changer: How GenAI Will Reshape Learning

Gian Marco IODICE, Team and Tech Lead in the Machine Learning Group, Arm

Milja LAAKSO, Programme Specialist, UNICEF Learning Innovation Hub

Abstract (English)
How can the tech industry come together and harness the rapidly evolving fields of AI, including GenAI, to benefit millions of children around the world who are not learning? As of 2022, 70% of 10-year-olds in low- and middle-income countries were unable to read and understand a simple statement by the end of primary school.
The promise of frontier technologies as a catalyst to help children learn is no longer a “nice to have” but an urgent need for reaching ALL children everywhere. Yet as the world has been captivated by the rapid advancements in GenAI, millions of children still miss out on even the benefits of the internet due to lack of electricity, connectivity, and access to devices.
This session will explore how technologies combined with AI can be a game changer that could finally bridge the digital divide and help address the learning crisis.
This session is a collaborative effort fueled by the innovative partnership between Arm and the UNICEF Global Learning Innovation Hub–Office of Innovation to inspire the tech industry. Together, Arm and UNICEF aim to catalyze change on the global education stage, setting the foundation for alternative futures where every child embarks on a fascinating adventure of learning.
Don’t miss the chance to join this impactful session shaping the narrative of education for generations to come.

9:10 am to 9:30 am

Inspired by ‘Her’: AI interaction models we’d like to see in the world

Savannah KUNOVSKY, Managing Director, IDEO

Abstract (English)

When humans design paradigm-shifting technologies, they often create interaction models similar to the technologies that came before. Take a look at ChatGPT or other generative AI systems, and you’ll notice they look a lot like a Google search bar. We’re still in the awkward, early phases of AI, where the technology is so new that mainstream interaction models align with what we’re already familiar with—even if they don’t make the most of the new technology. In light of this, we’ve envisioned alternative AI interaction models, drawing inspiration from the film “Her” and the principles of calm technology to propose new ideas for AI interfaces.

9:30 am to 10:00 am

From Cloud to Edge: Rethinking Generative AI for Low-Resource Design Challenges

Sai Krishna Revanth VARUMA, Doctoral Candidate in Computer Science, University of South Carolina

Abstract (English)

Generative Artificial Intelligence (AI) has shown tremendous prospects in all aspects of technology, including design. However, due to its heavy demand on resources, it is usually trained on large computing infrastructure and often made available as a cloud-based service. In this position paper, we consider the potential, challenges, and promising approaches for generative AI for design on the edge, i.e., in resource-constrained settings where memory, compute, energy (battery) and network connectivity may be limited. Adapting generative AI for such settings involves overcoming significant hurdles, primarily in how to streamline complex models to function efficiently in low-resource environments. This necessitates innovative approaches in model compression, efficient algorithmic design, and perhaps even leveraging edge computing. The objective is to harness the power of generative AI in creating bespoke solutions for design problems, such as medical interventions, farm equipment maintenance, and educational material design, tailored to the unique constraints and needs of remote areas. These efforts could democratize access to advanced technology and foster sustainable development, ensuring universal accessibility and environmental consideration of AI-driven design benefits.

AI/ML for Embodied Systems at the Edge: Generative Models, LLMs and Beyond

Andreas ANDREOU, Professor, Johns Hopkins University

Abstract (English)


Operation and decision making at the edge for embodied applications such as autonomous robotics requires real-time local processing with extreme energy efficiency and low latency. Hardware must perform insightful information extraction from sensing signals to symbols and distill knowledge into the models necessary for reasoning and inference to generate action signals, with computations done by COTS or specialized hardware at the edge. A canonical model of processing in embodied systems is shown in the diagram on the right.


The Role of Generative AI in Robust Reasoning:

Generative AI such as Large Language Models (LLMs) can play a key role in real-time learning and contextual modeling so that the machine can subsequently perform robust reasoning and produce desired behavior (actions). Despite the recent explosion of advances in generative AI for text (ChatGPT) and images (DALL-E, Midjourney), the computational structures “under the hood” can have a broad impact. For example, graphical models such as the diffusion models in DALL-E, or Deep Belief Networks (DBNs), are generative in nature, consisting of multiple layers of nodes connected as Markov random fields where sampling plays a central role. These are computationally intensive, necessitating micro-architectural components available on custom hardware such as the SpiNNaker SoC Arm M4 chip multiprocessor or FPGAs. Action recognition without a camera at the edge requires recognizing actions from low-dimensional data (time series of micro-Doppler acoustic signatures), trained on signatures from high-dimensional signals that can be generated from Kinect 3D point-cloud data or a physical model (knowledge). Continuous life-long learning to keep that knowledge up to date necessitates generative AI and sampling in Conditional Restricted Boltzmann Machines (CRBMs) or Conditional Deep Belief Networks (CDBNs). An example is shown on the side, where a micro-Doppler time series of an action (top) is used to seed a CDBN and the CDBN “hallucinates” the time series of the response (bottom).


Generative AI for next Generation AI machines:

The natural-language prompting of ChatGPT can also be employed to produce synthesizable Verilog for chip design. We have employed ChatGPT-4 for natural-language-driven hardware design. The AI-generated design is a synthesizable and functional Verilog description of an entire programmable spiking neuron array, including an SPI interface, for neurocomputing at the edge. It was verified in simulation using handcrafted testbenches and is currently being fabricated in SkyWater 130 nm CMOS technology through Tiny Tapeout 5 using an open-source EDA flow.

LLM Pipelines: Seamless Integration on Embedded Devices

Enzo RUEDAS, AI Engineering Student, NXP Semiconductors

Abstract (English)

Large Language Models (LLMs) and broader Generative Artificial Intelligence have gained increasing prominence in the AI landscape. Various initiatives, including Hugging Face and libraries such as GGML, have played a crucial role in facilitating the accessibility and development of LLMs. Nevertheless, deploying such models on embedded devices remains extremely challenging, given the inherent constraints of computational power and memory. NXP’s LLM Pipelines project aims to enhance user experience with LLMs on embedded devices, facilitating more accessible deployment and improving human-machine interactions.

This presentation details our solutions to improve LLM porting through quantization and fine-tuning. In particular, our experiments focus on high-end NXP MPUs, such as:

– i.MX 8M Plus featuring a 4x Arm Cortex-A53 Processor and a Neural Processing Unit (NPU)

– i.MX 93 featuring a 2x Arm Cortex-A55 Processor and an NPU

– i.MX 95 featuring a 6x Arm Cortex-A55 Processor and an NPU

When deploying AI models in resource-constrained environments, machine-learning quantization techniques offer several significant benefits, including reductions in model size and memory footprint as well as faster execution. However, most integer quantization techniques can result in significant accuracy drops, especially in auto-regressive models. The LLM Pipelines project features advanced quantization algorithms, encompassing model compression, dynamic quantization, and the latest post-training static quantization techniques. Our presentation will compare these different approaches.

Most use cases for embedded LLMs also necessitate specialization, either to limit computational costs and usage or to mitigate hallucinations and biases. For example, a car assistant should focus on assisting the driver with vehicle-related tasks, avoiding unrelated topics like politics. Using Retrieval-Augmented Generation (RAG), we explore various fine-tuning scenarios for the smart assistant, utilizing user-manual knowledge or even interacting with machine sensors. This presentation will address different RAG-related challenges, including constraining the input prompt to meet hardware requirements and handling out-of-topic queries.
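A toy end-to-end sketch of the RAG flow just described, covering retrieval, prompt budgeting, and an out-of-topic fallback, using a naive word-overlap score in place of a real embedding model; all names, thresholds, and passages are illustrative, not NXP's pipeline:

```python
# Toy RAG pipeline: retrieve the most relevant manual passage, build a
# prompt capped to a word budget, and refuse out-of-topic queries.
# Word-overlap scoring stands in for a real embedding retriever.

def score(query, passage):
    """Fraction of query words that appear in the passage (toy metric)."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def build_prompt(query, passages, budget_words=40, threshold=0.3):
    best = max(passages, key=lambda p: score(query, p))
    if score(query, best) < threshold:
        # Out-of-topic fallback: don't feed the LLM unrelated queries.
        return "Sorry, I can only answer questions about this vehicle."
    prompt = f"Context: {best}\nQuestion: {query}\nAnswer:"
    return " ".join(prompt.split()[:budget_words])  # crude budget cap

manual = [
    "To change a flat tire, park on level ground and engage the parking brake.",
    "The infotainment system pairs with phones over Bluetooth.",
]
p = build_prompt("how do I change a flat tire", manual)
```

On embedded targets the word budget matters doubly: it bounds both the prompt-processing latency and the KV-cache memory the retrieved context consumes.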

10:00 am to 10:10 am

Edge of Tomorrow: Unleashing the Power of Small LLMs for Generative AI at the Edge

Mallik P. MOTURI, VP Product and Business Development, Syntiant

Abstract (English)

This talk dives into the innovations crafting small, efficient Large Language Models (LLMs) for edge devices. By distilling the essence of LLMs, we achieve real-time processing, circumventing cloud latency and bandwidth issues. We’ll explore advancements in LLM architectures and the strategic optimizations that enable their miniaturization without sacrificing performance. The focus will be on how these developments not only reduce on-chip memory demands but also pave the way for bespoke hardware, tailored to enhance edge AI’s capabilities. Addressing the economic impact, we’ll discuss the direct correlation between memory footprint and chip cost, and how innovations in this realm are crucial for affordable AI deployment.

Schedule subject to change without notice.





