Research Symposium - March 27, 2023

The tinyML research symposium serves as a flagship venue for related research at the intersection of machine learning applications, algorithms, software, and hardware in deeply embedded machine learning systems.


About

The Symposium will be held in conjunction with the tinyML Summit 2023, the premier annual gathering of senior-level technical experts and decision makers representing the fast-growing global tinyML community.

Tiny machine learning (tinyML) is a fast-growing field of machine learning technologies enabling on-device sensor data analytics at extremely low power, typically in the milliwatt range and below. The tinyML ecosystem is fueled by (i) emerging commercial applications and new systems concepts on the horizon; (ii) significant progress on algorithms, networks, and models down to 100 kB and below; and (iii) current low-power applications in vision and audio that are already becoming mainstream and commercially available. There is growing momentum demonstrated by technical progress and ecosystem development in all of these areas.


Venue

Hyatt Regency San Francisco Airport

1333 Bayshore Highway, Burlingame, CA 94010

Contact us

Bette COOPER

Rosina HABERL

Further committee members are:

Farshad Akbari

Infineon

Shaahin Angizi

New Jersey Institute of Technology

Kartikeya Bhardwaj

Qualcomm

Alessio Burrello

University of Bologna

Francesco Conti

University of Bologna, IT

Federico Corradi

TU Eindhoven, NL

Marco Donato

Tufts

Deliang Fan

Arizona State University

Elisabetta Farella

Fondazione Bruno Kessler, IT

Jeremy Holleman

University of North Carolina at Charlotte

Mohsen Imani

UC Irvine

Hun Seok Kim

University of Michigan

Niall Lyons

Infineon

Michele Magno

ETH Zurich, CH

Guilherme Paim

INESC-ID, Lisbon, PT

Priya Panda

Yale

Abbas Rahimi

IBM Research, CH

Arman Roohi

University of Nebraska

Avik Santra

Infineon

Urmish Thakker

SambaNova Systems

Schedule

7:30 am to 9:00 am

Registration and continental breakfast

9:00 am to 9:15 am

Welcome and Opening Statement

9:15 am to 10:00 am

Keynote

Dr. Alex Jones, NSF CISE Program Manager

10:00 am to 11:00 am

tinyML Applications and Systems

MetaLDC: Meta Learning of Low-Dimensional Computing Classifiers for Fast On-Device Adaption

Yejia LIU, PhD Student, University of California Riverside

Abstract (English)

Fast model updates for unseen tasks on intelligent edge devices are crucial but also challenging due to the limited computational power. In this paper, we propose MetaLDC, which meta-trains brain-inspired ultra-efficient low-dimensional computing classifiers to enable fast adaptation on tiny devices with minimal computational costs. Concretely, during the meta-training stage, MetaLDC meta-trains a representation offline by explicitly taking into account that the final (binary) class layer will be fine-tuned for fast adaptation to unseen tasks on tiny devices; during the meta-testing stage, MetaLDC uses closed-form gradients of the loss function to enable fast adaptation of the class layer. Unlike traditional neural networks, MetaLDC is designed based on the emerging LDC framework to enable ultra-efficient inference. Our experiments demonstrate that, compared to SOTA baselines, MetaLDC achieves better accuracy, robustness against random bit errors, and cost efficiency in hardware computation.
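
A minimal sketch of the closed-form class-layer adaptation idea, assuming a frozen random-projection encoder as a stand-in for the meta-trained representation (the dimensions, the ridge regularizer, and the sign nonlinearity are illustrative assumptions, not the paper's values):

```python
# Illustrative sketch of fast on-device adaptation in the spirit of MetaLDC:
# only the final class layer is fit, in closed form, on a few support samples.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LDC, N_CLASSES = 64, 512, 5   # hypothetical sizes

# Stand-in for the meta-trained feature encoder, frozen at deployment time:
# a fixed random projection followed by a sign nonlinearity.
W_enc = rng.standard_normal((D_IN, D_LDC)) / np.sqrt(D_IN)

def encode(x):
    return np.sign(x @ W_enc)  # binary low-dimensional features

def adapt_class_layer(x_support, y_support, lam=1e-2):
    # Closed-form (ridge regression) fit of the class layer: no iterative
    # backpropagation is needed on the tiny device.
    H = encode(x_support)                 # (n, D_LDC)
    Y = np.eye(N_CLASSES)[y_support]      # one-hot targets
    A = H.T @ H + lam * np.eye(D_LDC)
    return np.linalg.solve(A, H.T @ Y)    # (D_LDC, N_CLASSES)

def predict(x, W_cls):
    return np.argmax(encode(x) @ W_cls, axis=1)

# Few-shot "unseen task": 10 support samples, then inference.
x_sup = rng.standard_normal((10, D_IN))
y_sup = rng.integers(0, N_CLASSES, 10)
W_cls = adapt_class_layer(x_sup, y_sup)
print(predict(x_sup, W_cls))
```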

Memory-Oriented Design-Space Exploration of Edge-AI Hardware for XR Applications

Manan SURI, Leader of the NVM and Neuromorphic Hardware Research Group, Indian Institute of Technology Delhi

Abstract (English)

Low-Power Edge-AI capabilities are essential for on-device extended reality (XR) applications that support the vision of the Metaverse. In this work, we investigate two representative XR workloads, (i) hand detection and (ii) eye segmentation, for hardware design-space exploration. For both applications, we train deep neural networks and analyze the impact of quantization and hardware-specific bottlenecks. Through simulations, we evaluate a CPU and two systolic inference accelerator implementations. Next, we compare these hardware solutions at advanced technology nodes. The impact of integrating state-of-the-art emerging non-volatile memory technology (STT/SOT/VGSOT MRAM) into the XR-AI inference pipeline is evaluated. We find that significant energy benefits (≥24%) can be achieved for hand detection (IPS=10) and eye segmentation (IPS=0.1) by introducing non-volatile memory into the memory hierarchy for designs at the 7nm node while meeting the minimum IPS (inferences per second). Moreover, we can realize a substantial reduction in area (≥30%) owing to the small form factor of MRAM compared to traditional SRAM.


TinyRCE: Forward Learning Under Tiny Constraints

Danilo PAU, Technical Director, IEEE and ST Fellow, STMicroelectronics

Abstract (English)

The challenge posed by on-tiny-device learning targeting ultra-low-power embedded devices has recently attracted several machine learning researchers. A typical model learning session requires real-time streams of data produced by a plethora of heterogeneous sensors. In this context, this paper proposes a forward-only learning approach based on a hyper-spherical classifier designed to be deployed on microcontrollers and sensors. The algorithm drives the learning process using streams of data associated with the categories to be learnt. The classifier is fed with features extracted by a convolutional neural network belonging to the extreme learning machine class, whose weights are randomly initialized; alternatively, it could be trained offline with backpropagation. Its weights are then stored in a tiny read-only memory of 76.45 KiB. The classifier itself requires up to 40.2 KiB of RAM to perform a complete on-device learning workload in 0.216 s on an MCU clocked at 480 MHz. A forget mechanism discards redundant neurons in the classifier's hidden layer if they become useless over time. The classifier has been evaluated with a new interleaved training and testing protocol that mimics a forward on-tiny-device learning workload, using openly available datasets covering human activity monitoring (PAMAP2, SHL) and ball-bearing anomaly detection (CWRU) case studies. Experiments show that the proposed TinyRCE classifier performs competitively against a supervised convolutional topology followed by a SoftMax classifier trained with backpropagation on all three datasets.
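
A toy sketch of a hyper-spherical (RCE-style) forward-only classifier with a forget mechanism, in the spirit of TinyRCE; the radius cap, the shrink factor, the forgetting rule, and the Gaussian-blob input stream are assumptions for illustration, not the paper's code:

```python
# Forward-only hyper-spherical classifier: prototypes with radii are added,
# shrunk on conflicts, and forgotten when they stay useless for too long.
import numpy as np

R_MAX, FORGET_AFTER = 2.0, 100  # assumed hyper-parameters

class TinyRCE:
    def __init__(self):
        self.protos, self.radii, self.labels, self.last_hit = [], [], [], []
        self.step = 0

    def learn(self, x, y):
        self.step += 1
        d = [np.linalg.norm(x - p) for p in self.protos]
        covered = False
        for i, dist in enumerate(d):
            if dist <= self.radii[i]:
                if self.labels[i] == y:
                    covered, self.last_hit[i] = True, self.step
                else:
                    self.radii[i] = dist * 0.9  # shrink conflicting sphere
        if not covered:  # commit a new prototype for this category
            wrong = [dist for i, dist in enumerate(d) if self.labels[i] != y]
            self.protos.append(x.copy())
            self.radii.append(min([R_MAX] + wrong))
            self.labels.append(y)
            self.last_hit.append(self.step)
        # forget: drop hidden neurons that have been useless for too long
        keep = [i for i in range(len(self.protos))
                if self.step - self.last_hit[i] < FORGET_AFTER]
        for attr in ("protos", "radii", "labels", "last_hit"):
            setattr(self, attr, [getattr(self, attr)[i] for i in keep])

    def predict(self, x):
        d = [np.linalg.norm(x - p) for p in self.protos]
        return self.labels[int(np.argmin(d))]  # nearest-prototype decision

rng = np.random.default_rng(0)
clf = TinyRCE()
for _ in range(200):  # two Gaussian blobs stand in for sensor feature streams
    y = int(rng.integers(0, 2))
    clf.learn(rng.normal(3.0 * y, 1.0, size=8), y)
print(len(clf.protos), clf.predict(np.full(8, 3.0)))
```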


11:00 am to 11:30 am

Break

11:30 am to 12:10 pm

Short papers

Session Chair: Arman ROOHI, Assistant Professor, School of Computing, University of Nebraska-Lincoln

Design and analysis of hardware-friendly pruning algorithms to accelerate deep neural networks at the edge

Christin BOSE, PhD Candidate, Purdue University

Abstract (English)

Unstructured pruning is a popular way to achieve state-of-the-art compression of convolutional neural network weights. While unstructured pruning preserves model accuracy well, it is challenging for hardware architectures to exploit unstructured sparsity for speedups and power savings during model inference. Structured pruning, while harder to apply at a given model accuracy, readily translates into inference speedups and power reduction thanks to its hardware-friendly nature. Model pruning removes network weights according to some criterion. Although several structured and unstructured pruning criteria have been proposed in the literature, it is often unclear which criterion offers the best performance in hardware. In this work, we generate the accuracy-sparsity-latency Pareto curve for several state-of-the-art filter pruning criteria and derive insights based on an edge-based DNN accelerator. We also propose combining various granularities of pruning and evaluate the benefits. These insights will be useful for pruning deep learning workloads for inference at the edge subject to a given accuracy and compute budget.
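
To make the two pruning granularities concrete, here is a small sketch contrasting structured (filter-level) and unstructured pruning under an L1-norm criterion; the weight shape and the 50% sparsity target are arbitrary choices for the demo:

```python
# Rank convolution weights by L1 norm and prune at two granularities.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 16, 3, 3))  # (out_ch, in_ch, kH, kW)
SPARSITY = 0.5

# Structured (filter-level) pruning: drop whole output filters with the
# smallest L1 norms -- hardware-friendly, yields a physically smaller layer.
scores = np.abs(W).sum(axis=(1, 2, 3))       # one score per filter
n_drop = int(SPARSITY * W.shape[0])
keep = np.sort(np.argsort(scores)[n_drop:])
W_structured = W[keep]                       # dense, smaller tensor

# Unstructured pruning: zero the smallest individual weights -- usually
# better accuracy at equal sparsity, but needs sparse hardware support.
thresh = np.quantile(np.abs(W), SPARSITY)
W_unstructured = np.where(np.abs(W) >= thresh, W, 0.0)

print(W_structured.shape, float((W_unstructured == 0).mean()))
```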

FMAS: Fast Multi-Objective SuperNet Architecture Search for Semantic Segmentation

Brett MEYER, Associate Professor in the Department of Electrical and Computer Engineering, McGill University

Abstract (English)

We present FMAS, a fast multi-objective neural architecture search framework for semantic segmentation. FMAS subsamples the structure and pre-trained parameters of DeepLabV3+, without fine-tuning, dramatically reducing training time during search. To further reduce candidate evaluation time, we use a subset of the validation dataset during the search. Only the final, Pareto non-dominated, candidates are ultimately fine-tuned using the complete training set. We evaluate FMAS by searching for models that effectively trade accuracy against computational cost on the PASCAL VOC 2012 dataset. FMAS finds competitive designs quickly, e.g., taking just 0.5 GPU-days to discover a DeepLabV3+ variant that reduces FLOPs and parameters by 10% and 20% respectively, for less than 3% increased error. We also search on an edge device called GAP8, using its latency as the metric, and FMAS finds a 2.2× faster network with 7.61% mIoU loss.
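
The Pareto non-dominated filtering step mentioned above can be illustrated in a few lines; the candidate (error, FLOPs) values below are invented placeholders:

```python
# Keep only candidates not dominated on (error, FLOPs); both minimized.
from typing import List, Tuple

def pareto_front(cands: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    front = []
    for c in cands:
        # c is dominated if some other candidate is no worse in both
        # objectives and differs from it
        dominated = any(o[0] <= c[0] and o[1] <= c[1] and o != c
                        for o in cands)
        if not dominated:
            front.append(c)
    return front

# (validation error, relative FLOPs) for hypothetical sub-networks
candidates = [(0.10, 1.00), (0.11, 0.90), (0.13, 0.80), (0.12, 0.95),
              (0.20, 0.85), (0.14, 0.70)]
print(pareto_front(candidates))  # only these would be fully fine-tuned
```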


Classification of Depth-Map Image on AI Microcontroller

Jesse SANTOS, Software Engineer, Analog Devices

Abstract (English)

Depth maps are known to enhance visual tasks. In this paper, we consider image classification using depth maps only. The classification task runs on the MAX78000 artificial intelligence (AI) accelerator, an edge device. We trained a SimpleNet deep learning model using depth-map images generated from a publicly available point-cloud dataset. The training was quantization-aware, targeting the MAX78000. The achieved F1-scores for the 10 classes ranged from 60 to 96 percent. To test the trained model on data it has never seen, we captured depth-map images using a Time-of-Flight (ToF) camera to create a dataset for inference on the AI microcontroller.


SSS3D: Fast Neural Architecture Search For Efficient Three-Dimensional Semantic Segmentation

Olivier THERRIEN, M.Sc. graduate, McGill University

Abstract (English)

We present SSS3D, a fast multi-objective NAS framework designed to find computationally efficient 3D semantic scene segmentation networks. It uses RandLA-Net, an off-the-shelf point-based network, as a super-network to enable weight sharing and reduce search time by 99.67% for single-stage searches. SSS3D has a complex search space composed of sampling and architectural parameters that can form 2.88 × 10^17 possible networks. To further reduce search time, SSS3D splits the complete search space and introduces a two-stage search that finds optimal subnetworks in 54% of the time required by single-stage searches.


Device management and network connectivity as missing elements in TinyML landscape

Marcin NAGY, Product Director, IoT, AVSystem

Abstract (English)

Deploying solutions based on TinyML involves several challenges, including hardware heterogeneity, microcontroller (MCU) architectures, and resource availability constraints. Further challenges are the variety of operating systems for MCUs, limited memory-management implementations, and limited software interoperability between devices. A number of these challenges are solved by dedicated programming libraries and the ability to compile code for specific devices. The challenge discussed in this paper, however, is network connectivity for such solutions. We point out that more emphasis should be placed on standard protocols, interoperability of solutions, and security. Finally, the paper discusses how the LwM2M protocol can solve the identified challenges related to network connectivity and interoperability.


12:10 pm to 1:20 pm

Lunch

1:20 pm to 2:40 pm

tinyML Algorithms

Session Chair: Zain ASGAR, Adjunct Professor Of Computer Science, Stanford University

AugViT: Improving Vision Transformer Training by Marrying Attention and Data Augmentation

Zhongzhi YU, PhD Student, EIC Lab at Georgia Institute of Technology

Abstract (English)

Despite the impressive accuracy of large-scale vision transformers (ViTs) across various tasks, it remains a challenge for small-scale ViTs (e.g., < 1G inference floating-point operations (FLOPs), as in LeViT) to significantly outperform state-of-the-art convolutional neural networks (CNNs) in terms of the accuracy-efficiency trade-off, limiting their wider application, especially on resource-constrained devices. As analyzed in recent works, selecting an effective data augmentation technique can non-trivially improve the accuracy of small-scale ViTs. However, whether existing mainstream data augmentation techniques dedicated to CNNs are optimal for ViTs is still an open question. To this end, we propose a data augmentation framework called AugViT, which incorporates the key component of ViTs, i.e., self-attention, into the data augmentation intensity to enable ViTs' outstanding performance across various devices. Specifically, motivated by ViTs' patch-based processing pipeline, our proposed AugViT integrates (1) a dedicated scheme for mapping the attention map in ViTs to a suggested augmentation intensity for each patch, (2) a simple but effective strategy for selecting the most effective attention map within ViTs to guide the aforementioned attention-aware data augmentation, and (3) a set of patch-level augmentation techniques that match the patch-aware processing pipeline and allow the augmentation intensity to vary across patches. Extensive experiments and ablation studies on two datasets and ten representative ViT models validate AugViT's effectiveness in boosting ViTs' performance, especially for small-scale ViTs, e.g., improving LeViT-128S's accuracy from 76.6% to 77.1%, comparable to EfficientNet-B0 with 21.8% fewer inference FLOPs on the ImageNet dataset.
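
A simplified sketch of mapping a per-patch attention map to augmentation intensities; the random attention scores and the additive-noise augmentation stand in for the paper's actual attention-map selection strategy and patch-level augmentation set (both assumptions):

```python
# Attention-guided per-patch augmentation intensity, AugViT-style (toy).
import numpy as np

rng = np.random.default_rng(0)
P, D = 196, 16 * 16 * 3            # 14x14 patches of a 224x224 RGB image
patches = rng.standard_normal((P, D))

# Stand-in for the attention map selected from the ViT (e.g., CLS-to-patch
# scores); here it is random for the demo.
attn = rng.random(P)
attn /= attn.sum()

# Map attention to per-patch intensity: heavily attended (salient) patches
# get weaker augmentation so label-relevant content is preserved.
a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
intensity = 0.5 * (1.0 - a)        # max intensity 0.5 is an assumed knob

# Patch-level augmentation with a different intensity per patch.
augmented = patches + intensity[:, None] * rng.standard_normal((P, D))
print(augmented.shape, float(intensity.min()), float(intensity.max()))
```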

Automatic Network Adaptation for Ultra-Low Uniform-Precision Quantization

Tae-Ho KIM, Co-founder and CTO, Nota AI

Abstract (English)

Uniform-precision neural network quantization has gained popularity because it simplifies the densely packed arithmetic units needed for high computing capability. However, it ignores the heterogeneous sensitivity to quantization errors across the layers, resulting in sub-optimal inference accuracy. This work proposes a novel neural architecture search, called neural channel expansion, that adjusts the network structure to alleviate the accuracy degradation caused by ultra-low uniform-precision quantization. The proposed method selectively expands channels for the quantization-sensitive layers while satisfying hardware constraints (e.g., FLOPs, PARAMs). Based on in-depth analysis and experiments, we demonstrate that the proposed method can adapt several popular networks' channels to achieve superior 2-bit quantization accuracy on CIFAR10 and ImageNet. In particular, we achieve the best-to-date Top-1/Top-5 accuracy for 2-bit ResNet50, with smaller FLOPs and parameter size.
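
A toy sketch of the selective channel-expansion idea: greedily widen the layers most sensitive to 2-bit quantization until a FLOPs budget is exhausted. The sensitivities, per-channel costs, and decay model are invented for illustration and are not the paper's search procedure:

```python
# Greedy channel expansion under a FLOPs constraint (illustrative only).
channels = {"conv1": 64, "conv2": 128, "conv3": 256}
sensitivity = {"conv1": 0.8, "conv2": 2.5, "conv3": 1.1}   # assumed loss deltas
flops_per_ch = {"conv1": 1.0, "conv2": 2.0, "conv3": 4.0}  # assumed, MFLOPs

budget, used = 40.0, 0.0  # extra MFLOPs allowed (assumed hardware constraint)
while True:
    # expand the currently most quantization-sensitive layer, 8 channels at a time
    layer = max(sensitivity, key=sensitivity.get)
    cost = 8 * flops_per_ch[layer]
    if used + cost > budget:
        break
    channels[layer] += 8
    used += cost
    # assume sensitivity decays as the layer gets wider (illustrative model)
    sensitivity[layer] *= 0.7

print(channels, f"extra MFLOPs used: {used:.0f}")
```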


Fused Depthwise Tiling for Memory Optimization in TinyML Deep Neural Network Inference

Rafael STAHL , Doctoral candidate, Technical University of Munich

Abstract (English)

Memory optimization for deep neural network (DNN) inference gains high relevance with the emergence of TinyML, which refers to the deployment of DNN inference tasks on tiny, low-power microcontrollers. Applications such as audio keyword detection or radar-based gesture recognition are heavily constrained by the limited memory on such tiny devices, because DNN inference requires large intermediate run-time buffers to store activations and other intermediate data. In this paper, we propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs, which, compared to existing tiling methods, reduces memory usage without inducing any run-time overhead. FDT applies to a larger variety of network layers than existing tiling methods, which focus on convolutions. It improves TinyML memory optimization significantly by reducing the memory of models where this was not possible before, and by providing alternative design points for models that show high run-time overhead with existing methods. To identify the best tiling configuration, we propose an end-to-end flow with a new path discovery method, which applies FDT and existing tiling methods in a fully automated way, including the scheduling of operations and the planning of buffer layouts in memory. Out of seven evaluated models, FDT achieved significant memory reductions of 76.2% and 18.1% for two models where existing tiling methods could not be applied. For two further models that showed significant run-time overhead with existing methods, FDT provided alternative design points with no overhead but reduced memory savings.
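
A back-of-the-envelope sketch of why tiling a fused block shrinks peak activation memory; the tensor sizes and four-way tile split are assumptions, and the model deliberately ignores the scheduling and buffer-layout planning that the full FDT flow performs:

```python
# Peak activation memory: layer-by-layer execution vs. a fused, tiled block.
def peak_mem_unfused(sizes):
    # layer by layer: the input and output of some layer must coexist
    return max(sizes[i] + sizes[i + 1] for i in range(len(sizes) - 1))

def peak_mem_fused_tiled(sizes, n_tiles):
    # the fused block streams one tile at a time, so intermediate buffers
    # shrink roughly by the tile count (halo/overlap ignored for simplicity;
    # splitting along the depth/channel dimension avoids halos for many layers)
    tile = [s / n_tiles for s in sizes]
    return max(tile[i] + tile[i + 1] for i in range(len(tile) - 1))

acts = [160_000, 320_000, 160_000]  # activation bytes per layer (assumed)
print("unfused peak bytes:", peak_mem_unfused(acts))
print("fused, 4 tiles:   ", peak_mem_fused_tiled(acts, 4))
```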


Data Aware Neural Architecture Search

Emil NJOR, PhD Student, Technical University of Denmark

Abstract (English)

Neural Architecture Search (NAS) is a popular tool for automatically generating Neural Network (NN) architectures. Early NAS tools typically optimized NN architectures for a single metric, such as accuracy. However, for resource-constrained Machine Learning, a single metric is not enough to evaluate an NN architecture. For example, an NN model that achieves high accuracy is not useful if it does not fit inside the flash memory of a given system. Therefore, recent works on NAS for resource-constrained systems have investigated various approaches to optimize for multiple metrics. In this paper, we propose that, on top of these approaches, NAS optimization for resource-constrained systems should also consider input data granularity. We name such a system "Data Aware NAS", and we provide experimental evidence of its benefits by comparing it to traditional NAS.
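
A minimal random-search sketch of the Data Aware NAS idea, with input resolution as the data-granularity axis and a width multiplier as the architecture axis; the flash limit and the proxy accuracy/size models are invented for illustration:

```python
# Random search over (data granularity, architecture) under a flash constraint.
import random

random.seed(0)
FLASH_LIMIT = 256_000  # bytes, assumed device constraint

def model_size(width):            # invented proxy for flash footprint
    return int(500_000 * width * width)

def proxy_accuracy(res, width):   # invented proxy: more data + capacity help
    return 1.0 - 0.5 / (res / 32) - 0.3 / (width * 10)

best = None
for _ in range(100):
    res = random.choice([32, 64, 96, 128])         # data granularity axis
    width = random.choice([0.25, 0.5, 0.75, 1.0])  # architecture axis
    if model_size(width) > FLASH_LIMIT:
        continue  # does not fit on the device: reject regardless of accuracy
    acc = proxy_accuracy(res, width)
    if best is None or acc > best[0]:
        best = (acc, res, width)

print("best (proxy acc, resolution, width):", best)
```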


2:40 pm to 3:00 pm

Break

3:00 pm to 4:20 pm

tinyML Hardware

Session Chair: Wolfgang FURTNER, Distinguished Engineer System Architecture, Infineon Technologies

Session Chair: Huichu LIU, Research Scientist, Meta

Benchmarking and modeling of analog and digital SRAM in-memory computing architectures

Pouya HOUSHMAND, PhD Student, KU Leuven

Abstract (English)

In-memory computing is emerging as an efficient hardware paradigm for deep neural network accelerators at the edge, making it possible to break the memory wall and exploit massive computational parallelism. Two design models have surged: analog in-memory computing (AIMC) and digital in-memory computing (DIMC), offering different design spaces in terms of accuracy, efficiency, and dataflow flexibility. This paper targets the fair comparison and benchmarking of both approaches to guide future designs, through (a) an overview of published architectures; (b) an analytical cost model for energy and throughput; and (c) the scheduling of workloads on a variety of modeled IMC architectures for end-to-end network efficiency analysis, offering valuable workload-hardware co-design insights.
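
A first-order sketch of the kind of analytical energy model such a comparison rests on; the per-MAC and per-ADC-conversion energies and the MACs-per-conversion ratio are placeholder assumptions, not values extracted by the paper:

```python
# First-order AIMC vs. DIMC energy model: analog MACs are cheaper per
# operation, but AIMC pays an ADC cost to read out accumulated results.
def imc_energy_per_layer(macs, macs_per_adc_conv=8, style="AIMC"):
    E_MAC = {"AIMC": 0.5, "DIMC": 2.0}[style]  # fJ per MAC (assumed)
    E_ADC = 50.0                               # fJ per conversion (assumed)
    conversions = macs / macs_per_adc_conv if style == "AIMC" else 0
    return macs * E_MAC + conversions * E_ADC  # total fJ

MACS = 115_000_000  # one CNN layer's multiply-accumulates (assumed)
for style in ("AIMC", "DIMC"):
    e = imc_energy_per_layer(MACS, style=style)
    print(f"{style}: {e / 1e9:.2f} uJ")
# With these numbers the ADC overhead dominates AIMC -- exactly the kind of
# trade-off the paper's cost model is built to expose.
```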

MEMA Runtime Framework: Minimizing External Memory Accesses for TinyML on Microcontrollers

Vikas NATESH, PhD Student, Harvard University

Andrew SABOT, PhD Student, Harvard University

Abstract (English)

We present the MEMA framework for the easy and quick derivation of efficient inference runtimes that minimize external memory accesses for matrix multiplication on TinyML systems. The framework accounts for hardware resource constraints and problem sizes in analytically determining optimized schedules and kernels that minimize memory accesses. MEMA provides a solution to a well-known problem in current practice: optimal schedules tend to be found only through a time-consuming, heuristic search of a large scheduling space. We compare the performance of runtimes derived from MEMA to existing state-of-the-art libraries on ARM-based TinyML systems. For example, for neural network benchmarks on the ARM Cortex-M4, we achieve up to a 1.8x speedup and 44% energy reduction over CMSIS-NN.
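
A small sketch of the analytical idea: enumerate block sizes for a tiled matmul under an on-chip buffer budget and score each by a classic external-traffic model. The problem sizes, SRAM capacity, and traffic model are assumptions for illustration, not MEMA's actual derivation:

```python
# Choose matmul block sizes that minimize external memory accesses.
import itertools

M, N, K = 256, 256, 256        # C[M,N] = A[M,K] @ B[K,N]
SRAM_WORDS = 16 * 1024         # on-chip buffer capacity (assumed)

def external_accesses(bm, bn):
    # classic blocked-matmul traffic model: each A block is reloaded once per
    # column stripe of C, each B block once per row stripe; C is written once
    loads_A = M * K * (N // bn)
    loads_B = K * N * (M // bm)
    stores_C = M * N
    return loads_A + loads_B + stores_C

best = None
for bm, bn, bk in itertools.product([8, 16, 32, 64, 128], repeat=3):
    if bm * bk + bk * bn + bm * bn > SRAM_WORDS:  # blocks must fit on-chip
        continue
    cost = external_accesses(bm, bn)  # bk only affects on-chip feasibility
    if best is None or cost < best[0]:
        best = (cost, bm, bn, bk)

print("min external accesses (cost, bm, bn, bk):", best)
```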

Training Neural Networks for Execution on Approximate Hardware

Tianmu LI, PhD Student, UCLA

Abstract (English)

Approximate computing methods have shown great potential for deep learning. Due to their reduced hardware costs, these methods are especially suitable for inference tasks on battery-operated devices that are constrained by their power budget. However, approximate computing has not reached its full potential due to the lack of work on training methods. In this work, we discuss training methods for approximate hardware. We demonstrate how training needs to be specialized for approximate hardware, and propose methods to speed up the training process by up to 18×.
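
A toy illustration of training through a simulated approximate multiplier while computing gradients as if the hardware were exact (a straight-through-style simplification); the 5% multiplicative error model and the logistic-regression task are invented, not the paper's method:

```python
# Hardware-aware training: noisy forward pass, exact backward pass.
import numpy as np

rng = np.random.default_rng(0)

def approx_matmul(x, w):
    exact = x @ w
    return exact * (1.0 + rng.normal(0, 0.05, exact.shape))  # ~5% mult. error

# Tiny logistic-regression "network" on a linearly separable toy task.
X = rng.standard_normal((256, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = np.zeros((8, 1))

for _ in range(300):
    z = approx_matmul(X, w).ravel()   # forward through approximate HW model
    p = 1.0 / (1.0 + np.exp(-z))
    grad = X.T @ (p - y) / len(y)     # backward assumes exact arithmetic
    w -= 0.5 * grad[:, None]

z = approx_matmul(X, w).ravel()       # evaluate under the approximate HW model
print(f"accuracy under approximate hardware: {((z > 0) == y).mean():.3f}")
```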


How Tiny Can Analog Filterbank Features Be Made for Ultra-low-power On-device Keyword Spotting?

Subhajit RAY, PhD Student, Columbia University, New York

Abstract (English)

Analog feature extraction is a power-efficient and re-emerging signal processing paradigm for implementing the front-end feature extractor in on-device keyword-spotting systems. Despite its power-efficiency and re-emergence, there is little consensus on what values the architectural parameters of its critical block, the analog filterbank, should be set to, even though they strongly influence power consumption. Towards building consensus and approaching fundamental power consumption limits, we find via simulation that, through careful selection of its architectural parameters, the power of a typical state-of-the-art analog filterbank could be reduced by 33.6×, while sacrificing only 1.8% in downstream 10-word keyword spotting accuracy through a back-end neural network.
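
A simplified digital stand-in for an analog filterbank feature extractor (resonator bank, rectification, low-pass envelope), showing the kind of architectural parameters being swept; the channel count, pole radius, and frequency range are assumed knobs, and this is not a circuit-level power model:

```python
# Filterbank features: two-pole resonators -> rectify -> envelope low-pass.
import numpy as np

FS, N_CH = 16_000, 16                   # sample rate, filterbank channels
freqs = np.geomspace(100, 7000, N_CH)   # log-spaced center frequencies
r = 0.98                                # pole radius, sets bandwidth/Q

def filterbank_features(x):
    feats = np.zeros((N_CH, len(x)))
    for c, f in enumerate(freqs):
        w = 2 * np.pi * f / FS
        b1, b2 = 2 * r * np.cos(w), -r * r  # resonator feedback coefficients
        y1 = y2 = env = 0.0
        for n, xn in enumerate(x):
            y = xn + b1 * y1 + b2 * y2      # two-pole band-pass resonator
            y2, y1 = y1, y
            env = 0.995 * env + 0.005 * abs(y)  # rectified, smoothed envelope
            feats[c, n] = env
    return feats[:, :: FS // 100]            # ~100 feature frames per second

t = np.arange(FS // 4) / FS                  # 250 ms test tone at 1 kHz
feats = filterbank_features(np.sin(2 * np.pi * 1000 * t))
print(feats.shape, int(np.argmax(feats[:, -1])))  # channel nearest 1 kHz peaks
```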


4:20 pm to 5:00 pm

The Power of Diversity in Machine Learning

Chhavi JAIN, Chief Executive Officer, Live AI Dream

Vijay JANAPA REDDI, Associate Professor, Harvard University

In this talk, we will delve into the necessity of diversity of data, perspectives, and experiences in machine learning. We will explore how this diversity can bring about new ideas and insights, foster innovation, and enable us to create more robust and effective ML models.

Additionally, we will highlight the importance of ensuring that all voices are heard and included in the development of ML, in order to avoid costly product failures and to build more equitable and inclusive AI systems that can become customers' favorite products in the future. Join us to explore the possibilities of tomorrow.

Schedule subject to change without notice.

Committee

Marian VERHELST

General Co-Chair

KU Leuven

Vijay JANAPA REDDI

General Co-Chair

Harvard University

Tinoosh MOHSENIN

Program Co-Chair

University of Maryland Baltimore County

Paul WHATMOUGH

Program Co-Chair

Qualcomm

Boris MURMANN

Steering Committee

Stanford University

Zain ASGAR

Publications Chair

Stanford University

Theocharis THEOCHARIDES

Publicity Chair

University of Cyprus

Wolfgang FURTNER

Infineon Technologies

Arman ROOHI

School of Computing, University of Nebraska-Lincoln

Huichu LIU

Meta

Speakers

Alex JONES

Keynote

National Science Foundation CISE

Christin BOSE

Purdue University

Pouya HOUSHMAND

KU Leuven

Chhavi JAIN

Live AI Dream

Tae-Ho KIM

Nota AI

Tianmu LI

UCLA

Yejia LIU

University of California Riverside

Brett MEYER

McGill University

Marcin NAGY

AVSystem

Vikas NATESH

Harvard University

Emil NJOR

Technical University of Denmark

Danilo PAU

STMicroelectronics

Subhajit RAY

Columbia University, New York

Andrew SABOT

Harvard University

Jesse SANTOS

Analog Devices


Rafael STAHL

Technical University of Munich

Manan SURI

Indian Institute of Technology Delhi

Olivier THERRIEN

McGill University

Zhongzhi YU

EIC Lab at Georgia Institute of Technology
