We are pleased to announce that we have added a new event for 2021: the tinyML Research Symposium. Held in conjunction with the 2021 tinyML Summit, this Symposium will serve as the flagship event for research at the intersection of machine learning applications, algorithms, software, and hardware in deeply embedded machine learning systems. The program will feature speakers from academia and industry who combine cross-layer innovations across these topics.
Pacific Standard Time / UTC-8
8:00 am to 8:15 am
8:15 am to 10:00 am
Application / ML Model Design
Smartphone Impostor Detection with Behavioral Data Privacy and Minimalist Hardware Support
Guangyuan HU, PhD Student, Princeton University
Impostors are attackers who take over a smartphone and gain access to the legitimate user’s confidential and private information. This paper proposes a defense-in-depth mechanism that detects impostors quickly with simple Deep Learning algorithms, achieving better detection accuracy than the best prior work, which used Machine Learning algorithms requiring computation of multiple features. Unlike previous work, we then consider protecting the privacy of a user’s behavioral (sensor) data by not exposing it outside the smartphone. For this scenario, we propose a Recurrent Neural Network (RNN) based Deep Learning algorithm that uses only the legitimate user’s sensor data to learn his/her normal behavior. We propose to use the Prediction Error Distribution (PED) to enhance the detection accuracy. To make on-device, real-time detection possible, we show how a minimalist hardware module, dubbed SID for Smartphone Impostor Detector, can be designed and integrated into smartphones for self-contained impostor detection. Experimental results show that SID can support real-time impostor detection at a very low hardware cost and energy consumption compared to other RNN accelerators.
TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss
Hyojin PARK, Ph.D. Student, Seoul National University
Semi-supervised video object segmentation (semi-VOS) is widely used in many applications. The task is to track class-agnostic objects from a given segmentation mask. To do this, various approaches have been developed based on online learning, memory networks, and optical flow. These methods show high accuracy but are hard to use in real-world applications because of slow inference and very high complexity. To resolve this problem, template matching methods have been devised for fast processing, but they sacrifice a great deal of accuracy. We introduce a novel semi-VOS model based on a template matching method and a novel temporal consistency loss to reduce the performance gap from heavy models while substantially speeding up inference. Our template matching method consists of short-term and long-term matching. The short-term matching enhances target object localization, while the long-term matching improves fine details and handles changes in object shape through the newly proposed adaptive template attention module. However, the long-term matching causes error propagation due to the inflow of past estimated results when updating the template. To mitigate this problem, we also propose a temporal consistency loss for better temporal coherence between neighboring frames by adopting the concept of a transition matrix.
Hardware Aware Training for Efficient Keyword Spotting on General Purpose and Specialized Hardware
Peter BLOUW, Senior Research Scientist and Head of Advanced Projects, Applied Brain Research
Keyword spotting (KWS) provides a critical user interface for many mobile and edge applications, including phones, wearables, and cars. As KWS systems are typically ‘always on’, maximizing both accuracy and power efficiency is central to their utility. In this work we use hardware aware training (HAT) to build new KWS neural networks based on the Legendre Memory Unit (LMU) that achieve state-of-the-art (SotA) accuracy and low parameter counts. This allows the neural network to run efficiently on standard hardware (212 uW). We also characterize the power requirements of custom designed accelerator hardware that achieves SotA power efficiency of 8.79 uW, beating general purpose low power hardware (a microcontroller) by 24x and special purpose ASICs by 16x.
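The 24x figure can be checked from the two power numbers quoted in the abstract. The sketch below is only arithmetic on those numbers. The variable names are illustrative, and the ASIC power is an assumption back-computed from the stated 16x ratio, not a value reported in the abstract.

```python
# Power figures quoted in the abstract, in microwatts.
general_purpose_uw = 212.0  # LMU network on standard hardware
custom_accel_uw = 8.79      # custom designed accelerator

# Ratio of general-purpose power to custom-accelerator power.
ratio_vs_general = general_purpose_uw / custom_accel_uw
print(f"general purpose vs. custom accelerator: {ratio_vs_general:.1f}x")

# Assumption: a special-purpose ASIC that is 16x less efficient than the
# custom accelerator would draw roughly this much power.
implied_asic_uw = custom_accel_uw * 16
print(f"implied ASIC power: {implied_asic_uw:.1f} uW")
```

The ratio rounds to the 24x stated in the abstract, which suggests the 212 uW figure is the microcontroller baseline.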
Characterization of Neural Networks Automatically Mapped on Automotive-grade Microcontrollers
Danilo PAU, IEEE R8 Action for Industry committee corresponding member for the year 2021, STMicroelectronics Italia
Neural networks, and Machine Learning in general, are today among the most promising approaches for building models that describe the behavior and operation of different physical systems. The computational resources needed to train and build such models are undoubtedly large, especially given the amount of data needed to identify the salient parameters of the model. At the same time, the resulting models can be integrated into embedded systems thanks to TinyML technologies, so that they run exactly where the physical phenomena under analysis occur. These technologies have taken hold in the consumer and industrial world, and are also watched with interest by other sectors such as the automotive world. In this article we present a framework for the implementation of models based on neural networks on automotive family microprocessors, demonstrating their efficiency in two typical applications of the vehicle world: intrusion detection on the CAN bus communication network and the determination of the residual capacity of batteries for electric vehicles.
Memory-Efficient, Limb Position-Aware Hand Gesture Recognition using Hyperdimensional Computing
Andy ZHOU, PhD Student, University of California Berkeley
Electromyogram (EMG) pattern recognition can be used to classify hand gestures and movements for human-machine interface and prosthetics applications, but it often faces reliability issues resulting from limb position change. One method to address this is dual-stage classification, in which the limb position is first determined using additional sensors to select between multiple position-specific gesture classifiers. While improving performance, this also increases model complexity and memory footprint, making a dual-stage classifier difficult to implement in a wearable device with limited resources. In this paper, we present sensor fusion of accelerometer and EMG signals using a hyperdimensional computing model to emulate dual-stage classification in a memory-efficient way. We demonstrate two methods of encoding accelerometer features to act as keys for retrieval of position-specific parameters from multiple models stored in superposition. Through validation on a dataset of 13 gestures in 8 limb positions, we obtain a classification accuracy of up to 94.34%, an improvement of 17.79% over a model trained solely on EMG. We achieve this while only marginally increasing memory footprint over a single limb position model, requiring 8x less memory than a traditional dual-stage classification architecture.
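The idea of retrieving position-specific parameters from models stored in superposition can be illustrated with a generic hyperdimensional computing sketch. This is not the authors' encoding: here the keys are random bipolar hypervectors rather than encoded accelerometer features, and the dimension, the position labels, and all function names are hypothetical.

```python
import random

DIM = 2000  # hypervector dimension (illustrative choice)
rng = random.Random(0)

def random_hv():
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    """Elementwise multiplication; binding twice with the same key undoes it."""
    return [x * y for x, y in zip(a, b)]

def bundle(vectors):
    """Elementwise sum, which stores several bound vectors in superposition."""
    return [sum(column) for column in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical setup: one key and one stored model vector per limb position.
positions = ["pos0", "pos1", "pos2", "pos3"]
keys = {p: random_hv() for p in positions}
models = {p: random_hv() for p in positions}

# A single memory vector holds all position-specific models.
memory = bundle([bind(keys[p], models[p]) for p in positions])

def retrieve(position):
    """Unbind with the position key and return the best-matching model label."""
    noisy = bind(keys[position], memory)
    return max(positions, key=lambda p: dot(noisy, models[p]))

print(retrieve("pos2"))
```

Because random high-dimensional vectors are nearly orthogonal, unbinding with one key leaves that position's model plus low-magnitude noise from the others, so a single stored vector can stand in for several position-specific models.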
10:00 am to 10:15 am
10:15 am to 11:25 am
Privacy-Preserving Inference on the Edge: Mitigating a New Threat Model
Kartik PRABHU, MS Student, Stanford University
With the explosion of machine learning at the edge, there’s a major need for privacy-preserving machine learning for edge devices. As the number of devices in homes and other private spaces increases, we can expect to see more malicious actors exploiting the inherent recording capabilities in these systems to harm people. This paper proposes that machine learning is not the problem, but rather a solution, which along with the use of a secure enclave, offers a pragmatic approach to preserving privacy. With a simple programming framework, we show how machine learning application developers can be as productive as usual, while still keeping user data private. We demonstrate our implementation for privacy-preserving machine learning on an embedded system, the Nordic nRF5340 PDK with Arm Cortex-M33, using a relatively large model for person detection.
An Ultra-low Power RNN Classifier for Always-On Voice Wake-Up Detection Robust to Real-World Scenarios
Emmanuel HARDY, Research Engineer, CEA Leti
In this paper we present an ultra-low power (ULP) Recurrent Neural Network (RNN) based classifier for an always-on voice Wake-Up Sensor (WUS) with performance suitable for real-world applications. The purpose of our sensor is to reduce, by at least a factor of 100, the power consumption in background noise of always-on speech processing algorithms such as Automatic Speech Recognition, Keyword Spotting, Speaker Verification, etc. Unlike other published approaches, we designed our wake-up sensor to be robust to unseen real-world noises at realistic levels of speech and noise by carefully designing the dataset and the loss function. We also specifically trained it to mark only the speech start rather than adopting a traditional Voice Activity Detection (VAD) approach. We achieve less than 3% No Trigger Rate (NTR) for a duty cycle of less than 1%, pooled over challenging background noises. We demonstrate the superiority of RNNs on this task compared to the other tested approaches, with an estimated power consumption of 30 nW in 65nm CMOS and a minimal memory footprint of 0.52 kB.
Resource Efficient Deep Reinforcement Learning for Acutely Constrained TinyML Devices
Filip SVOBODA, PhD Student, University of Oxford
The use of Deep Reinforcement Learning (Deep RL) in many resource constrained mobile systems has been limited in scope due to the severe resource consumption (e.g., memory, computation, energy) such approaches exert. As a result, TinyML devices ranging from sensors and cameras to small form-factor robots and drones have been unable to benefit from the advantages of recent Deep RL algorithms that have underpinned breakthrough results in applications of decision and control. In this work, we propose and study a variety of general-purpose techniques designed to lower such system resource bottlenecks for Deep RL by optimizing both the agent algorithms and neural architectures used in these solutions. Experiments show that our Deep RL optimization framework, which combines these techniques, is able to produce significant efficiency gains, to the point that such techniques become feasible for TinyML platforms. We present one representative end-to-end application (viz. network protocol learning) executing on constrained processors (embedded hardware), in addition to simulated control problems addressed assuming limited access to system resources.
Green Accelerated Hoeffding Tree
Eva GARCIA-MARTIN, Data Scientist, Ekkono Solutions
State-of-the-art machine learning solutions mainly focus on creating highly accurate models without constraints on hardware resources. Stream mining algorithms are designed to run on resource-constrained devices, so a focus on low power and on energy and memory efficiency is essential. The Hoeffding tree algorithm is able to create energy-efficient models, but at the cost of less accurate trees in comparison to their ensemble counterparts. Ensembles of Hoeffding trees, on the other hand, create a highly accurate forest of trees but consume five times more energy on average. An extension that tried to obtain similar results to ensembles of Hoeffding trees was the Extremely Fast Decision Tree (EFDT).
This paper presents the Green Accelerated Hoeffding Tree (GAHT) algorithm, an extension of the EFDT algorithm with a lower energy and memory footprint and the same (or higher for some datasets) accuracy levels. GAHT grows the tree by setting individual splitting criteria for each node, based on the distribution of the number of instances over each particular leaf. The results show that GAHT achieves accuracy competitive with EFDT and ensembles of Hoeffding trees while reducing energy consumption by up to 70%.
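For context, the split test used by standard Hoeffding trees can be sketched as follows. This shows only the classic Hoeffding bound check, not GAHT's per-node splitting criteria, which the abstract does not specify. The function names and the example numbers are illustrative assumptions.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within this distance of the mean
    observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n):
    """Split a leaf once the best attribute beats the runner-up by more
    than the bound, so the choice is unlikely to change with more data."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# Hypothetical leaf statistics: the same gain gap observed at two sample counts.
print(should_split(0.30, 0.05, value_range=1.0, delta=1e-7, n=50))
print(should_split(0.30, 0.05, value_range=1.0, delta=1e-7, n=200))
```

The bound shrinks as more instances reach a leaf, so the same gain gap that is inconclusive after a few instances becomes sufficient later. How many instances each leaf sees is therefore a natural quantity on which to base node-specific criteria.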
TENT: Efficient Quantization of Neural Networks on the tiny Edge with Tapered FixEd PoiNT
Hamed Fatemi LANGROUDI, Research Fellow, University of Texas at San Antonio
Vedant KARIA, PhD Student, University of Texas at San Antonio
In this research, we propose a new low-precision framework, TENT, to leverage the benefits of a tapered fixed-point numerical format in TinyML models. We introduce a tapered fixed-point quantization algorithm that matches the numerical format’s dynamic range and distribution to that of the deep neural network model’s parameter distribution at each layer. We also propose an accelerator architecture for the tapered fixed-point format within the TENT framework. Results show that the accuracy on classification tasks improves by up to ~31% with an energy overhead of ~17-30% as compared to fixed-point, for ConvNet and ResNet-18 models.
A VM/Containerized Approach for Scaling TinyML Applications
Meelis LOOTUS, ML Engineer, doc.ai
Although deep neural networks are typically computationally expensive to use, technological advances in the design of both hardware platforms and neural network architectures have made it possible to use powerful models on edge devices. To enable widespread adoption of edge-based machine learning, we introduce a set of open-source tools that make it easy to deploy, update, and monitor machine learning models on a wide variety of edge devices. Our tools bring the concept of containerization to the TinyML world. We propose to package ML and application logic as containers called Runes to deploy onto edge devices. The containerization allows us to target a fragmented Internet-of-Things (IoT) ecosystem by providing a common platform for Runes to run across devices.
Does Form Follow Function? An Empirical Exploration of the Impact of Deep Neural Network Architecture Design on Hardware-Specific Acceleration
Saad ABBASI, PhD Student, University of Waterloo
Advances in deep learning during the last decade have led to state-of-the-art performance across a wide variety of tasks. However, one of the biggest challenges with the widespread adoption of deep learning in an operational manner has been high computational complexity. This challenge is particularly important to tackle given the recent proliferation of smart sensors and edge devices. This has led to hardware-specific acceleration platforms designed specifically to accelerate deep neural network inference based on microprocessor architectural traits. While promising, the degree of inference speed gains achieved via hardware-specific acceleration can vary significantly depending on the design of the deep neural network architecture and the microprocessor being leveraged for inference. The fine-grained relationship between form and function with respect to deep neural network architecture design and hardware-specific acceleration is one area that is not well studied in the research literature, with form often dictated by accuracy as opposed to hardware function. In this study, a comprehensive empirical exploration is conducted to investigate the impact of deep neural network architecture design on the degree of inference speedup that can be achieved via hardware-specific acceleration. More specifically, we empirically study the impact of a variety of commonly used macro-architecture design patterns across different architectural depths through the lens of OpenVINO microprocessor-specific and GPU-specific acceleration. Experimental results showed that while leveraging hardware-specific acceleration achieved an average inference speed-up of 380%, the degree of inference speed-up varied drastically depending on the macro-architecture design pattern, with the greatest speedup achieved on the depthwise bottleneck convolution design pattern at 550%. 
Furthermore, we conduct an in-depth exploration of the correlation between FLOPs requirement, level 3 cache efficacy, and network latency with increasing architectural depth and width. Finally, we analyze the inference time reductions using hardware-specific acceleration when compared to native deep learning frameworks across a wide variety of hand-crafted deep convolutional neural network architecture designs as well as ones found via neural architecture search strategies. We found that the DARTS-derived architecture benefits from the greatest improvement from hardware acceleration (1200%), while the depthwise bottleneck convolution-based MobileNet-V2 has the lowest overall inference time of around 2.4 ms. These findings illustrate the importance of tailoring network architecture design to account for the intricacies of microprocessor architecture traits to enable greater hardware-specific acceleration.
Deep Learning for Compute in Memory
Matthias REISSER, Senior Engineer, Qualcomm Inc
Compute in Memory (CIM) accelerators for neural networks promise large efficiency gains, allowing for deep learning applications on extremely resource-constrained devices. Compared to classical digital processors, computations on CIM accelerators are subject to a variety of noise sources such as process variations, thermal effects, quantization, and more. In this work, we show how fundamental hardware design choices influence the predictive performance of neural networks and how training these models to be hardware-aware can make them more robust for CIM deployment. Through various experiments, we make the trade-offs between energy efficiency and model capacity explicit and showcase the benefits of taking a systems view on CIM accelerator and neural network training co-design.
11:25 am to 11:40 am
11:40 am to 1:20 pm
ML Infrastructure / Optimizations
hls4ml: An Open-Source Co-Design Workflow to Empower Scientific Low-Power Machine Learning Devices
Farah FAHIM, Deputy Head of Quantum Science, Fermilab
Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. We have developed hls4ml, an open-source software-hardware co-design workflow to interpret and translate machine learning algorithms for implementation in FPGAs and ASICs specifically to support domain scientists. In this paper, we describe the essential features of the hls4ml workflow, including network optimization techniques, such as pruning and quantization-aware training, which can be incorporated naturally into the device implementations. We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations and increased usability: new Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long pipeline kernels for low power, and new device backends including an ASIC workflow. Taken together, these and continued efforts in hls4ml will arm a new generation of domain scientists with accessible, efficient, and powerful tools for machine-learning-accelerated discovery.
Compiler Toolchains for Deep Learning Workloads on Embedded Platforms
Max SPONNER
As the usage of deep learning becomes increasingly popular in mobile and embedded solutions, it is necessary to convert the framework-specific network representations into executable code for these embedded platforms. This paper starts with a survey and benchmark of the available open source deep learning compiler toolchains, which focuses on the capabilities and performance of the toolchains in regard to targeting embedded microcontrollers that are combined with a dedicated accelerator in a heterogeneous fashion. The second part focuses on the implementation and evaluation of a compilation flow that targets such a solution and reuses one of the existing toolchains to demonstrate the necessary steps for hardware developers to build a software flow for their product.
Quantization-Guided Training for Compact TinyML Models
Sek CHAI, Co-founder and CTO, Latent AI
We address the methodology to train and quantize deep neural networks (DNNs) in order to produce compact models while maintaining algorithmic accuracy. In this paper, we propose a Quantization Guided Training (QGT) method to guide DNN training towards optimized low-bit-precision targets and reach extreme compression levels below 8-bit precision. Unlike standard quantization-aware training (QAT) approaches, QGT uses customized regularization to encourage weight values towards a distribution that maximizes accuracy while reducing quantization errors. We validate QGT using state-of-the-art model architectures (MobileNet, ResNet) on vision datasets. We also demonstrate its effectiveness with an 81KB tiny model for person detection quantized down to 2-bit precision (representing a 17.7x size reduction), with an accuracy drop of only 3% compared to a floating-point baseline.
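To make the notion of quantization error concrete, the sketch below applies plain uniform min-max quantization to a handful of weights. It is a generic illustration, not the QGT method: the customized regularization described in the abstract is not reproduced, and the function name and sample weights are hypothetical.

```python
def quantize_uniform(weights, bits):
    """Map each weight to the nearest of 2**bits evenly spaced levels
    between the minimum and maximum weight, and return the dequantized
    values together with the spacing between levels."""
    intervals = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights are identical, so there is nothing to quantize.
        return list(weights), 0.0
    step = (hi - lo) / intervals
    dequantized = [lo + round((w - lo) / step) * step for w in weights]
    return dequantized, step

# Hypothetical weights from a single layer.
weights = [-0.82, -0.31, -0.05, 0.02, 0.27, 0.64, 0.90]
approx, step = quantize_uniform(weights, bits=2)

# Worst-case rounding error; it cannot exceed half a step.
max_error = max(abs(w - a) for w, a in zip(weights, approx))
print(approx)
print(max_error, step / 2)
```

At 2 bits only four levels are available, so weights that cluster near those levels incur little error. This is the property that a regularizer steering the weight distribution during training would aim to exploit.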
SWIS – Shared Weight bIt Sparsity for Efficient Neural Network Acceleration
Shurui LI, PhD Student, UCLA
Quantization is spearheading the increase in performance and efficiency of neural network computing systems making headway into commodity hardware. We present SWIS – Shared Weight bIt Sparsity, a quantization framework for efficient neural network inference acceleration delivering improved performance and storage compression through an offline weight decomposition and scheduling algorithm. SWIS can achieve up to 52% (19.8%) point accuracy improvement when quantizing MobileNet-v2 to 4 (2) bits post-training (with retraining) showing the strength of leveraging shared bit-sparsity in weights. SWIS accelerator gives up to 6X speedup and 1.8X energy improvement over state of the art bit-serial architectures.
Hypervector Design for Efficient Hyperdimensional Computing on Edge Devices
Toygun BASAKLAR, PhD Student, University of Wisconsin-Madison
Hyperdimensional computing (HDC) has emerged as a new light-weight learning algorithm with smaller computation and energy requirements compared to conventional techniques. In HDC, data points are represented by high dimensional vectors (hypervectors), which are mapped to high dimensional space (hyperspace). Typically, a large hypervector dimension (>= 1000) is required to achieve accuracies comparable to conventional alternatives. However, unnecessarily large hypervectors increase hardware and energy costs, which can undermine their benefits. This paper presents a technique to minimize the hypervector dimension while maintaining the accuracy and improving the robustness of the classifier. To this end, we formulate hypervector design as a multi-objective optimization problem for the first time in the literature. The proposed approach decreases the hypervector dimension by more than 128x while maintaining or increasing the accuracy achieved by conventional HDC. Experiments on a commercial hardware platform show that the proposed approach achieves more than two orders of magnitude reduction in model size, inference time, and energy consumption. We also demonstrate the trade-off between accuracy and robustness to noise and provide Pareto front solutions as a design parameter in our hypervector design.
Schedule subject to change without notice.
Vijay JANAPA REDDI
Facebook Reality Labs
University of California
H. T. KUNG
University of Maryland Baltimore County
Arizona State University
University of Michigan
GrAI Matter Labs
University of Cyprus