About tinyML Asia
Machine learning (ML) is at the forefront of providing artificial intelligence to all aspects of computing. It is the technology powering many of today’s advanced applications from image recognition to voice interfaces to self-driving vehicles and beyond. Many of these initial ML applications require significant computational resources most often found in cloud-scale data centers. To enable industry usage and adoption, it is, therefore, necessary to significantly reduce the power consumed to bring applications to end devices at the cloud edge (smartphones, wearables, vehicles, IoT devices, etc.) and to reduce the load on required data center resources.
tinyML Asia Technical Forum 2022 will be held on November 29-30, 2022 from 9 to 11:30 am (China Standard Time, UTC+8) each day. The online workshop will focus on applications, end users, and the supply chain for tinyML from both a global and an Asian perspective. Unlike other large industry and academic events that lack a focus on low-power ML solutions, tinyML events cover the entire ecosystem, bringing industry and academia together.
China Standard Time (CST) / UTC+8
9:00 am to 9:05 am
Opening / Welcome
9:05 am to 9:40 am
Biology-inspired Space Imaging with Neuromorphic Systems
Gregory COHEN, Associate Professor of Neuromorphic Systems and Deputy Director of the International Centre for Neuromorphic Systems (ICNS), Western Sydney University
9:40 am to 10:05 am
Software Driven TinyML Hardware Co-Design
Weier WAN, Head of Software-Hardware Co-design, Aizip
Software-hardware co-design is at the heart of enabling efficient TinyML systems. Over the past few years, the TinyML community has made tremendous progress: TinyML software developers have invented techniques that make AI models smaller, consume less memory, and require less computation; TinyML hardware developers have designed accelerator chips that significantly lower inference latency and power consumption on TinyML benchmarks (e.g. MLPerf).
To realize a further leap in efficiency, TinyML software and hardware should be more closely co-optimized end to end. Accelerator hardware should be designed not just for the models and metrics in public benchmarks, but also for robust models used in production and the requirements of real-world tasks. Similarly, models and software should be optimized not just for the number of parameters and memory size, but also for full-system-level efficiency and cost.
Aizip is enabling the future of software-hardware co-design by offering accelerator co-design services to IC companies. The service leverages Aizip’s rich family of production quality TinyML models spanning vision, audio and time series applications. Aizip further co-designs accelerator hardware with its IC partners to most efficiently support these models, thereby combining the best of both worlds.
To reach the next level of efficiency, Aizip’s co-design service focuses on processing-in-memory (PIM) AI accelerators. PIM accelerators significantly reduce AI inference power consumption by eliminating the power-hungry data movement of conventional AI hardware. Aizip’s full-stack PIM co-design services include silicon-verified PIM IPs, chip architecture design, PIM-optimized neural networks (PIMNets), and PIM-aware training frameworks. This full-stack co-design ensures that PIM products deliver superior system-level energy efficiency without compromising accuracy or robustness compared to digital processors.
10:05 am to 10:30 am
Sensor fusion on low power AI pre processor
James WANG, Chief Strategy & Sales Officer, eYs3D
Sensor fusion combines sensor data or data derived from disparate sources so that the resulting information has less uncertainty than would be possible if these sources were used individually. For sensor fusion technology to meet the market demand for real-time, low-latency operation, a “centralized fusion approach” is adopted. The clients simply forward all the data to a central location, and an entity at the central location is responsible for correlating and fusing the data. In this session, we discuss approaches to utilizing an ARM NPU and CPU for running algorithms and adopting ML with routed digital ASIC designs.
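To illustrate the claim that fused information carries less uncertainty than any single source, here is a minimal Python sketch of inverse-variance weighting, a textbook fusion rule for independent estimates of the same quantity. It is not taken from the talk: the function name `fuse`, the sensor labels and the readings are all hypothetical, and the measurements are assumed to be independent with known variances.

```python
def fuse(measurements):
    """Fuse independent (value, variance) estimates of the same quantity.

    Each estimate is weighted by the inverse of its variance, so more
    certain sensors contribute more. Returns (fused_value, fused_variance).
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    weights = [1.0 / variance for _, variance in measurements]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, measurements)) / total
    # The fused variance is never larger than the smallest input variance.
    return value, 1.0 / total


# Hypothetical readings of one distance, collected at the central node.
depth_camera = (2.10, 0.04)  # metres, variance in m^2 (assumed)
radar = (2.00, 0.01)         # metres, variance in m^2 (assumed)
fused_value, fused_variance = fuse([depth_camera, radar])
print(f"fused = {fused_value:.3f} m, variance = {fused_variance:.4f} m^2")
```

In a centralized scheme, a routine of this kind would run at the central location after the clients have forwarded their raw readings.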
10:30 am to 10:40 am
tinyRadar for fitness: A radar-based contactless activity tracker for edge computing
Satyapreet SINGH, Masters Student, NeuRonICS Lab, Indian Institute of Science
Exercising regularly is essential for maintaining a healthy lifestyle. Studies have shown that knowing parameters such as calories burnt, step counts, and miles travelled helps track fitness goals and maintain motivation. Modern fitness trackers aim to provide these metrics in real time, with features such as ubiquitous connectivity and a lightweight design. Most fitness trackers are either wearable or camera-based. Wearable fitness trackers can cause discomfort during exercise, and exchanging users’ data over the internet poses a privacy threat. Camera-based fitness trackers also often pose a privacy risk. Certain works mitigate this by hiding users’ faces through complex algorithms, which adds to the computation cost and requires high-end computers to make readings available in real time. We propose tinyRadar, a contactless fitness tracker that identifies the exercise performed by the user and counts its repetitions. It comes in a small form factor and preserves the user’s privacy because it provides point cloud data, making it a suitable choice for a fitness tracker in a smart home setting.
As can be seen from Figure 1, it comprises a Texas Instruments (TI) IWR1843 mmWave radar board for sensing and processing and an ESP32 module for transmitting results to the user’s smartphone through Bluetooth Low Energy (BLE). The mmWave radar processes the information received from the target environment to create a Velocity-Time map, which contains unique signatures for different human activities. A three-layer Convolutional Neural Network (CNN) deployed on the radar board takes the Velocity-Time map as input and classifies the exercise performed by the user into one of the following eight categories: Crossover toe touch, Crunches, Jogging, Lateral squats, Lunges, Hand rotation, Squats, and Rest. The repetition count algorithm runs in parallel on the board to compute the repetition counts. The system provides a real-time subject-independent classification accuracy of 97% and repetition counts with an accuracy of ±4 counts.
- Fitnessarmbaender – Nur zwei von zwoelf sind gut, test.de, December 27, 2015. Retrieved on January 6, 2016.
- Sly, Liz (29 January 2018). “U.S. soldiers are revealing sensitive and dangerous information by jogging”. The Washington Post. Retrieved 29 January 2018.
- Rushil Khurana, Karan Ahuja, Zac Yu, Jennifer Mankoff, Chris Harrison, and Mayank Goel. 2018. GymCam: Detecting, Recognizing and Tracking Simultaneous Exercises in Unconstrained Scenes. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2, 4, Article 185 (December 2018), 17 pages. https://doi.org/10.1145/3287063
Figure 1: tinyRadar as a contactless fitness tracker. (a) IWR1843 mmWave radar board and ESP32 integrated inside a housing. (b) Breakout view of the tinyRadar assembly, where the radar board and ESP32 are connected using an interface board. (c) Functional block diagram showing the RF front end, the radar signal processing and classification module, and the result transmission module comprising the ESP32. The results are transmitted over BLE to a smartphone.
10:40 am to 11:05 am
An All-Digital Reconfigurable SRAM-Based Compute-in-Memory Macro for TinyML Devices
Runxi WANG, Ph.D Student, University of Michigan-Shanghai Jiao Tong University Joint Institute
Tiny machine learning (TinyML), which targets energy efficiency and low cost in AI-to-device integration, has proven to be an effective implementation for data-intensive smart IoT devices. But device heterogeneity and MCU constraints are two main challenges to further applications based on tinyML. Conventional computer architectures, such as the von Neumann architecture, separate memory from computing and therefore incur significant costs in transferring data on TinyML devices. To overcome this barrier, an emerging architecture called compute-in-memory (CIM) sheds new light by eliminating the boundary between memory and computation. This is especially beneficial for tinyML models that run inference on power-constrained devices. Among the different flavours of CIM architecture, SRAM-based CIM is gaining increasing interest due to its compatibility with advanced technology nodes and its performance advantages.
Device heterogeneity has introduced tremendous advantages to today’s computing landscape, yet it also leads to costly design efforts and longer time to market. This is even more critical in architecting CIM, where customization efforts are required. Learning from history, reconfigurability has offered great opportunities for balancing performance tradeoffs and design costs. It has enabled generations of new architectures that can be reconfigured at different levels. Existing CIM architectures were usually designed under the constraints of very specific applications. For example, one work proposed a 12T cell design useful for bit-wise XNOR calculation. Another proposed two peripheral designs for bit-serial and bit-parallel vector accelerators, for throughput-oriented and efficiency-oriented scenarios respectively. A third looked at the full-system design of a DNN computing cache architecture. A few recent CIM works have attempted to integrate reconfigurability: one presented a system-level design that offers reconfigurable precision support for cloud computing, and another proposed a CIM macro supporting reconfigurable bit-wise operations. These works showed the feasibility of adopting reconfigurable CIM architectures in AI applications and also the great potential of integrating more functionality on a single macro.
Inspired by these works, we propose to develop a cross-layer reconfigurable CIM solution that exploits opportunities at every level, from cells, arrays, and peripherals to architecture and system design. In this talk, we will show the progress of the first step on the roadmap, in which we design a novel all-digital SRAM cell macro that is able to support multiple commonly used operators from TinyML models in an all-in-one fashion.
Many compute-in-memory cell designs integrate the computing logic circuits mainly in the memory array peripherals. This approach relies heavily on the operands’ physical locations, and the performance improvement is thus limited. In order to give operands more flexibility while still guaranteeing low latency for simple arithmetic operations, part of the computation workload, along with the reconfigurable capability, is integrated within the SRAM cell in this work. We introduce a novel 10T multi-logic SRAM cell that supports at least five basic operation modes: writing, reading, and bit-wise AND/OR/XOR logical calculations. More complex functions are then built upon these cell-level operations with extended circuitry. In Table I, we list the function modes currently supported by the proposed CIM macro.
It is worth mentioning that a pipelined scheme has been employed at the peripherals to further improve performance. The design also makes use of concurrent reading of two adjacent cells and separates the carry-ripple adder into two stages to reduce latency in addition operations. The proposed design has been implemented and simulated in Cadence Virtuoso with an industry-standard 28nm process node. Since it is purely digital, the design is synthesizable and compatible with the standard top-down digital implementation flow. In the talk, we will present detailed simulation results, the design methodology and comparisons against other state-of-the-art designs. The final goal of this project is to deliver a system-level compute-in-memory design approach that can support a wide range of tinyML algorithms.
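The abstract describes more complex functions being built on the cell-level AND/OR/XOR operations, with addition handled by a carry-ripple adder. As a rough software illustration of that idea only, the Python sketch below composes a ripple-carry adder from those three bit-wise operations. The function names and the 8-bit default width are assumptions, and the sketch does not model the circuit, its pipelining, or the two-stage adder split mentioned above.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built only from AND, OR and XOR."""
    partial = a ^ b
    total = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return total, carry_out


def ripple_add(x, y, width=8):
    """Add two unsigned integers bit by bit, least significant bit first.

    Returns (sum wrapped to `width` bits, final carry-out bit).
    """
    result, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


total, carry = ripple_add(13, 7)
print(total, carry)
```

Each loop iteration must wait for the carry from the previous bit, which is the latency that hardware techniques such as staging the adder aim to reduce.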
11:05 am to 11:30 am
Attacks on Tiny Intelligence
Yuvaraj GOVINDARAJULU, Senior Research Engineer, Bosch AI Shield
The field of TinyML has seen fast growth in the last few years in the areas of algorithms, hardware, and software, especially enabling several applications and use-cases. The applications include human-machine interface, health care, smart factories, monitoring and surveillance which are often safety critical. The availability of hardware, datasets and more importantly frameworks has enabled the so called “democratization” of TinyML. Recent works in Neuromorphic computing and On-device learning are significant in enhancing the capabilities.
Traditional AI systems are vulnerable to threats such as model theft, model evasion, data poisoning and membership inference. Several works on adversarial threats to traditional AI systems have explored these vulnerabilities and paved the way for research on defenses for these systems. However, the threats to TinyML devices remain largely unaddressed. Because TinyML devices differ from traditional AI systems in various respects, the existing approaches from classical ML cannot be used for TinyML. Special security considerations become especially important since TinyML devices are typically deployed in open (unmonitored) environments. Because similar tiny devices are used in applications such as edge computing, traditional embedded security has partly addressed the security of tiny embedded devices. However, the security of AI on tiny embedded devices is a relatively unexplored area.
In this presentation, we talk about the need for security of tiny intelligence in addition to classical embedded security. We present the results of our Embedded-AI Security research related to attacks on tiny intelligence. We provide an overview of the possible kill chain for such attacks. Using relevant use cases, we demonstrate how AI models can be vulnerable and provide direction on preventing such attacks through defense mechanisms.
China Standard Time (CST) / UTC+8
9:30 am to 9:35 am
Opening / Welcome Day 2
9:35 am to 10:10 am
In-memory computing and Dynamic Vision Sensors: Recipes for tinyML in Internet of Video Things
Arindam BASU, Professor, Department of Electrical Engineering, City University of Hong Kong
Vision sensors are unique for IoT in that they provide rich information but also require excessive bandwidth and energy, which limits the scalability of this architecture. In this talk, we will describe our recent work on using event-driven dynamic vision sensors for IoVT applications such as unattended ground sensors and intelligent transportation systems. To further reduce the energy of the sensor node, we utilize in-memory computing (IMC): the SRAM used to store the video frames is also used to perform basic image processing operations and to trigger the subsequent deep neural networks. Lastly, we introduce a new concept of hybrid IMC combining multiple types of memory.
10:10 am to 10:35 am
Hardware-aware model optimization in Arm Ethos-U65 NPU
Shinkook CHOI, Lead Core Research, Nota Inc
As deep learning advances, edge devices and lightweight neural networks are becoming even more important. In order to reduce latency on an AI accelerator, it is important not only to reduce FLOPs but also to increase hardware efficiency. By analyzing the hardware performance of the Arm Ethos-U65 NPU using Arm Virtual Hardware, we discovered that the latency of the convolution layer shows a staircase pattern that varies with the configuration. To fully utilize the Arm Ethos-U65 NPU, the parameters of the compression methods should be set with this latency pattern in mind. By applying the device characteristics of the Arm Ethos-U65 NPU to NetsPresso, a hardware-aware AI optimization platform, we adjusted the parameters of structured pruning and filter decomposition to fit a multiple of the step size of the latency staircase pattern. In the image classification task, we validated that hardware-aware model compression increased FLOPs and accuracy at the same latency.
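The staircase idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NetsPresso code or measured Ethos-U65 behaviour: the step size of 8 channels, the per-step cost, and the helper names `staircase_latency` and `snap_to_step` are all hypothetical. If latency only changes at multiples of a step, a pruned channel count can be rounded up to the next multiple and keep the same latency while retaining more filters.

```python
import math


def staircase_latency(channels, step, cost_per_step=1.0):
    """Toy latency model: cost rises only when another step of channels is needed."""
    return math.ceil(channels / step) * cost_per_step


def snap_to_step(pruned_channels, step, original_channels):
    """Round a pruned channel count up to the next multiple of `step`.

    The result is capped at the layer's original width, since pruning
    cannot add channels.
    """
    snapped = math.ceil(pruned_channels / step) * step
    return min(snapped, original_channels)


# Hypothetical layer: 64 output channels, pruning proposes keeping 27.
kept = snap_to_step(27, step=8, original_channels=64)
print(kept, staircase_latency(27, 8), staircase_latency(kept, 8))
```

Under this toy model the snapped layer keeps more channels than the raw pruning result at an identical latency, which matches the direction of the result reported in the abstract.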
10:35 am to 11:00 am
Xylo: A sub-mW, low-dimensional Signal Neuromorphic Processor
Yuya LING, Algorithm Application Engineer, SynSense
The neuromorphic industry is growing rapidly, with significant contributions from large players such as Intel and IBM. SynSense is developing approaches for machine vision, bio- and industrial-signal processing, based on binary event-based neuronal architectures and communication emulated on low-power neuromorphic hardware. Our advantages are our long experience in designing low-power neuromorphic circuits, and our expertise in theory of neuromorphic computation.
Xylo is a new family of ultra-low-power (<1mW) neuromorphic ML inference processors designed by SynSense. Xylo is a fully programmable neuromorphic architecture supporting feed-forward, recurrent, reservoir and other complex spiking neural networks. Xylo-A2 also integrates a dedicated analog audio front end supporting low-power real-time processing of single-channel audio signals.
Xylo provides efficient simulation of spiking leaky integrate-and-fire neurons with exponential input synapses. Xylo is highly configurable, and supports individual synaptic and membrane time constants, thresholds and biases for each neuron. Xylo supports arbitrary network architectures, including recurrent networks, with up to 1000 neurons. With the adoption of SynSense’s novel spiking neural network algorithms, Xylo is able to power TinyML applications continuously at a few hundred µW.
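For readers unfamiliar with the neuron model named above, the following Python sketch simulates a single leaky integrate-and-fire neuron with an exponential input synapse in discrete time. It is a generic floating-point textbook formulation, not Xylo’s hardware arithmetic or SynSense’s software API; the parameter names, default values and the reset-by-subtraction rule are assumptions made for illustration.

```python
import math


def simulate_lif(input_spikes, tau_syn=5.0, tau_mem=20.0, threshold=1.0,
                 bias=0.0, weight=0.5, dt=1.0):
    """Simulate one leaky integrate-and-fire neuron.

    input_spikes: sequence of 0/1 input events, one per time step.
    Returns a list of 0/1 output spikes of the same length.
    """
    syn_decay = math.exp(-dt / tau_syn)   # exponential synapse decay per step
    mem_decay = math.exp(-dt / tau_mem)   # membrane leak per step
    i_syn, v_mem = 0.0, 0.0
    output = []
    for spike in input_spikes:
        i_syn = i_syn * syn_decay + weight * spike
        v_mem = v_mem * mem_decay + i_syn + bias
        if v_mem >= threshold:
            output.append(1)
            v_mem -= threshold            # reset by subtraction (assumed)
        else:
            output.append(0)
    return output


print(simulate_lif([1, 1, 0, 0, 1, 1, 1, 0]))
```

The per-neuron time constants, threshold and bias appear here as function arguments, mirroring the kinds of quantities the abstract says can be configured individually.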
11:00 am to 11:10 am
Automated Yard Processes using TinyML
Hpone Myat KHINE, Data Scientist Intern, SAP
Managing incoming and outgoing trucks at a yard is highly challenging, as different trucks are bound for different locations. Given the competitive pressure in worldwide markets, it is useful to use professional logistics processes to help reduce complexity. The current approach allows the Yard Manager to retrieve the tasks associated with a vehicle, but it can be repetitive and thus could be further streamlined.
One way to streamline these processes is to automate the detection of license plate numbers as trucks check in at the yard. This project demonstrates the utility of the professional yard management software by building computer vision models on embedded devices to create a physical prototype that showcases the automated gate-in process.
Different from traditional computer vision models, the challenge lies in building a low-memory machine learning (TinyML) model which is suitable for embedded devices. In addition, the retrieving and sending of information such as the dangerous goods sign of a truck, and the driver’s next location are also integrated with the yard management software. Finally, the completed prototype will be housed on-site at the company, where it will be used as an innovation showcase.
11:10 am to 11:35 am
Power Efficient Neural Network Based Hardware Architecture for Biomedical Applications
Omiya HASSAN, Ph.D. Candidate/Graduate Instructor, Dept. of EECS, University of Missouri
The future of critical health diagnosis will involve intelligent devices that are low-cost, wearable, and lightweight, and that require power-efficient hardware platforms. Various machine learning (ML) models, such as deep learning architectures, have been employed in the design of intelligent systems. However, deploying such systems on edge devices with limited hardware resources and power budgets is difficult because of their requirements for high computational power and system accuracy. This creates a significant gap between the advancement of computing technology and the associated device technologies for healthcare applications. To address this issue, this research introduces a power-efficient, ML-based digital hardware design technique that yields a compact design while maintaining optimal prediction accuracy.
A hardware design approach called SABiNN (Shift-Accumulate Based binarized Neural Network), a 2’s complement based binarized digital hardware technique, is proposed and developed. Neural network (NN) models such as feedforward FCNN models were selected to analyze and validate the proposed method. Deep compression learning techniques for hardware implementation, such as n-bit integer quantization and deterministic binarization of the model’s hyper-parameters, were also employed. Instead of the matrix-multiplication-based multiply-accumulate (MAC) function typically used in hardware accelerators, shifters and 2’s complements were introduced, which reduced the power consumption rate by 5x. On the CMOS platform, XNOR gates were redesigned with NAND gates, which significantly reduced signal noise and increased the precision rate. Moreover, during inference there was no need for internal or external memory devices to store the network’s weights and biases, because the hyper-parametric values are fixed. As a result, the processing power and model size were significantly reduced.
To apply the SABiNN method to biomedical applications, the system architecture of a sleep apnea (SA) detection device is proposed which can detect SA events in real time. The input to the system consists of data from two biosensors: an ECG signal from chest movement and an SpO2 measurement from a fingertip pulse oximeter. In the training phase, open-source real patient datasets from the PhysioNET bank were used. The proposed hardware model achieved a targeted accuracy of 81.5% (CI: 3.5%), which is medically acceptable and realistic. During inference, all the parameters were successfully extracted, and each component was designed on re-programmable hardware before being translated onto the CMOS platform. A general-purpose Nexys Artix-7 FPGA was used for hardware model validation, and the final post-trained model was integrated on the CMOS platform using both 130nm and 180nm design processes. An NN model with three hidden layers was designed on re-programmable hardware with 8-bit input and 1-bit output channels. The model was further expanded on silicon to four hidden layers with 16-bit input and 1-bit output channels, which also increased the model accuracy to 88%. The Rectified Linear Unit (ReLU) activation function was used on the hidden layers and a hard sigmoid function at the output layer; both functions were designed using stacked 2:1 multiplexers. The final power consumption of the model was ~5W (5mJ) with 10ms clock delay on re-programmable hardware and ~10uW (1nJ) with 10ns clock delay on silicon. The success of this design technique can be leveraged in developing a fully integrated system-on-a-chip (SoC) biomedical system for wearable applications.
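The following Python sketch shows one plausible reading of how a multiplier-free neuron of this kind can work: with weights binarized to +1 or -1, each multiplication reduces to either passing the input through or taking its 2’s complement, and scaling is done with a shift. This is an illustration under assumptions, not the SABiNN design itself; the function names, the 16-bit word width and the shift amount are hypothetical, and the multiplexer is modelled as a simple select function.

```python
def twos_complement_negate(x, bits=16):
    """Negate a signed integer as hardware would: invert all bits, then add one."""
    mask = (1 << bits) - 1
    raw = (((x & mask) ^ mask) + 1) & mask
    # Reinterpret the bit pattern as a signed value.
    return raw - (1 << bits) if raw & (1 << (bits - 1)) else raw


def mux2(select, when_zero, when_one):
    """2:1 multiplexer: pick one of two inputs based on a select bit."""
    return when_one if select else when_zero


def binarized_neuron(inputs, weight_signs, shift=0, bits=16):
    """Accumulate inputs under +1/-1 weights without using a multiplier."""
    acc = 0
    for x, sign in zip(inputs, weight_signs):
        acc += mux2(sign < 0, x, twos_complement_negate(x, bits))
    return acc >> shift  # arithmetic right shift scales by 2**-shift


def relu(x):
    """ReLU as a multiplexer controlled by the sign of the input."""
    return mux2(x < 0, x, 0)


print(relu(binarized_neuron([7, 2, 4], [1, -1, 1], shift=1)))
```

Because the weight signs are fixed after training, a hardware version could wire them in directly, which is consistent with the abstract’s remark that no weight memory is needed during inference.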
11:35 am to 12:00 pm
TILE-MPQ: Design Space Exploration of Tightly Integrated Layer-WisE Mixed-Precision Quantized Units for TinyML Inference
Xiaotian ZHAO, PhD Student, Shanghai Jiao Tong University
Deep learning and machine learning have enjoyed great popularity and achieved huge success in the last decade. With the wide adoption of inference on tinyML or Edge AI platforms, it becomes increasingly critical to develop optimal models that can fit the limited hardware budget. However, state-of-the-art models with higher accuracy also come with large sets of parameters, which require a significant amount of computing resources and a large memory footprint. As a result, this limits the deployment of real-time DNN models on the edge, in domains such as real-time health monitoring, autonomous driving, real-time translation and various other applications. Quantization has been demonstrated to be one of the most efficient model compression solutions for reducing the size and runtime of a network via reduction of bit-width. Quantization reduces the precision of network components to the extent that the network is more lightweight but without a “noticeable” difference in efficacy. The most straightforward way of applying quantization is called uniform quantization, where the resulting quantized values (aka quantization levels) are uniformly spaced. For example, uniform INT8 has been widely used in inference AI models. As the demand for inference speed keeps increasing and on-device storage becomes limited, lower-precision quantization schemes such as INT4 have been employed. But a uniform scheme (e.g. all layers quantized to 4 bits) can cause significant accuracy loss.
In order to further reduce hardware consumption and compress the model, mixed-precision quantization (MPQ) was proposed, where some layers maintain higher precision and others are quantized to lower precision. It therefore strikes a better balance between accuracy and hardware cost. However, many of these techniques were purely accuracy-driven, without considering hardware implementation costs. When it comes to edge inference units, the limitation set by hardware resources requires a very fine-grained design space search process that is able to provide explainable tradeoffs. This certainly applies to the problem of finding the best mixed-precision quantization scheme at the edge, which needs to be both hardware- and application-friendly. While a few existing MPQ solutions have introduced hardware awareness into the search process, the search is either computing-intensive or model-specific. Thus a light and explainable hardware-aware MPQ methodology is required. In this work, we propose a novel MPQ search algorithm that first “samples” the layer-wise sensitivity with respect to a newly proposed metric, which incorporates both accuracy (from the application perspective) and a proxy of cost (from the hardware perspective). Based on the hardware budget and accuracy requirements, a candidate search space can be decided, and the optimal MPQ scheme is further explored. Evaluation results show that the proposed solution achieves 3%-11% higher inference accuracy with similar hardware cost compared to state-of-the-art MPQ strategies. At the hardware level, we propose a new processing-in-memory (PIM) architecture that tightly integrates the optimal MPQ policies as part of the processor pipeline through Instruction Set Architecture (ISA) and micro-architecture co-design. To summarize, this work makes the following contributions:
- We define a new metric that takes hardware cost and accuracy loss into consideration simultaneously while performing the search for optimal MPQ schemes.
- We propose a light MPQ search methodology that analyzes single-layer sensitivity first and then narrows down the search space. The proposed strategy outperforms other existing hardware-aware MPQ search solutions for lightweight neural network models.
- We look into the adaptation of the optimal MPQ schemes at the hardware level and propose a tightly integrated layer-wise quantized mixed-precision unit that resides in the CPU pipeline. The customized architecture allows seamless processing of both AI and non-AI tasks on the same hardware.
In this talk, we will discuss the details of the proposed search methodology for layer-wise hardware-aware MPQ, including the design space exploration, hardware architecture design methodology and evaluation results.
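The tradeoff that mixed-precision quantization navigates can be seen in a small Python sketch of symmetric uniform quantization applied with a different bit-width per layer. This is a generic illustration, not the TILE-MPQ search algorithm or its metric: the function names, the toy weights, and the use of mean squared error and weight storage as stand-ins for accuracy loss and hardware cost are all assumptions.

```python
def quantize_uniform(values, bits):
    """Symmetric uniform quantization: snap values to evenly spaced levels."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def evaluate_scheme(layers, bit_widths):
    """Return (total quantization error, total weight storage in bits)."""
    error = sum(mean_squared_error(w, quantize_uniform(w, b))
                for w, b in zip(layers, bit_widths))
    storage = sum(len(w) * b for w, b in zip(layers, bit_widths))
    return error, storage


# Two toy "layers" of weights (hypothetical values).
layers = [[0.9, -0.31, 0.05, 0.47], [0.12, -0.8, 0.33, -0.02]]
for scheme in ([8, 8], [8, 4], [4, 4]):
    error, storage = evaluate_scheme(layers, scheme)
    print(scheme, f"error={error:.6f}", f"storage={storage} bits")
```

A hardware-aware search of the kind described above would score candidate bit-width assignments with a richer metric and prune the candidate space using per-layer sensitivity, rather than enumerating schemes as this loop does.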
Schedule subject to change without notice.
Seeed Studio and Chaihuo makerspace
Odin SHEN 沈綸銘
Chetan SINGH THAKUR
Indian Institute of Science (IISc)
Qualcomm Research, USA
City University of Hong Kong
Western Sydney University
Bosch AI Shield
Dept. of EECS, University of Missouri
Hpone Myat KHINE
NeuRonICS Lab, Indian Institute of Science
University of Michigan-Shanghai Jiao Tong University Joint Institute
Shanghai Jiao Tong University