About tinyML Asia
Machine learning (ML) is at the forefront of providing artificial intelligence to all aspects of computing. It is the technology powering many of today’s advanced applications from image recognition to voice interfaces to self-driving vehicles and beyond. Many of these initial ML applications require significant computational resources most often found in cloud-scale data centers. To enable industry usage and adoption, it is, therefore, necessary to significantly reduce the power consumed to bring applications to end devices at the cloud edge (smartphones, wearables, vehicles, IoT devices, etc.) and to reduce the load on required data center resources.
Sky 31 Convention
300, Olympic-ro, Songpa-gu, Seoul 05551, Republic of Korea https://www.sky31convention.com/main/skyMain.do
9:00 am to 9:15 am
9:15 am to 9:25 am
Session Moderator: Aruna KOLLURU, Chief Technologist, AI, Asia Pacific, Dell Technologies
9:25 am to 10:15 am
Keynote by Kyuwoong Hwang from Qualcomm AI Research, Korea
10:15 am to 11:00 am
Solutions and Applications
How to enable seamless TinyML development and deployment
Georgios FLAMIS, Sr. Staff System Design Engineer, Renesas
The widespread adoption of Tiny Machine Learning (TinyML) in resource-constrained edge devices necessitates the development of a scalable MCU/MPU line-up and an efficient ecosystem and development tools tailored to address the unique challenges of deploying machine learning models in resource-limited environments. In this work, we present the scalable Renesas MCU/MPU lineup with a dedicated ecosystem and development tool framework designed to streamline the development, deployment, and management of TinyML applications.
The proposed product families and their ecosystem comprise a seamless integration of hardware, software, and cloud components, catering to the specific needs of TinyML. At the hardware level, we introduce a versatile set of devices for TinyML, which developers can adapt to various edge devices, enabling efficient and low-power inference.
In tandem with the hardware, we present a comprehensive development tool framework explicitly tailored for TinyML development, which ensures that the trained models are lightweight and suitable for deployment on edge devices with limited memory and processing capabilities. The framework thereby also targets the 16-bit devices from Renesas.
To demonstrate the efficiency of our ecosystem and development tool framework, we present several real-world use cases on Renesas edge devices: VUI (voice command recognition) on an Arm Cortex-M23 device running at 48 MHz, anomaly detection and predictive maintenance on an Arm Cortex-M33 device, wild animal detection using the Dynamic Reconfigurable Processor – AI (DRP-AI), and a person access system running on a single-core Arm Cortex-M4 device at just 120 MHz. Benchmarking results showcase superior performance, with significant reductions in memory footprint, energy consumption, and inference latency compared to traditional TinyML implementations.
Furthermore, the extensibility and modularity of our application examples within our ecosystem allow easy integration with emerging TinyML algorithms in different multimodal applications and systems.
In conclusion, our device family, together with our ecosystem and development tool framework, provides a holistic solution for TinyML deployment, addressing the challenges of resource constraints while enabling efficient and intelligent applications on edge devices. This work contributes to advancing the field of TinyML, driving the proliferation of edge intelligence, and enabling novel and innovative use cases across various industries.
Towards Efficient Neural Rendering
Sungjoo YOO, Director, Neural Processing Research Center (NPRC)
Neural rendering, aka NeRF (neural radiance field), is expected to play a crucial role in AR/VR devices like Vision Pro.
The original NeRF method demonstrated the feasibility of novel view synthesis only using multi-view images and their camera poses. However, it also presented new challenges, especially, a computational inefficiency problem that needs to be resolved in order to offer users real-time photorealistic experiences.
In only a few years since the introduction of NeRF, significant (>1000x) improvements in computational efficiency have been achieved thanks to tremendous efforts by numerous researchers.
In this talk, we will explain how such a drastic change was possible, mostly by pre-computing on voxel grids or exploiting existing optimized graphics pipelines, and what new challenges lie ahead on the way to the true realism needed to make AR/VR devices everyday necessities.
11:00 am to 11:30 am
Break and Networking
11:30 am to 12:15 pm
Algorithms and Software
LLM quantization by Prof. Choi
Benchmarking AI compiler for the TinyML market
Peter CHANG, Business Development Manager, Skymizer
As machine learning continues to expand its influence across various domains, AI developers are increasingly focusing on deploying AI models on tiny devices like MCUs. These devices present unique challenges and limitations, such as a constrained energy budget. Skymizer, a company specializing in AI system software, has recognized the potential of AI in such small-scale devices. As a crucial first step, Skymizer submitted benchmark results to MLPerf Tiny v1.1, whose results were publicly announced in June 2023. The submission used two distinct MCUs: one from Skymizer’s partner Nuvoton and another from STMicroelectronics. The primary aim is to showcase the capabilities of Skymizer’s AI compiler, ONNC, when applied to the TinyML market. The outcomes underscore ONNC’s advantages in terms of performance, energy efficiency, and memory utilization. This presentation will begin by introducing the MLPerf Tiny benchmark and describing our firsthand experience with the submission process. Subsequently, we will explore the optimization of AI models for hardware from the perspective of a compiler. Lastly, we will share our thoughts on how future benchmark designs could accommodate the emerging TinyML market.
12:15 pm to 12:45 pm
Posters & Demos
12:45 pm to 1:45 pm
1:45 pm to 2:30 pm
Keynote by Saeed Afshar from Western Sydney University
Optimized Event-Driven Spiking Neural Network for Low-Power Neuromorphic Platform
We present modular event-driven spiking neural architectures that use local synaptic and threshold adaptation rules to perform transformations between arbitrary spatio-temporal spike patterns. The architectures represent a highly abstracted model of existing Spiking Neural Network (SNN) architectures. We showcase Optimized Deep Event-driven Spiking neural network Architecture (ODESA) that can simultaneously learn hierarchical spatio-temporal features at multiple arbitrary time scales. ODESA performs online learning without the use of error back-propagation or the calculation of gradients. Using simple local adaptive selection thresholds at each node, the network rapidly learns to appropriately allocate its neuronal resources at each layer for any given problem without using an error measure. We also showcase an efficient hardware implementation of ODESA on FPGA. The computations performed by the architectures are event-driven and the entire communication between layers is binary event-based. We provide a potential blueprint for future neuromorphic architectures that can be run asynchronously and enable low-power always-on learning systems.
2:30 pm to 3:15 pm
Algorithms and Software
Target Classification on the Edge using mmWave Radar: A Novel Algorithm and Its Real-Time Implementation on TI’s IWRL6432
Muhammet Emin YANIK, Radar R&D Systems Engineer, Texas Instruments
The past few years have seen a significant increase in sensing applications in the mmWave spectrum, primarily due to the advances in mmWave technology that have reduced the size, cost, and power consumption. While mmWave radars have been traditionally employed in target detection and tracking, there has been a trend toward using radar signals for target classification. Many of these classification algorithms use machine learning (ML) to leverage the radar’s high sensitivity to motion. Examples include motion classification for reducing false alarms in indoor and outdoor surveillance, fall detection for elderly care, and more. In our study, we developed and implemented a mmWave radar-based multiple-target classification algorithm for cluttered real-world environments, emphasizing the limited memory and computing budget of edge devices. The following figure depicts the high-level processing flow of the overall algorithmic chain running on TI’s low-cost and ultra-low-power integrated radar-on-chip device, IWRL6432, in real-time.
The analog/digital front-end of the IWRL6432 mmWave radar first generates the radar cube (i.e., ADC raw data) in all range (after in-place range-FFTs), chirp, and antenna (two transmitters, three receivers) domains. The detection layer then processes the ADC data using traditional FFT processing (Doppler-FFT and angle-FFT), followed by a peak detection step to generate a point cloud. The point cloud is then clustered into objects using a multiple-target tracker algorithm (based on an Extended Kalman Filter). Although we develop and implement this algorithm on IWRL6432, it is important to note that it is generic enough to be deployed on other TI device variants.
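As a rough illustration of the Doppler processing and peak detection steps described above, the sketch below runs a DFT across the chirps of one range bin and picks the strongest Doppler bin. It is not TI's implementation: the function names, the synthetic signal, the chirp count, and the simple argmax peak picker are all assumptions made for illustration, and a naive DFT stands in for the accelerated FFT.

```python
import cmath
import math

def doppler_dft(chirp_samples):
    """Naive DFT across the chirp (slow-time) axis for one range bin."""
    n = len(chirp_samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * m / n)
            for m, x in enumerate(chirp_samples))
        for k in range(n)
    ]

def peak_bin(spectrum):
    """Index of the strongest Doppler bin (stand-in for a real peak detector)."""
    magnitudes = [abs(v) for v in spectrum]
    return max(range(len(magnitudes)), key=magnitudes.__getitem__)

# Synthetic target: the phase advances by a constant step from chirp to
# chirp, which is how radial motion appears after the range-FFT.
NUM_CHIRPS = 16   # hypothetical number of chirps per frame
TRUE_BIN = 3      # hypothetical Doppler bin of the target
samples = [cmath.exp(2j * math.pi * TRUE_BIN * m / NUM_CHIRPS)
           for m in range(NUM_CHIRPS)]

spectrum = doppler_dft(samples)
detected = peak_bin(spectrum)
```

A practical detector would typically apply a window and a thresholding scheme rather than a bare argmax; the sketch only shows where the Doppler bin of a moving target comes from.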
The main challenge addressed in our study is to generate µ-Doppler spectrograms from multiple targets concurrently to be used in the following modules. To achieve this goal, we developed and implemented a solution to create µ-Doppler spectrograms using multiple target tracker layer integration. The centroid for each tracked object is mapped to the corresponding location in the radar cube, and a µ-Doppler spectrum is extracted from the neighborhood of this centroid (which is repeated for each track). The µ-Doppler spectrum is concatenated across multiple consecutive frames to generate a 2D µ-Doppler vs. time spectrogram. Compared to prior work, which typically assumes single-target scenes and mostly lacks the elements needed for a reliable solution to real-world multiple-target problems, our work proposes a novel approach and implements it efficiently on the edge to classify multiple target objects concurrently.
Although most prior art processes the generated 2D µ-Doppler vs. time spectrogram through a 2-D CNN model for classification, we further process the 2D spectrogram to create hand-crafted features (a small set of parameters that capture the essence of the image), which reduces the computational complexity and makes multiple target classification possible on edge. In this approach, a single frame yields a single value for each feature. Therefore, the sequence of frames generates a corresponding 1D time series data for each feature. In this study, six features are extracted from µ-Doppler spectrograms for use by the classifier. We then build and train a 1D-CNN model to classify the target objects (human or non-human) given the extracted features as 1D time series data. Since a 2D-CNN model is memory and MIPS-intensive, too complex to support multiple tracks on low-power edge processors, and needs massive data for training, the proposed approach results in lower complexity.
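The step from a spectrogram to one 1D time series per feature can be sketched as follows. The abstract does not name the six features, so the two used here (a power-weighted centroid and spread) are hypothetical stand-ins, as are the function names and the toy data.

```python
import math

def frame_features(spectrum):
    """Reduce one frame's micro-Doppler spectrum (power per Doppler bin)
    to a few scalars. Both features are hypothetical examples.
    Assumes the frame has non-zero total power."""
    total = sum(spectrum)
    bins = range(len(spectrum))
    centroid = sum(i * p for i, p in zip(bins, spectrum)) / total
    spread = math.sqrt(
        sum(p * (i - centroid) ** 2 for i, p in zip(bins, spectrum)) / total
    )
    return [centroid, spread]

def feature_series(spectrogram):
    """Map a list of per-frame spectra to one 1D time series per feature."""
    per_frame = [frame_features(s) for s in spectrogram]
    # Transpose: rows become features, columns become frames.
    return [list(col) for col in zip(*per_frame)]

# Toy spectrogram: 4 frames x 5 Doppler bins.
spectrogram = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
]
series = feature_series(spectrogram)  # series[f][t]: feature f at frame t
```

Each frame contributes exactly one value per feature, so a window of N frames becomes an N-sample series per feature, which is far smaller than the 2D image it summarizes.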
In our 1D-CNN network architecture, the input size is configured as the total number of features. Three blocks of 1D convolution, ReLU, and layer normalization layers are used. To reduce the output of the convolutional layers to a single vector, a 1D global average pooling layer is added. To map the output to a vector of probabilities, a fully connected layer is used with an output size of two (matching the number of classes), followed by a softmax layer and a classification layer. In the model training step, a total of 125816 frames of data were captured at 10 Hz (about 3.5 hours in total) from different human and non-human targets. In the data capture campaign, 310 scenarios were created in distinct environments with around 20 different people and numerous non-human targets (e.g., fan, tree, dog, plant, drone). The data was captured with synchronized video to assist in labeling. In total, 55079 observations (25448 human, 29631 non-human) are generated when the time window size is configured to 30 frames (3 s at 10 Hz).
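The layer sequence described above can be sketched as a forward pass in plain Python. Only the six input features, the 30-frame window, the three convolution blocks, and the two output classes come from the abstract. The filter count, kernel size, zero padding, the normalization axis, and the random untrained weights are assumptions chosen to keep the example small, so this is not the authors' model.

```python
import math
import random

def conv1d(x, weights, bias):
    """x: [in_ch][T]; weights: [out_ch][in_ch][K]; zero 'same' padding."""
    pad = len(weights[0][0]) // 2
    t_len = len(x[0])
    out = []
    for w_out, b in zip(weights, bias):
        row = []
        for t in range(t_len):
            acc = b
            for ch, w_ch in enumerate(w_out):
                for j, w in enumerate(w_ch):
                    idx = t + j - pad
                    if 0 <= idx < t_len:
                        acc += w * x[ch][idx]
            row.append(acc)
        out.append(row)
    return out

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def layer_norm(x, eps=1e-5):
    """Normalize each time step across channels (axis is an assumption;
    no learned scale or shift)."""
    n_ch, t_len = len(x), len(x[0])
    out = [[0.0] * t_len for _ in range(n_ch)]
    for t in range(t_len):
        col = [x[c][t] for c in range(n_ch)]
        mean = sum(col) / n_ch
        var = sum((v - mean) ** 2 for v in col) / n_ch
        for c in range(n_ch):
            out[c][t] = (col[c] - mean) / math.sqrt(var + eps)
    return out

def global_avg_pool(x):
    return [sum(row) / len(row) for row in x]

def dense(v, weights, bias):
    return [sum(w * a for w, a in zip(w_row, v)) + b
            for w_row, b in zip(weights, bias)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

rng = random.Random(0)
N_FEATURES, WINDOW, N_CLASSES = 6, 30, 2   # from the abstract
N_FILTERS, KERNEL = 8, 3                   # hypothetical

def rand_conv(out_ch, in_ch, k):
    w = [[[rng.uniform(-0.5, 0.5) for _ in range(k)]
          for _ in range(in_ch)] for _ in range(out_ch)]
    return w, [0.0] * out_ch

blocks, in_ch = [], N_FEATURES
for _ in range(3):
    blocks.append(rand_conv(N_FILTERS, in_ch, KERNEL))
    in_ch = N_FILTERS
fc_w = [[rng.uniform(-0.5, 0.5) for _ in range(N_FILTERS)]
        for _ in range(N_CLASSES)]
fc_b = [0.0] * N_CLASSES

def forward(x):
    for w, b in blocks:
        x = layer_norm(relu(conv1d(x, w, b)))
    return softmax(dense(global_avg_pool(x), fc_w, fc_b))

# One observation: 6 feature series of 30 frames each (random toy data).
window = [[rng.gauss(0.0, 1.0) for _ in range(WINDOW)]
          for _ in range(N_FEATURES)]
probs = forward(window)   # two class probabilities
```

With these hypothetical sizes the network has only a few hundred weights, which gives a feel for why a 1D model over a handful of feature series is much cheaper than a 2D CNN over spectrogram images.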
The developed algorithm and the pre-trained 1D-CNN model are then implemented on IWRL6432, ideally suited for implementing the above applications. IWRL6432 has an integrated RF front-end and a Hardware Accelerator (HWA) running at 80MHz optimized for radar signal processing. The on-chip CPU, Arm® Cortex®-M4F (@160MHz), provides sufficient compute capability for both tracking and classification. The device is designed with various low-power modes to support applications where minimizing power consumption is critical. In our real-time implementation, the heavy detection layer processing is handled by the HWA with minimum CPU intervention. The tracker, feature extraction, and classifier blocks run solely on the CPU. The µ-Doppler generation step runs on the HWA with a minimum amount of assistance from the CPU to achieve an efficient and optimized implementation on IWRL6432. Although we chose the extracted 1D time sequence features followed by a 1D-CNN architecture as the lightest-weight classifier, our real-time implementation also supports streaming the 2D µ-Doppler vs. time images per track over UART to enable more complex classifier algorithms on external MCUs, if available.
The performance of the classifier is characterized using two approaches. The first approach uses all the available data combined, shuffled, and split as 40% for training, 10% for validation, and 50% for testing. The second approach reserves a specific portion of the data set for testing, consisting of human and non-human subjects that were completely absent from the training set. With either approach, the overall classification error is calculated using 5-fold cross-validation. For each iteration (i.e., fold), data is reshuffled, and the accuracy is calculated with different random combinations of training, validation, and test data. The test accuracy averaged over the iterations is then reported as the final accuracy metric. We show that the proposed 1D-CNN classifier can achieve about 97% average accuracy in challenging test scenarios with a 3 s window. In the real-time implementation of the algorithm on IWRL6432, the entire processing chain summarized in the figure above takes only 35 ms for three tracks, and each additional track adds around 3.5 ms of processing, which is sufficient to sustain a 10 Hz frame rate.
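The timing figures above imply a simple per-frame budget, worked through below. This is back-of-the-envelope arithmetic, not a result from the abstract: it assumes the cost grows linearly with the number of tracks and that the whole 100 ms frame period is available for processing, which a real system with chirp acquisition and low-power idle time would not have.

```python
import math

FRAME_RATE_HZ = 10.0
BASE_MS = 35.0             # reported cost for three tracks
BASE_TRACKS = 3
PER_EXTRA_TRACK_MS = 3.5   # reported cost of each additional track

def processing_ms(num_tracks):
    """Estimated processing time per frame. The abstract gives no figure
    below three tracks, so fewer tracks are charged the base cost."""
    extra = max(0, num_tracks - BASE_TRACKS)
    return BASE_MS + extra * PER_EXTRA_TRACK_MS

frame_budget_ms = 1000.0 / FRAME_RATE_HZ
max_tracks = BASE_TRACKS + math.floor(
    (frame_budget_ms - BASE_MS) / PER_EXTRA_TRACK_MS
)
```

Under these assumptions the upper bound is 21 tracks per frame; the practical limit would be lower once acquisition time and power-saving idle periods are accounted for.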
PyNetsPresso and LaunchX: An Integrated Toolchain for Hardware-Aware AI Model Optimization and Benchmarking
Jongwon BAEK, Product Owner, Nota AI
Developing optimized AI models for specific hardware configurations poses challenges due to the iterative nature of experimentation and analysis required in the process. To address this issue, we present “PyNetsPresso” – a Python-based interface for “NetsPresso,” an AI model optimization tool. PyNetsPresso streamlines the development pipeline, enabling easy AI model training, lightweighting, model conversion for hardware-aware runtimes, and efficient benchmarking.
AI engineers often encounter exhausting and complex iterations involving training, lightweighting, model transformation, and benchmarking, making the AI model development process tough. PyNetsPresso offers a solution by automating these steps into a condensed pipeline, reducing repetitive efforts, and simplifying analysis of trial results. PyNetsPresso incorporates hardware-aware optimization techniques, including pruning, filter decomposition, and quantization, which minimize performance degradation while accommodating hardware constraints and runtime considerations. Moreover, it facilitates easy benchmarking across a variety of hardware setups, allowing AI engineers to make informed decisions. In conjunction with PyNetsPresso, we introduce “LaunchX” – a user-friendly GUI tool that empowers AI model developers to visualize the impact of runtime and hardware configurations on speed and memory occupancy without necessitating full pipeline execution. This GUI tool is especially beneficial for users with pre-trained models, providing quick insights into model performance on various runtime and hardware combinations.
In summary, the PyNetsPresso and LaunchX integrated toolchain significantly streamlines the AI model optimization process for hardware-aware deployment, empowering AI engineers to efficiently create customized and high-performing models for diverse hardware setups.
3:15 pm to 3:35 pm
Accelerate Baidu PaddlePaddle Edge AI Software Development with Arm Virtual Hardware
Liliya WU, Ecosystem Specialist, Arm
PaddlePaddle, the first independently developed deep learning platform in China, has been open source and available to professional communities since 2016. It has been adopted by a wide range of sectors including manufacturing, agriculture, and enterprise services, while serving more than 5.35 million developers and 200,000 companies and generating 670,000 models. With such advantages, PaddlePaddle has helped an increasing number of partners commercialize AI.
PaddlePaddle is extending from cloud to edge with models running on endpoint devices. These devices vary significantly in memory size, hardware and software architecture, and runtime. As a result, AI algorithm, model, and app developers are challenged to build, optimize, test, and deploy advanced AI models across this diversity of devices with ease and confidence. Baidu PaddlePaddle and Arm have been collaborating to accelerate endpoint AI design and deployment at scale, leveraging Apache TVM as well as Arm Virtual Hardware (AVH).
In this paper, we will outline practical challenges in edge AI and show how PaddlePaddle on MCUs addresses them by leveraging code generation technology together with AVH, which serves as a scalable test platform for developers to verify and validate PaddlePaddle endpoint AI applications throughout the ML model and software design cycle. Through deeper integration with the local cloud service provider Baidu Cloud, various use cases have been developed and verified on Arm Virtual Hardware.
3:35 pm to 4:00 pm
Posters & Demos
4:00 pm to 4:30 pm
Break and Networking
4:30 pm to 5:15 pm
Oculi Enables 600x Reduction in Latency-Energy Factor for Visual Edge Applications by Moving Sensors from Imaging to Vision
Charbel RIZK, Founder and CEO, Oculi
In recent times, notable progress has been achieved in the field of artificial intelligence, particularly in the extensive use of deep neural networks. These advancements have significantly enhanced the reliability of face detection, eye tracking, hand tracking, and people detection. However, performing these tasks still demands substantial computational power and memory resources, making it a resource-intensive endeavor. Consequently, power consumption and latency pose significant challenges for many systems operating in always-on, edge applications. The OCULI SPU represents an intelligent, programmable vision sensor capable of being dynamically configured to output select data in various modes. These modes include images or video, polarity events, smart events, and actionable information. Moreover, the SPU allows real-time programmability of spatial and temporal resolution, as well as dynamic range and bit depth. By enabling continuous optimization, computer/machine vision solutions deploying the OCULI SPU, in lieu of imaging sensors, can reduce the latency-energy factor by more than 600x at a fraction of the cost. It’s important to note that the smart events and actionable information outputs and modes are distinctive features unique to Oculi. The OCULI SPU is ideal for TinyML vision applications. To showcase these capabilities, we are participating in the tinyML competition entitled “tinyML Hackathon 2023: Pedestrian Detection”. Our initial results demonstrate an always-on solution (24/7) with a latency of less than 4 ms that consumes only 3 Wh in total over a whole year, the equivalent of a single AA battery.
Because the OCULI SPU is fully programmable, the solution can be dynamically optimized between latency and power consumption. It will enable the first truly wireless battery-operated always-on vision products in the market. This presentation will provide an overview of Oculi’s novel vision architecture for edge applications. It will also include key latency and energy results for multiple use cases of interest to the TinyML community such as presence/people/pedestrian/object, face, hand, and eye detection. The results will also include a comparison with alternate or conventional solutions.
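To put the yearly energy figure above in more familiar terms, the short calculation below converts it to an average power draw and compares it with a rough AA battery capacity. The battery voltage and capacity are assumed nominal values for an alkaline cell, not numbers from the abstract.

```python
HOURS_PER_YEAR = 365 * 24      # 8760 h, ignoring leap years
ENERGY_WH_PER_YEAR = 3.0       # figure quoted in the abstract

# Average power implied by spending 3 Wh over one year of 24/7 operation.
avg_power_w = ENERGY_WH_PER_YEAR / HOURS_PER_YEAR
avg_power_uw = avg_power_w * 1e6

# Assumed nominal AA alkaline cell: 1.5 V and 2.0 Ah (hypothetical values).
AA_VOLTAGE_V = 1.5
AA_CAPACITY_AH = 2.0
aa_energy_wh = AA_VOLTAGE_V * AA_CAPACITY_AH
```

The implied average draw is a few hundred microwatts, and the assumed cell holds about 3 Wh, which is consistent with the single-battery comparison.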
All Analog Compute for Ultra-Low Power Neural Network Processing
Roger LEVINSON, CEO, Blumind
Blumind has developed an all-analog neural network processor capable of realizing nW consumption on real-world problems. Leveraging standard CMOS with no special masking steps, Blumind has combined the best attributes from neuromorphic compute, precision analog processing, device physics for in-memory compute, and data science to create the AMPL™ solution. In this presentation we will cover the fundamentals of the Blumind architecture, including the unique approach to leveraging standard CMOS devices for coefficient storage and synaptic computation. We will provide an example audio solution in which the RNN achieves multiple KWD in less than 100 nW leveraging an all-analog feature extraction implementation. We will present a CNN vision solution based upon the AMPL™ architecture for always-on object detection, such as human presence detection, which consumes less than 1 µJ per inference. The technical and market benefits of direct connection to analog sensors and direct processing of analog signals will be outlined and the significant 20X power advantage of such a solution in the audio application will be described. Blumind has demonstrated the solution in silicon and is currently building its first commercial product and open to custom collaboration and partnership.
5:15 pm to 5:35 pm
Solutions and Applications
Brainchip’s Specialized Akida Architecture for Low Power Edge Inference with HW Spatiotemporal Acceleration and On-chip Learning
Nandan NAYAMPALLY, CMO, Brainchip
Brainchip’s Innovative Akida v2.0 Architecture
- Event-based, neuromorphic processing
- Extremely efficient computation only processes and communicates on events resulting in minimal latency and power consumption. Hardware accelerates event domain convolutions, temporal convolutions, and vision transformers while consuming very low power for battery-operated devices.
- At-memory compute
- Low-latency and low-power computation via parallel execution of events propagating through a mesh-connected array of Neural Processing Units (NPUs) each with a dedicated SRAM memory.
- Quantized parameters and activations
- Supports 8, 4, 2, 1-bit parameters and activations on layer-by-layer basis for model implementations tailored for optimum accuracy versus size, power, and latency.
- Accelerates Traditional AI models
- The MetaTF software stack quantizes and converts models in Keras/TensorFlow and Pytorch/ONNX to Akida event-driven models.
- Fully-digital Implementation
- Portable across process technology with standard synthesis design flow.
- Plus Novel Architectural Features below …
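The effect of choosing 8, 4, 2, or 1-bit parameters can be illustrated with a generic uniform symmetric quantizer. This is a textbook scheme shown only to make the accuracy versus size trade-off concrete; it is not Brainchip's MetaTF quantization, and the function names, the 1-bit sign rule, and the sample weights are assumptions.

```python
def quantize(values, bits):
    """Uniform symmetric quantization of floats to `bits` bits.
    Returns (integer codes, scale). 1-bit is treated as sign binarization."""
    if bits == 1:
        # Two levels, -1 and +1, scaled by the mean magnitude.
        scale = sum(abs(v) for v in values) / len(values) or 1.0
        return [1 if v >= 0 else -1 for v in values], scale
    q_max = 2 ** (bits - 1) - 1   # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / q_max
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.50, -0.25, 0.10, -1.00]   # toy layer parameters
errors = {}
for bits in (8, 4, 2, 1):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    # Worst-case reconstruction error at this bit width.
    errors[bits] = max(abs(w - r) for w, r in zip(weights, restored))
```

In a layer-by-layer setup, each layer would get its own bit width and scale, so that sensitive layers keep more bits while others are compressed further.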
Select Architectural Highlights
- Temporal Event-based Neural Networks (TENNs)
- Reduces the number of parameters and operations by an order of magnitude or more for spatiotemporal computations with 2D streaming input such as video and 1D streaming input such as raw audio. Easily trained with backpropagation like a CNN and runs inference like an RNN. Extracts 2D spatial and 1D temporal kernels for 3D input and 1D temporal kernels for 1D input.
- Vision Transformers (ViTs)
- Multi-head attention included in hardware implementation of Transformer Encoder block with configurations from 2 to 12 nodes. A two-node configuration at 800 MHz executes TinyViT at 30 fps (224x224x3).
- On-device learning
- Brainchip’s patented algorithm for homeostatic Spike Time Dependent Plasticity (STDP) enables on-device learning. AI applications can implement incremental learning so that models can adapt to changes in the field. On-device learning can add new classes to a model while leveraging features learned in its original, off-line training.
Model Development Methodologies
- No-code Model Development and Deployment with Edge Impulse
- Users with limited AI expertise can develop and deploy optimized Akida models.
- Build-Your-Own-Model with Brainchip’s MetaTF
- Experienced AI users can build and optimize custom Akida models.
- Video Benchmarks
- Audio and Medical Vital Signs Benchmarks
- Image Benchmarks
5:35 pm to 5:45 pm
Schedule subject to change without notice.
LG Electronics CTO AI Lab
Chetan SINGH THAKUR
Indian Institute of Science (IISc)
Qualcomm Research, USA
Morning Keynote Speaker
Qualcomm AI Research, Korea
Muhammet Emin YANIK
Neural Processing Research Center (NPRC)