October 10 - 12, 2022

tinyML EMEA Innovation Forum 2022

The tinyML EMEA Innovation Forum brings together key industry leaders, technical experts, and researchers from the Europe, Middle East, and Africa (EMEA) region who are innovating with machine learning and artificial intelligence on ultra-low-power devices.

Tiny machine learning combines innovation across a deep technological stack ranging from dataset collection and ML application design, through innovative algorithms and system-level software, down to hardware and novel sensor technology. As a cross-layer technology, achieving good results in tinyML requires carefully tuning the interaction between the various layers, and designing systems with a vertically integrated approach. In the EMEA region many startups, established companies, universities, and research labs are investing substantial time and effort in developing novel advanced solutions for tinyML. The tinyML EMEA Innovation Forum aims to connect these efforts, find strength in unity, and cooperate to accelerate tinyML innovation across the region.

Venue

Grand Resort

Limassol, Cyprus

Room Booking

Contact us

Rosina Haberl

Phone

Mail

KEYNOTE SPEAKERS

Massimo Banzi, Co-founder Arduino
Alberto Sangiovanni Vincentelli, Co-founder Cadence

PROGRAM PREVIEW

Innovation Showcases

To kick off the EMEA Innovation Forum 2022 on Monday, October 10, we will focus on what is possible with tinyML today and explore some of the most innovative development tools and techniques available in the market to build working tinyML solutions.

Join us to experience eight first-hand practical and interactive showcases and demonstrations by industry experts, where you will have a chance to engage with the speakers and dive deep into the technology. Following each set of demonstrations, we will hold an interactive roundtable with the presenters to explore trends, challenges, and opportunities in the field.

Full Stack Solutions

The overall objective of tinyML is to see autonomous devices able to react and interact with their environment. This can only happen when machine learning models are optimized for the device hardware and software and potentially linked to remote systems to provide smart functions, alerts, and user adaptation. The full-stack applications session will cover uses of tinyML in such systems comprising hardware, software, sensing, and backend operations to deliver compelling user experiences.

Algorithms, Software & Tools

Algorithms and software designed for tiny devices are a crucial part in the enablement of tinyML technology. Deployment at scale is the next challenge in making tinyML solutions ubiquitous in our lives.

Bringing highly efficient inference models to real devices requires innovation in key areas of network architecture, toolchains, and embedded SW to optimize performance and energy.

The Algorithms, SW & tools session will explore some of the recent advancements in the field while uncovering some of the challenges we are facing as an industry.

tinyML Ecosystem Development Panel

Towards a thriving TinyML Ecosystem Strategy and Roadmap for EMEA.

This session will bring together a panel drawn from TinyML stakeholders including industry players, academics, and policy experts to discuss strategies needed for a thriving tinyML ecosystem. In particular, we will explore policy directions from the AU and EU and strategic partnerships needed to promote innovation in TinyML.

Hardware & Sensors

This session will cover innovation and advances within tinyML HW and sensors.

Improvement of tinyML applications is dependent on innovation in both the tinyML HW as well as in the sensor technology. For many applications, the sensor and processing will be highly integrated and make use of new innovative technologies to push the boundaries for power consumption and the necessary computing, both in terms of memory technology and arithmetic design, including the use of analog. New sensors and new hardware push the need for innovation in tooling, for example for novel ways to quantize and prune networks.

On Device Learning

To date, the tinyML community has been successful in creating outstanding initiatives which established on-device Machine Learning inference at an industry level as an essential technology.

Looking forward to the ML technology evolution, experts are increasingly asking how to let tiny devices adapt to the variability of the environments in which on-device inference is deployed, thereby compensating for concept and data drifts.

To follow up with that need, the tinyML foundation created a focus working group on On Device Learning (ODL), whose goal is to make edge devices smarter, more efficient, and self-adaptive by observing changes in the collected data, potentially in collaboration with other deployed devices.

As a further contribution, the tinyML EMEA 2022 Innovation Forum will include an ODL session with presentations, and potentially demos, at the upcoming in-person event in Cyprus, where academic and industry experts will share their current research and solutions for ODL.

Schedule

Speakers

Committee

Downloads

News

August 02, 2022

tinyML EMEA Innovation Forum 2022 Sponsorship Opportunities

The tinyML EMEA Innovation Forum 2022 will continue the tradition of high-quality state-of-the-art presentations. Find out more about sponsoring and supporting the tinyML Foundation.

August 02, 2022

EMEA 2022 Venue Booking

The Grand Resort is a 5-Star facility in Limassol, Cyprus.

Schedule

9:30 am to 9:45 am

Welcome

9:45 am to 10:45 am

Innovation Showcase Session

In this session, we will have eight slots for demonstrations of vendor-related or independent tinyML tools. The talks will be hands-on, interactive demonstrations by the speaker, showing a real example of how the demonstrated tool helps the TinyML product developer.

Following each set of talks, we will hold a Roundtable discussion with the presenters. This will ensure an interactive, engaging first day that will help attendees deepen their understanding of what is possible with tinyML.

Session Moderator: Alessandro GRANDE, Head of Product, Edge Impulse

Tuning for tinyML

Felix Johnny THOMASMATHIBALAN, Staff Engineer, Arm

Abstract (English)

Welcome to a walkthrough of an end-to-end (tiny)ML workflow showcased using Google Colab. The session focuses on the design aspects of a tinyML system and how they relate to different stages of the ML workflow. Ways to fine-tune the flow to get the best out of the tinyML system in terms of speed and size will be discussed. A SparkFun MicroMod system is used along with TensorFlow Lite for Microcontrollers and CMSIS-NN, an optimized library for Arm Cortex-M CPUs.

Boost platform performance with Apache TVM by enabling custom platform capabilities: Qualcomm Adreno Texture support as an example

Iliya SLAVUTIN, CTO and Co-founder, Deelvin Solutions

Abstract (English)

Apache TVM is a modern and highly modular ML compiler, which allows HW vendors and other interested parties to re-use a significant part of the stack and algorithms developed by the community, while focusing on enabling their own platform-specific features and capabilities.

This session will walk you through the process of enabling custom features with Apache TVM by using recently enabled Adreno Texture support as an illustrative example.

You will learn the key architecture blocks of Apache TVM, and understand which modules should be expanded and how the remaining part of the stack is re-used to combine the algorithmic power of Apache TVM with platform capabilities to bring a significant performance boost.

10:45 am to 11:15 am

Coffee break & networking

11:15 am to 12:15 pm

Innovation Showcase Part II

Session Moderator: Alessandro GRANDE, Head of Product, Edge Impulse

GAPflow – From TFLite and ONNX to highly optimised C code running on ultra low power GAP processors

Marco FARISELLI, Embedded ML Engineer, Greenwaves Technologies

Rockpool: a python package for design and training of light and ultra low-power edge ML applications

Mina KHOEI, Senior Algorithm and ML applications engineer, SynSense AG

Abstract (English)

In this session we introduce Rockpool, an open-source python library for designing and training low power, low latency spiking neural network (SNN) applications for edge computing.

Rockpool has been co-designed and developed with Xylo, SynSense’s ultra low-power chip for low-dimensional signal processing. Therefore, the design and training of SNNs can be constrained to the architecture requirements and memory and operation bandwidth of the edge device. Certain constraints like arrangements of layers and synapses, max spike number per neuron, and quantised parameter ranges can be imposed during design and training.

Rockpool is a deep-learning library specifically for SNNs at the edge. It adopts conventions from industry deep learning libraries, and makes use of several automatic differentiation pipelines to enable end-to-end training of SNNs, by incorporating surrogate gradients in gradient descent and backpropagation frameworks with torch and jax backends.

It offers a user-friendly framework to train the rich parameter space of SNNs while applying parameter constraints during training. In addition to linear weights, these comprise time constants of synapses and neurons and firing thresholds of neurons. These parameters expand the expressivity of an SNN over a standard ANN, enabling SNNs to perform richer temporal signal processing and analysis.
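To make the idea of surrogate gradients concrete, here is a minimal sketch in plain Python. It does not use Rockpool's API: the function names, the subtractive reset, and the fast-sigmoid surrogate are all assumptions chosen for illustration. The forward pass uses a hard threshold, which has no useful derivative, so training substitutes a smooth function of the membrane potential in the backward pass. The time constant and threshold are the kind of trainable parameters the abstract mentions.

```python
import math

def lif_forward(inputs, tau_mem=20.0, threshold=1.0, dt=1.0):
    """Simulate one leaky integrate-and-fire neuron (illustrative, not Rockpool code).

    Returns (spikes, trace), where trace holds the membrane potential
    at each step before any reset is applied.
    """
    decay = math.exp(-dt / tau_mem)  # membrane leak per time step
    v = 0.0
    spikes, trace = [], []
    for x in inputs:
        v = decay * v + x                 # leaky integration of the input
        trace.append(v)
        s = 1 if v >= threshold else 0    # hard threshold: not differentiable
        spikes.append(s)
        if s:
            v -= threshold                # subtractive reset (an assumption)
    return spikes, trace

def surrogate_grad(v, threshold=1.0, slope=5.0):
    """Smooth stand-in for d(spike)/d(v): derivative shape of a fast sigmoid."""
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2

spikes, trace = lif_forward([0.6, 0.6, 0.0, 0.0])
grads = [surrogate_grad(v) for v in trace]  # largest where v is near threshold
```

The surrogate peaks at the threshold and decays on both sides, so parameters that move the membrane potential close to firing receive the strongest learning signal.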

12:15 pm to 12:45 pm

Roundtable

12:45 pm to 2:15 pm

Lunch & Networking

2:15 pm to 3:15 pm

Innovation Showcase Part III

Session Moderator: Martin CROOME, Vice President Marketing, GreenWaves

The benefits of streaming analytics on edge devices and microcontrollers

Johan RISCH, Lead Developer, Stream Analyze

Abstract (English)

TinyML is an emerging area with great potential to bring intelligence to all kinds of products and equipment. One of the enabling areas is tools and platforms supporting the development of models and solutions to work well on resource-constrained devices. There is currently groundbreaking innovation happening in the area where several players and startups develop tools to simplify and accelerate the development of TinyML solutions.
Stream Analyze works with a novel conceptual solution combining the benefits of streaming analytics and on-device edge AI into one unified solution. The setup enables an edge AI platform with very fast end-to-end processes.

The demo will cover the key benefits of the combination of streaming analytics and on-device edge AI and how fast and efficient processes are enabled end-to-end. The following steps of the process of developing and deploying edge AI solutions based on streaming analytics will be presented:
• Enabling real-time data streams on edge devices and microcontrollers
• Query and filtering capabilities to filter out the data streams needed
• How to develop and apply computations and ML/AI models to data streams
• How to deploy models to edge devices instantly (bypassing the need for FOTA updates)
• Orchestration capabilities to enable management of portfolios of models across large fleets of industrial products
• Real-time visualization of data streams and model results
• How the high-level scripting language used in the platform enables groups without deeper coding/data science skills to develop edge AI solutions
• Examples of use cases from large industrials and automotive companies

Reducing Model Size with Pruning and Quantization

Gregory COPPENRATH, Product Marketing Manager, MathWorks

Abstract (English)

In the pursuit of more accurate AI models, the size of the models has increased, making deployment to resource constrained microcontrollers difficult. In this demo, we will show a workflow in MATLAB to identify prunable features in a model, remove some, and then recheck model accuracy. After each automated pruning step, there is a retraining step to recover some of the lost accuracy of the model.
The network will be pruned until the accuracy significantly drops indicating that the pruning should be stopped to avoid damaging the network’s ability to accurately predict for the application. The model is then further compressed through quantization of the pruned network. The resulting network size and accuracy will be explored as well as other visualizations that show what is changed during compression.
This reduced network allows a lower-resource microcontroller to be selected and frees resources for other operations on the device.
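The prune-check-quantize loop described above can be sketched in a few lines. This is a toy illustration in plain Python, not the MATLAB workflow from the demo: the linear classifier, the `max_drop` tolerance, the sample data, and all function names are assumptions, and the retraining step between pruning rounds is left out to keep the sketch short.

```python
def accuracy(weights, samples):
    """Fraction of (features, label) pairs a linear sign-classifier gets right."""
    correct = 0
    for features, label in samples:
        score = sum(w * x for w, x in zip(weights, features))
        correct += (1 if score >= 0 else 0) == label
    return correct / len(samples)

def prune_until_drop(weights, samples, max_drop=0.05):
    """Zero the smallest-magnitude weight repeatedly while accuracy holds."""
    baseline = accuracy(weights, samples)
    pruned = list(weights)
    while True:
        alive = [i for i, w in enumerate(pruned) if w != 0.0]
        if len(alive) <= 1:
            return pruned
        candidate = list(pruned)
        candidate[min(alive, key=lambda i: abs(pruned[i]))] = 0.0
        if baseline - accuracy(candidate, samples) > max_drop:
            return pruned  # stop before the step that damages accuracy
        pruned = candidate

def quantize_int8(weights):
    """Symmetric int8 quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

# Made-up toy data: the label depends almost entirely on the first feature.
samples = [([1.0, 0.2, -0.1], 1), ([-1.0, 0.3, 0.1], 0),
           ([0.8, -0.5, 0.2], 1), ([-0.7, -0.4, -0.3], 0)]
weights = [2.0, 0.05, -0.01]

pruned = prune_until_drop(weights, samples)
q, scale = quantize_int8(pruned)
```

On this toy data the two small weights are removed without any loss of accuracy, and the surviving weight is stored as a single int8 value plus one shared scale factor.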

3:15 pm to 3:45 pm

Coffee break & networking

3:45 pm to 4:45 pm

Innovation Showcase Part IV

Session Moderator: Martin CROOME, Vice President Marketing, GreenWaves

A Low Power And High Performance Software Approach to Artificial Intelligence On-Board

Guillermo SARABIA, Researcher, Klepsydra

Abstract (English)

New generations of spacecraft are required to perform tasks with an increased level of autonomy. Space exploration, Earth Observation, space robotics, etc. are all growing fields in Space that require more sensors and more computational power to perform these missions. Furthermore, new sensors in the market produce better quality data at higher rates, while new processors can substantially increase the computational power. Therefore, near-future spacecraft will be equipped with a large number of sensors that will produce data at rates that have not been seen before in space, while at the same time, data processing power will be significantly increased. In use cases such as guidance, navigation and control, vision-based navigation has become increasingly important in a variety of space applications for enhancing autonomy and dependability.
Future missions such as Active Debris Removal will rely on novel high-performance avionics to support image processing and Artificial Intelligence algorithms with large workloads. Similar requirements come from Earth Observation applications, where on-board data processing can be critical in order to provide real-time reliable information to Earth. This new scenario of advanced Space applications, with its increase in data volume and processing power, has brought new challenges with it: low determinism, excessive power needs, data losses and large response latency.

In this demo, a novel approach to on-board artificial intelligence (AI) is presented that is based on state-of-the-art academic research on the well-known technique of data pipelining. Algorithm pipelining has seen a resurgence in high performance computing work due to its low power use and high throughput capabilities. The approach presented here provides a very sophisticated threading model that combines pipelining and parallelization techniques applied to deep neural networks (DNN), making these types of AI applications much more efficient and reliable. This new approach has been validated with several DNN models and two different computer architectures. The results show that the data processing rate and power saving of the applications increase substantially with respect to standard AI solutions.

Build a people counter with an MCU-based camera device

Louis MOREAU, Senior DevRel Engineer, Edge Impulse

Abstract (English)

In this session, we will go over the different existing image processing approaches used in tinyML, from simple and lightweight image classification techniques to more complex and compute-intensive ones such as object detection or image segmentation, typically not suitable for constrained devices.
We will then introduce a new technique that can be used to perform object detection on constrained devices: FOMO.
FOMO (Faster Objects, More Objects) is a novel open-source machine-learning algorithm that enables object detection on microcontrollers.
With this new architecture, you will be able to find the location and count objects, as well as track multiple objects in an image in real-time using up to 30x less processing power and memory than MobileNet SSD or YOLOv5.
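As a rough illustration of how locating and counting can work on a constrained device, the sketch below assumes a detector that emits one score per coarse grid cell and groups neighbouring cells above a threshold into objects. This is not FOMO's actual implementation: the grid layout, the threshold, the 4-connectivity rule, and the function name are all assumptions made for this example.

```python
def count_objects(grid, threshold=0.5):
    """Group 4-connected cells scoring at or above threshold.

    Returns one (row, col) centroid per group; the number of
    centroids is the object count.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] < threshold:
                continue
            # Flood-fill the group of active cells that contains (r, c).
            stack, cells = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and grid[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            mean_row = sum(y for y, _ in cells) / len(cells)
            mean_col = sum(x for _, x in cells) / len(cells)
            centroids.append((mean_row, mean_col))
    return centroids

# Hypothetical 3x4 score grid with two separate objects.
heatmap = [
    [0.9, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
]
people = count_objects(heatmap)
```

Working on a small grid rather than full-resolution bounding boxes keeps both the output tensor and the post-processing cheap, which is the kind of saving that matters on a microcontroller.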

4:45 pm to 5:15 pm

Roundtable

9:00 am to 9:15 am

Welcome

9:15 am to 10:00 am

Keynote

1 Million TinyML developers: lessons learned from Arduino

Massimo BANZI, Co-founder, Arduino

10:00 am to 11:00 am

Full Stack Solutions session - Part I

The overall objective of tinyML is to see autonomous devices able to react and interact with their environment. This can only happen when machine learning models are optimized for the device hardware and software, and potentially linked to remote systems to provide smart functions, alerts and user adaptation. The full-stack applications session will cover uses of tinyML in such systems comprising hardware, software, sensing and backend operations to deliver compelling user experiences.

Session Moderator: Gian Marco IODICE, Team and Tech Lead in the Machine Learning Group, Arm

Transforming sensors at the edge

Jonathan PEACE, CTO, InferSens Ltd

Abstract (English)

Until now, embedded DL at the edge has been limited to simple Neural Networks such as CNNs.

InferSens is focused on leveraging new silicon architectures to enable more sophisticated models at the edge, that are able to unlock new meaning from sensor data while operating on a nanoPower budget compatible with a multi-year battery life.

This talk presents a real-world application of such networks to help automate the manual monitoring of water systems, which is done in hundreds of millions of properties worldwide to prevent outbreaks such as Legionnaires’ disease. Our non-invasive and easy-to-deploy flow sensors infer flow and monitor temperature in real-time.

Autonomous Nano-UAVs: An Extreme Edge Computing Case

Daniele PALOSSI, Postdoctoral Researcher, IDSIA & ETH Zurich

Abstract (English)

Nano-sized autonomous unmanned aerial vehicles (UAVs), i.e., nano-UAVs, are compelling flying robots characterized by a stringent form factor and payload, such as 10 cm diameter and a few tens of grams in weight. These fundamental traits challenge the onboard sensorial and computational capabilities, allowing only for limited microcontroller-class computational units (MCUs) mapped to a sub-100 mW power envelope. At the same time, making a UAV fully autonomous means to fulfill its mission with the only resources available onboard, i.e., avoiding any external infrastructure or off-board computation.
To some extent, achieving such an ambitious goal of a fully autonomous nano-UAV can be seen as the embodiment of the extreme edge computing paradigm. The computational/memory limitations are exacerbated by the paramount need for real-time mission-critical execution of complex algorithms on a flying cyber-physical system (CPS). Enabling timely computation on a nano-UAV ultimately would lead to i. new application scenarios, otherwise prevented for bigger UAVs, ii. increased safety in the human-robot interaction, and iii. reduced cost of versatile robotic platforms.
In recent years, many researchers have pushed the state-of-the-art in the onboard intelligence of nano-UAVs by leveraging machine learning and deep learning techniques as an alternative approach to the traditional and computationally expensive, geometrical, and computer vision-based methods. This talk presents our latest research effort and achievements by delivering holistic and vertical-integrated solutions, which frame energy-efficient ultra-low power MCUs with deep neural network vision-based algorithms, quantization techniques, and data augmentation pipelines.
In this presentation, we will follow our two keystone works in the nanorobotics area, i.e., the seminal PULP-Dronet [1,2] and the recent PULP-Frontnet [3] project, as concrete examples to introduce our latest scientific contributions. In detail, we will address several fundamental research questions, such as “how to shrink the number of operations and memory footprint of convolutional neural networks (CNNs) for autonomous navigation” [4], “how to improve the generalization capabilities of tiny CNNs for human-robot interaction” [5] and “how to combine the CPS’ state with vision-based CNNs for enhancing the performance of an autonomous nano-UAV” [6]. Finally, we will support our key findings with thorough in-field evaluations of our methodologies and resulting closed-loop end-to-end robotic demonstrators.

[1] Palossi, Daniele, Antonio Loquercio, Francesco Conti, Eric Flamand, Davide Scaramuzza, and Luca Benini. “A 64-mW DNN-based visual navigation engine for autonomous nano-drones.” IEEE Internet of Things Journal 6, no. 5 (2019): 8357-8371.
[2] Palossi, Daniele, Francesco Conti, and Luca Benini. “An open source and open hardware deep learning-powered visual navigation engine for autonomous nano-UAVs.” In 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS), pp. 604-611. IEEE, 2019.
[3] Palossi, Daniele, Nicky Zimmerman, Alessio Burrello, Francesco Conti, Hanna Müller, Luca Maria Gambardella, Luca Benini, Alessandro Giusti, and Jérôme Guzzi. “Fully onboard ai-powered human-drone pose estimation on ultralow-power autonomous flying nano-uavs.” IEEE Internet of Things Journal 9, no. 3 (2021): 1913-1929.
[4] Lorenzo Lamberti, Vlad Niculescu, Michał Barciś, Lorenzo Bellone, Enrico Natalizio, Luca Benini and Daniele Palossi. “Tiny-PULP-Dronets: Squeezing Neural Networks for Faster and Lighter Inference on Multi-Tasking Autonomous Nano-Drones.” In 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), IEEE, 2022.
[5] Cereda, Elia, Marco Ferri, Dario Mantegazza, Nicky Zimmerman, Luca M. Gambardella, Jérôme Guzzi, Alessandro Giusti, and Daniele Palossi. “Improving the Generalization Capability of DNNs for Ultra-low Power Autonomous Nano-UAVs.” In 2021 17th International Conference on Distributed Computing in Sensor Systems (DCOSS), pp. 327-334. IEEE, 2021.
[6] Cereda, Elia, Stefano Bonato, Mirko Nava, Alessandro Giusti, and Daniele Palossi. “Vision-State Fusion: Improving Deep Neural Networks for Autonomous Robotics.” arXiv preprint arXiv:2206.06112 (2022).

11:00 am to 11:30 am

Coffee & Networking

11:30 am to 12:30 pm

Full Stack Solutions session - Part II

Session Moderator: Gian Marco IODICE, Team and Tech Lead in the Machine Learning Group, Arm

Full-stack neuromorphic, autonomous tiny drones

Federico PAREDES-VALLES, PhD Candidate, MAVLab, Delft University of Technology

Abstract (English)

Neuromorphic sensing and processing hold an important promise for creating autonomous tiny drones. Both promise to be lightweight and highly energy efficient, while allowing for high-speed perception and control. For tiny drones, these characteristics are essential, as these vehicles are very agile but extremely restricted in terms of size, weight and power. In my talk, I present our work on developing neuromorphic perception and control for tiny autonomous drones. I delve into the approach we followed for having spiking neural networks learn visual tasks such as optical flow estimation. Furthermore, I explain our ongoing effort to integrate these networks in the control loop of autonomously flying drones.
Schematic summary of the presentation:
– Introduction: tiny drones, specifications and constraints
– How to make tiny drones autonomous? Drawing inspiration from nature
– Neuromorphic sensing and computing
o Event cameras and spiking neural networks
o Goal: Fully neuromorphic, vision-based autonomous flight
o Work 1: Vertical landings using event-based optical flow
o Works 2 and 3: Unsupervised and self-supervised learning of event-based optical flow using spiking neural networks
o Work 4: Neuromorphic control for high-speed landings
– Conclusion

Low power, low latency multi-object tracking and classification using a fully event-driven neuromorphic pipeline

Nogay KUEPELIOGLU, Assistant Neuromorphic Machine Learning Engineer, SynSense AG

Abstract (English)

Detection, tracking and classification of objects in real-time from a video is a very important but costly problem in computer vision. Typically, this is achieved by running a window-based CNN and sliding it over the full image to obtain the areas of interest (i.e. location of the object) and then identifying it. This approach requires the image to be stationary such that the model can be run over the entire object over multiple passes. A second approach involves a one-shot, one-pass model such as YOLO, where the entire image is passed and the localization and identification are done simultaneously. While this approach is better suited to real-time systems than the former, which requires multiple passes, such models typically require a lot of memory and thus consume a lot of power. Furthermore, previous work along these lines also demonstrated that it is challenging to achieve good accuracy using spiking convolutional neural networks for such a model. The problem becomes more costly to solve due to the state-holding nature of spiking neurons and the requirement of using memory to store those states.

Locating and identifying the objects are two inherently different problems, and solving them separately might reduce the network size significantly. This is partially achieved in the field of neuromorphic engineering by the use of dynamic vision sensors, which only pass the information on the change in luminosity as events for individual pixels with corresponding timestamps rather than passing an entire frame. This eliminates the background, which is unrelated to the objects to be tracked. These events can easily be grouped into separate clusters by their spatial locations. In this work, we implement such a clustering algorithm to track the objects, preprocess the events such that each cluster has the same output size, and then pass these events to a spiking neural network for identification. The pipeline is implemented and tested for tracking of single and multiple objects, as well as identification of these objects, on the novel DynapCNN™ asynchronous spiking convolutional neural network processor.
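The clustering and fixed-size preprocessing steps can be pictured with a small sketch. It is not the authors' algorithm: the greedy centroid rule, the `radius` and `size` parameters, the `(x, y, t)` event layout, and the function names are assumptions made for illustration only.

```python
def cluster_events(events, radius=5.0):
    """Greedy spatial clustering of (x, y, t) events.

    Each event joins the first cluster whose running centroid lies
    within radius; otherwise it starts a new cluster.
    """
    clusters = []
    for x, y, t in events:
        for cl in clusters:
            n = len(cl["events"])
            cx, cy = cl["sum_x"] / n, cl["sum_y"] / n
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                cl["sum_x"] += x
                cl["sum_y"] += y
                cl["events"].append((x, y, t))
                break
        else:
            clusters.append({"sum_x": x, "sum_y": y, "events": [(x, y, t)]})
    return clusters

def to_patch(cluster, size=8):
    """Count events in a fixed size x size window centred on the centroid."""
    n = len(cluster["events"])
    cx, cy = cluster["sum_x"] / n, cluster["sum_y"] / n
    patch = [[0] * size for _ in range(size)]
    for x, y, _ in cluster["events"]:
        col = round(x - cx) + size // 2
        row = round(y - cy) + size // 2
        if 0 <= row < size and 0 <= col < size:
            patch[row][col] += 1
    return patch

# Hypothetical events from two well-separated moving objects.
events = [(10, 10, 0), (11, 10, 1), (50, 40, 2), (10, 11, 3), (51, 41, 4)]
clusters = cluster_events(events)
patch = to_patch(clusters[0])
```

Every cluster yields a patch of the same shape regardless of how many events it contains, so a single small classifier network can be applied to each tracked object in turn.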

12:30 pm to 1:30 pm

Lunch & Networking

1:30 pm to 2:30 pm

Demos & Posters

Exploiting activation sparsity for easy model optimization in the Akida processor

Sébastien CROUZET, R&D Engineer, Brainchip

Abstract (English)

Neuromorphic processors offer high processing efficiency, but at the cost of an apparent technical barrier for the new user. BrainChip’s Akida processor mitigates this obstacle by running standard convolutional neural networks, and by offering transparent and performant quantization and lossless conversion methods. Neuromorphic efficiency is retained because the event-based processor can exploit the sparsity (high number of zero-values) inherently present in CNN activation maps.
Nonetheless, selection and optimization of models to target Akida (or indeed, any dedicated hardware) can seem daunting without expert knowledge. Here we show that i) by applying simple off-the-shelf optimization techniques, such as regularization and pruning, we can significantly accelerate inference at minimal reduction to accuracy; ii) applying these techniques provides a powerful lever to manipulate the speed/accuracy trade-off according to task requirement; iii) this in turn reduces the requirement to maintain an extensive library of pretrained models to cover a range of task complexities: instead, a handful of pretrained models allows a continuum of final model sizes.
[Method] We select a representative and well-known benchmark task, visual wakewords (VWW), starting from a tinyML-relevant solution, MobileNet V1, alpha=0.25 (as per the tinyML Perf benchmark). We use a transfer learning-based pipeline, with imagenet pretraining, and quantization of both weights and activations to 4-bits (first layer 8-bit weights), 5 epochs of fine tuning on VWW. We test 2 techniques to optimize performance on our hardware: i) activity regularization added during training (mix of L1 and L2 loss), testing model accuracy and sparsity as a function of the amplitude of regularization ii) structured pruning (whole filters), using a well-known metric for filter selection (batch normalization gamma). We first test the “prunability” of each layer individually by evaluating model accuracy (without re-training) as a function of % filters pruned. We then test several pruning strategies for the whole model (fixed % accuracy loss per layer; targeted pruning of the high-cost layers; …).
[Results] We first introduce a minor network optimization from the original MobileNet based on known properties of Akida: for low filter numbers (<=64), the efficient implementation of standard convolutional layers in Akida means that they run as fast as an equivalently sized separable convolution, while offering a moderately increased representational capacity and thus accuracy. Specifically, the first 3 separable conv layers of MobileNet v1 are swapped for standard conv layers, forming what we will call “AkidaNet”. With quantization at 4-bits, and using the alpha=0.25 network, we obtain a baseline accuracy of 83.3% on the VWW validation set, with single-sample (batch-size = 1) processing rate of 66 FPS (latency 15.15 ms). Applying a combination of regularization and pruning to optimize this model, we can accelerate processing to 133 FPS while maintaining >80% accuracy. We then test optimization from larger starting models (alpha=0.5 or even 1.0) and find that we can reduce models sufficiently to achieve processing rates similar to the baseline, but with higher accuracy (a=0.5, 84.1% acc. at 84 FPS; a=1.0, 84.8% acc. at 69 FPS).
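Two pieces of the abstract lend themselves to a short worked sketch: the relation between frame rate and single-sample latency, and filter selection by batch normalization gamma. The code below is illustrative only. The function names and the gamma values are made up, and ranking filters by the absolute value of gamma is an assumption about how the metric is applied, not a description of BrainChip's tooling.

```python
def latency_ms(fps):
    """Per-frame latency in milliseconds when frames are processed one at a time."""
    return 1000.0 / fps

def filters_to_prune(gammas, fraction):
    """Indices of the filters with the smallest |gamma|.

    gammas: one batch normalization scale factor per filter.
    fraction: share of the layer's filters to remove (0.0 to 1.0).
    """
    n_prune = int(len(gammas) * fraction)
    ranked = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    return sorted(ranked[:n_prune])

baseline_latency = round(latency_ms(66), 2)    # the abstract's 66 FPS baseline
optimized_latency = round(latency_ms(133), 2)  # after regularization and pruning

# Hypothetical gamma values for an 8-filter layer, pruning 25% of the filters.
gammas = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05, 0.3, 0.6]
pruned_idx = filters_to_prune(gammas, 0.25)
```

At batch size 1, 66 FPS corresponds to the 15.15 ms quoted in the abstract, and 133 FPS works out to roughly 7.52 ms per frame.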

In-sensor tinyML solutions improve vehicle efficiency and safety

Michele FERRAINA, Senior Application Engineer, STMicroelectronics

Abstract (English)

Increasingly automated and fully electric, tomorrow’s vehicles will integrate complex sensing systems featuring de-centralized and specialized computing solutions with AI skillsets for taking actions autonomously. To ensure an optimized, energy-efficient solution, the required AI workloads need to be distributed across all the involved electronic components. This means that in today’s domain-based architectures, each component must contribute to the car’s overall intelligence. A scalable and effective approach is to use sensors embedding a machine learning (ML) core.
To achieve this synergy, ST recently released a unique high-accuracy, ultra-low-power 6-axis sensor equipped with ST’s proprietary machine learning core. An IMU with an in-sensor engine can be trained using the UNICO toolset, run an ML algorithm, and quickly trigger an action when a specific event is detected.
Thanks to this advanced technology, vehicle optimization starts with a tiny in-sensor machine learning core that improves car safety and reduces energy consumption for a more sustainable solution.

Accurate Estimation of the CNN Inference Cost within Microcontrollers

Thomas GARBAY, Ph.D Student, Sorbonne University - LIP6

Abstract (English)

A Convolutional Neural Network’s (CNN) accuracy comes with significant inference costs: substantial energy consumption, high latency, and a significant memory footprint. The Tiny Machine Learning (TinyML) paradigm aims at bringing machine learning inference to ultra-low-power devices like microcontroller units (MCUs). Estimating the CNN’s inference cost within a given MCU is a key design step. We introduce an estimation method for the energy consumption, the latency, and the memory space required by a CNN within MCUs. The CNNs investigated are LeNet5 and ResNet8 within several MCUs. On a range of 14 frequencies, the proposed method shows an average estimation error of 6.91% on energy consumption, 8.79% on latency, and 3.93% on the required memory space.
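For readers unfamiliar with the metric, an average estimation error of this kind can be computed as a mean relative error between estimated and measured values. The abstract does not state the exact formula used, so the definition below is an assumption, and the numbers are invented for illustration and are not taken from the work.

```python
def mean_relative_error(estimated, measured):
    """Mean of |estimate - measurement| / measurement, as a percentage."""
    errors = [abs(e - m) / m for e, m in zip(estimated, measured)]
    return 100.0 * sum(errors) / len(errors)

# Made-up latency figures (ms) at three clock frequencies.
measured_ms = [10.0, 20.0, 40.0]
estimated_ms = [11.0, 19.0, 42.0]
error_pct = mean_relative_error(estimated_ms, measured_ms)
```

Here the individual errors are 10%, 5% and 5%, giving an average of about 6.67%.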

NimbleAI: Towards Next-generation Integrated Sensing-Processing Neuromorphic Endpoint Devices

Xabier ITURBE, Senior research engineer and EU project coordinator, Ikerlan

Abstract (English)

The Horizon Europe NimbleAI research project brings together 19 EU and UK partners for three years starting in October 2022. NimbleAI leverages key principles of energy-efficient visual sensing and processing in biological eyes and brains, and harnesses the latest advances in 3D stacked silicon integration, to create an integral sensing-processing neuromorphic architecture that efficiently and accurately runs computer vision algorithms in resource- and area-constrained endpoint chips. The rationale behind the NimbleAI architecture is to sense only data with high value potential and to discard data as soon as they are no longer useful for the application, well before they enter the deeper layers of the 3D processing architecture. The NimbleAI sensing-processing architecture is to be specialized after deployment by exploiting optimal application-specific system-level trade-offs to amplify efficiency when running each particular computer vision algorithm in a given context. The target objectives of the NimbleAI chip are: (1) 100x performance-per-mW gains compared to state-of-the-practice solutions (i.e., CPUs/GPUs processing frame-based video); (2) 50x processing latency reduction compared to CPUs/GPUs; (3) energy consumption on the order of tens of mW; and (4) silicon area of approx. 50 mm².

Efficient Convolutional Neural Network for Aerial Image Classification for Drone-Based Emergency Monitoring

Christos KYRKOU, Research Lecturer, KIOS CoE, University of Cyprus

Abstract (English)

Deep learning-based algorithms can provide state-of-the-art accuracy for remote sensing technologies such as unmanned aerial vehicles (UAVs)/drones, potentially enhancing their remote sensing capabilities for many emergency response and disaster management applications. In particular, UAVs equipped with camera sensors can operate in remote and difficult-to-access disaster-stricken areas, analyze the images, and raise an alert in the presence of calamities such as collapsed buildings, floods, or fires, so that their effects on the environment and on the human population can be mitigated faster. However, the integration of deep learning introduces heavy computational requirements, preventing the deployment of such deep neural networks in many scenarios that impose low-latency constraints on inference in order to make mission-critical decisions in real time.
Furthermore, a unique set of constraints needs to be addressed because a UAV has to operate in disaster-stricken areas where remote communication with cloud services may be impossible to establish and high-end infrastructure is not available. Combining path planning algorithms with automated on-board machine learning processing lets UAVs operate autonomously and cover a larger area in less time. However, on-board processing comes with its own set of challenges due to the limited computational resources and low-power constraints imposed by the low payload capacity of UAVs. As such, the computational efficiency of the underlying machine learning algorithm plays a key role in enabling autonomous UAVs to detect hazards at incident scenes in real time. There are therefore still gaps in the literature when it comes to real-time natural disaster detection.

Tiny-PULP-Dronets: Squeezing Neural Networks for Faster and Lighter Inference on Multi-Tasking Autonomous Nano-Drones

Lorenzo LAMBERTI, PhD Student, University of Bologna

Abstract (English)

Pocket-sized autonomous nano-drones can revolutionize many robotic use cases, such as visual inspection in narrow, constrained spaces, and ensure safer human-robot interaction thanks to their tiny form factor and weight, i.e., tens of grams. This compelling vision is challenged by the high level of intelligence needed aboard, which clashes with the limited computational and storage resources available on the PULP (parallel ultra-low-power) MCU-class navigation and mission controllers that can be hosted aboard. This work starts from PULP-Dronet, a state-of-the-art convolutional neural network for autonomous navigation on nano-drones. We introduce Tiny-PULP-Dronet: a novel methodology to shrink the model size by more than one order of magnitude (50× fewer parameters) and the number of operations required to run inference (27× fewer multiply-and-accumulate operations) while achieving flight performance similar to PULP-Dronet. This massive reduction paves the way towards affordable multitasking on nano-drones, a fundamental requirement for achieving high-level intelligence.

A Multi-label Time Series Classification Approach for On-Sensor Non-intrusive Water End-use Monitoring

Kleanthis MALIALIS, Research Associate, University of Cyprus

Abstract (English)

Introduction. As water scarcity increasingly affects the world, the development of new strategies focusing on water conservation has become crucial. To this end, investments have been made in recent years in data and information technologies to facilitate the use of smart water meters in households. Smart metering of domestic water consumption, which continuously monitors the usage of different water fixtures and appliances, has been shown to have an impact on people’s behaviour towards water conservation [1]. Currently available smart water meters have basic low-power processing capabilities, which are used to transmit measurements and simple alerts over a period of multiple years. We explore the use of machine learning embedded in the smart meters to extend these capabilities to online appliance classification. In this work, we consider the problem of non-intrusive water usage monitoring in households as our case study, using on-sensor (water meter) real-time classification and monitoring [2]. We address it by identifying active water-consuming fixtures and appliances using only data from the main water flow meter, which measures the total household consumption. Our proposed methodology does not require a priori identification of events and, to the authors’ knowledge, is considered here for the first time.

Towards Universal 1-bit Weight Quantization of Neural Networks with end-to-end deployment on ultra-low power sensors

Le MINH TRI, Ph.D. student in deep learning, TDK InvenSense

Abstract (English)

Machine learning has proven its strength and versatility in both academia and industry, and is expected to keep growing for the foreseeable future. Not without challenges lying at its frontier, machine learning has now met the large world of tiny hardware. In the case of MEMS, embedded hardware has distinct objectives and challenges: it faces both memory and computation constraints (integers, unsupported ops, …) and operates always-on with limited latency and power usage, yet it must process sensor data efficiently and give the expected outputs. One of the first and biggest challenges of tinyML is to find feasible solutions that meet the tight constraints of low-power devices without losing full-scale performance. Thus, quantization that is effective both in preserving the original performance and in hardware integrability is a critical step towards robust tinyML. Hence, we favor methods that integrate easily into common low-power hardware over fancier algorithms that require building non-trivial support. Currently, at TDK InvenSense we have an end-to-end tinyML platform that enables training and deployment of machine learning models (decision trees and standard neural networks: CNNs, RNNs, dense networks, and various activations) on any low-power MCU platform for sensor applications. Neural network weights are quantized to 8 bits and run with full integer inference. For our sensor applications (motion activity, audio detection, …), we obtain ultra-low-power-friendly models that are deployed in less than ~10 kB total (code + model), with low latency and power consumption. We found that the accuracy of our 8-bit quantized models is consistently close to that of their floating-point counterparts (a loss of <1-2% accuracy in general), which means we can go further!
To improve our quantization, we focus on a promising 1-bit weight quantization approach for neural networks that optimizes the model and the weights at the same time during training. It has a low training overhead and is hassle-free, scalable, and automatable. We show that this method can generalize to n-bit quantization, provided sufficient int-n support is available on the edge device. The algorithm is model-agnostic and can be integrated into any training phase of the TinyMLOps workflow, for any application. We show that almost all weights converge to binary values or close to them, and that the large majority stabilize during training. We will also explore binary quantization trade-offs and challenges: binary model size vs. full-scale performance, the binary rounding problem, and hybrid quantization. We test our binary quantization algorithm on MNIST and on a bring-to-see (face-to-wake) dataset for smartwatches, compare it to related approaches [1], [2], and discuss limitations. We found minimal accuracy loss (max. 2-3%) and very robust results over independent repetitions.
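To make the difference between 8-bit and 1-bit weights concrete, here is a minimal sketch of two generic post-training schemes: symmetric int8 quantization and sign binarization with a shared scaling factor. It is not the training-time method of the talk, which the abstract does not detail; the function names and the choice of the mean absolute weight as scaling factor are assumptions for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the int8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def binarize(weights):
    """1-bit quantization: w is approximated by alpha * sign(w)."""
    alpha = sum(abs(w) for w in weights) / len(weights)  # shared scaling factor
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


weights = [0.4, -1.0, 0.2, 0.0]
q, scale = quantize_int8(weights)
signs, alpha = binarize(weights)
print(q)      # [51, -127, 25, 0]
print(signs)  # [1, -1, 1, 1]
```

With int8, each weight is reconstructed to within half a quantization step. With one bit, only the sign survives, which is why the abstract stresses optimizing the weights during training rather than rounding them afterwards.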

Transient Fault Analysis of Resource Constrained Neural Inference on tinyTPU

Panagiota NIKOLAOU, Research Associate, KIOS CoE, University of Cyprus

Abstract (English)

The use of Neural Networks (NNs) in a variety of critical applications has led to the development of new accelerators that drive computation towards the Edge. However, the Edge consists of highly constrained resources with limited power budget and area footprint. On the other hand, in critical applications such as wearable biomedical diagnostics and unmanned aerial and terrestrial vehicles [1],[2], reliability is becoming increasingly crucial, and the extra protection needed for fault coverage costs area and power, something highly constrained edge devices cannot afford. However, protection is in some cases inevitable, because transient errors that occur in specific parts of edge-based critical applications can cause catastrophic events through silent data corruption (SDC) errors or application crashes [7]. Selective protection can significantly reduce area and power overheads with minimal impact on the system’s reliability [3]. To enable selective protection of the most vulnerable modules of a system, a fine-grained vulnerability analysis is needed.
To this end, in this work we provide a fine–grained vulnerability analysis of a highly resource constrained edge accelerator architecture by extending the emulation–based Fault Injection Mechanism (FIM) presented in [4],[5] as well as the analysis in [6]. The FIM allows for both coarse– and fine–grained vulnerability analysis using hardware–based models. It considers transient faults and can inject single event upsets (SEUs), as well as multiple event upsets (MEUs) [7]. Each fault injection is executed on prototyped FPGA–based hardware, decreasing substantially the fault campaign’s overall execution time.
Our vulnerability analysis focuses on SEUs leading to SDC errors in the classification output of the NN inference. SDC errors happen when there is an output discrepancy between the fault-free and faulty execution, and they can manifest differently according to their exact impact on the classification outcome (correct or not). In particular, our analysis provides an error outcome taxonomy that categorizes SDCs into (i) tolerable errors, which affect a part of the output (the confidence level) but do not affect the actual classification decision, (ii) critical errors, which convert a correct classification into an incorrect one, (iii) no-impact errors, which change a misclassified output into a different misclassification, and (iv) beneficial errors, which have a positive impact by changing a wrong classification output into a correct one.
The evaluation considers the tinyTPU architecture [8], a systolic array architecture geared towards edge devices that resembles Google’s TPU [9], while executing inference on the MNIST dataset [10]. The dataset uses 28 × 28-pixel grayscale images of handwritten digits from 0 to 9 as input and includes a set of 60,000 images for the training phase and a set of 10,000 images for the testing phase, which we use during the evaluated inference phase. The considered tinyTPU architecture is given in Figure 1. It consists of ten main components: Unified Buffer (UB), Systolic Array Control (SAC), Control Weight (WBctrl), Control MMU (MMUctrl), Control Activation (ACTctrl), Control Coordinator (COORDctrl), Weight Buffer (WB), Matrix Multiply Unit (MMU), Accumulator unit (ACC), and Activation unit (ACT). The fault injection mechanism provides vulnerability analysis using hardware-based models for fault injection, which are integrated with the system to be evaluated, as shown in Figure 1. Fault injection experiments are applied either at the entire system level or at the component or register level for finer-grained evaluation. Each granularity level has its dedicated randomization mechanism, which selects when and where to inject a fault.
The vulnerability of the ten components is evaluated by injecting a total of 138,127 SEUs, providing a statistically significant fault campaign. Figure 2 shows the Architectural Vulnerability Factor (AVF) of each component for the different SDC categories on a logarithmic scale. In Figure 2, AVF denotes the probability of an SDC error. COORDctrl causes only application crashes, so this component is excluded from the presented results. Figure 2 indicates that the component with the highest AVF for critical SDCs is the WB. This happens because all the data (weights) are stored in the WB from the beginning of an execution, so an error in this component persists. In other words, once a bit-flip occurs in the WB, it affects all the remaining images. Still, many SEUs are shown to be tolerable (TCC and TMC in Figure 2) or of no impact, and some are shown to be beneficial.
In this work we also extend the analysis to provide performance-accuracy trade-offs using various smaller NN models, and we provide a transient error mitigation method for the critical WB component using periodic memory refresh. Furthermore, we show how to considerably reduce the emulation time needed for the vulnerability analysis of the tinyTPU components when bit-flips from SEUs are non-persistent.
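The error outcome taxonomy described in this abstract can be written down as a small decision procedure. The sketch below is one possible reading of the four SDC categories; the function names, the extra "masked" label for injections that cause no output discrepancy, and the per-category AVF computed as a simple fraction of injections are assumptions for illustration, not the authors' tooling.

```python
from collections import Counter


def classify_outcome(label, golden_pred, faulty_pred, golden_conf, faulty_conf):
    """Compare a fault-free (golden) run with a faulty run of the same image."""
    if golden_pred == faulty_pred:
        if golden_conf == faulty_conf:
            return "masked"       # no output discrepancy, so not an SDC
        return "tolerable"        # confidence changed, decision unchanged
    if golden_pred == label:
        return "critical"         # correct classification became incorrect
    if faulty_pred == label:
        return "beneficial"       # wrong classification became correct
    return "no_impact"            # one misclassification became another


def avf_per_category(outcomes):
    """Fraction of injected faults that ended in each SDC category."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {cat: n / total for cat, n in counts.items() if cat != "masked"}


print(classify_outcome(7, 7, 1, 0.9, 0.4))  # critical
print(classify_outcome(7, 1, 7, 0.6, 0.7))  # beneficial
```

Running `classify_outcome` over every injection in a campaign and passing the results to `avf_per_category` gives per-category probabilities of the kind the abstract reports per component.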

Automatic-Structured-Pruning

Marcus RUB, Research assistant, Hahn-Schickard

Abstract (English)

We have implemented a framework that supports developers in automatically applying structured pruning to their neural networks. Previous pruning frameworks apply only weight pruning. Although the unnecessary weights are set to zero, memory must still be provided for them, so neither the number of weights nor the required memory space is reduced. For this reason, the Automatic-Structured-Pruning tool was developed. This tool makes it possible to perform pruning for convolutional and fully connected layers. Individual filters or neurons are deleted directly from the respective layers. This reduces both the number of weights and the memory requirements.
The tool allows two different pruning approaches:
– Factor: For the fully connected and convolutional layers, a factor is specified in each case, which indicates the percentage of neurons or filters to be deleted from the layer.
– Accuracy: The minimum accuracy or the maximum loss of accuracy is specified. This defines which accuracy the neural network should still reach after pruning.
The framework is still under development and will be extended from time to time.
Additionally, Automatic-Structured-Pruning is part of [AutoFlow](https://github.com/Hahn-Schickard/AutoFlow), a tool that helps developers implement machine learning (ML) faster and more easily on embedded devices. It aims to cover the whole workflow of a data scientist, from building the ML model, through selecting the target platform, to optimizing and implementing the model on the target platform.
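The difference between zeroing weights and deleting whole filters can be sketched in a few lines. The example below illustrates the "Factor" approach on plain Python lists; it is not the API of the Automatic-Structured-Pruning tool, and ranking filters by their L1 norm is an assumed selection criterion that the abstract does not specify.

```python
def l1_norm(filter_weights):
    return sum(abs(w) for w in filter_weights)


def prune_filters(filters, factor):
    """Delete the given fraction of filters (or neurons) with the smallest L1 norm.

    filters: one flat list of weights per filter. Returns (kept filters, kept indices).
    """
    n_remove = int(len(filters) * factor)
    by_importance = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    removed = set(by_importance[:n_remove])
    kept = [i for i in range(len(filters)) if i not in removed]
    return [filters[i] for i in kept], kept


def drop_input_channels(next_layer, kept):
    """Remove the inputs of the following layer that were fed by deleted filters.

    next_layer[o][c] is the weight from input channel c to output o.
    """
    return [[row[c] for c in kept] for row in next_layer]


filters = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.5], [-0.03, 0.0]]
pruned, kept = prune_filters(filters, factor=0.5)
print(kept)  # [0, 2]
print(drop_input_channels([[1, 2, 3, 4], [5, 6, 7, 8]], kept))  # [[1, 3], [5, 7]]
```

Because the filters are removed rather than zeroed, both the pruned layer and the layer that consumes its output become smaller, which is where the memory saving comes from.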

Bringing face-swapping to the very edge with Phinets

Alberto ANCILOTTO, Researcher, Fondazione Bruno Kessler (FBK)

Abstract (English)

Face swapping is a novel AI technique used to preserve user privacy in video analytics applications. It is based on the generation of images with synthetic faces in place of the original ones, thus removing sensitive information from the frame. Moreover, performing face swapping at the edge adds an extra layer of security to the data streaming pipeline. This contribution presents our TinyML approach based on PhiNets applied to Generative Adversarial Networks (GANs) for face swapping. In particular, we targeted a resource-constrained platform based on a low-power microcontroller, the Kendryte K210, with a dual-core RISC-V processor running at 400 MHz and consuming less than 300 mW overall.

2:30 pm to 3:30 pm

Algorithms, Software & Tools session - Part I

Bringing highly efficient inference models to real devices requires innovation in key areas of network architecture, tool chain and embedded SW to optimise performance and energy.

The Algorithms, SW & tools session will explore some of the recent advancements in the field, while uncovering some of the challenges we are facing as an industry.

Session Moderator: Martin CROOME, Vice President Marketing, GreenWaves

Wavelet Feature Maps Compression for Image-to-Image CNNs

Eran TREISTER, Assistant Professor, Ben Gurion University

Abstract (English)

Convolutional Neural Networks (CNNs) are known for requiring extensive computational resources, and quantization is among the best and most common methods for compressing them. While aggressive quantization (i.e., less than 4 bits) performs well for classification, it may cause severe performance degradation in image-to-image tasks such as semantic segmentation and depth estimation, where a high resolution is required for obtaining high accuracy. In this work, we propose Wavelet Compressed Convolution (WCC), a novel approach for compressing high-resolution activation maps that is integrated with point-wise convolutions, which are the main computational cost of modern architectures. To this end, we use an efficient and hardware-friendly Haar wavelet transform, known for its effectiveness in image compression, and define the convolution on the compressed activation map. We experiment on various tasks that benefit from high-resolution input, and by combining WCC with light quantization, we achieve compression rates equivalent to 1-4 bit activation quantization with relatively small and much more graceful degradation in performance.
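The Haar wavelet transform at the heart of this approach needs only additions, subtractions, and a scaling, which is what makes it hardware-friendly. The following sketch shows a one-level 2D Haar transform of a single activation map and a crude compression step that keeps only the low-frequency band. It illustrates the transform only; how WCC selects coefficients and defines the convolution on them is not covered, and the function names are made up for this example.

```python
def haar2d(x):
    """One-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns four sub-bands (LL, LH, HL, HH), each half the size of x.
    """
    h, w = len(x), len(x[0])
    ll, lh, hl, hh = ([[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b, c, d = x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2  # local average
            lh[i // 2][j // 2] = (a - b + c - d) / 2  # horizontal detail
            hl[i // 2][j // 2] = (a + b - c - d) / 2  # vertical detail
            hh[i // 2][j // 2] = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh


def inverse_haar2d(ll, lh, hl, hh):
    """Rebuild the full-resolution map from the four sub-bands."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            p, q, r, s = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (p + q + r + s) / 2
            x[2 * i][2 * j + 1] = (p - q + r - s) / 2
            x[2 * i + 1][2 * j] = (p + q - r - s) / 2
            x[2 * i + 1][2 * j + 1] = (p - q - r + s) / 2
    return x


def keep_low_band(ll):
    """4x compression: drop all detail coefficients and reconstruct from LL only."""
    zeros = [[0.0] * len(ll[0]) for _ in ll]
    return inverse_haar2d(ll, zeros, zeros, zeros)


smooth = [[1, 1, 4, 4], [1, 1, 4, 4], [2, 2, 8, 8], [2, 2, 8, 8]]
ll, lh, hl, hh = haar2d(smooth)
print(ll)  # [[2.0, 8.0], [4.0, 16.0]]
```

For a piecewise-constant map like `smooth`, all detail bands are zero and the map is recovered exactly from a quarter of the coefficients. Since a point-wise (1×1) convolution mixes channels while the Haar transform mixes spatial positions, and both are linear, the two operations commute, which is what allows a convolution to be applied in the transformed domain.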

Performing object detection on constrained devices

Louis MOREAU, Senior DevRel Engineer, Edge Impulse

Abstract (English)

In this session, we will go over the existing image processing approaches used in tinyML, from simple and lightweight image classification techniques to more complex and compute-intensive ones, such as object detection or image segmentation, which are typically not suitable for constrained devices. We will then introduce a new technique that can be used to perform object detection on constrained devices: FOMO. FOMO (Faster Objects, More Objects) is a novel open-source machine-learning algorithm that enables object detection on microcontrollers. With this new architecture, you will be able to locate and count objects, as well as track multiple objects in an image in real time, using up to 30x less processing power and memory than MobileNet SSD or YOLOv5.

3:30 pm to 4:00 pm

Coffee and Networking

4:00 pm to 5:30 pm

Algorithms, Software & Tools session - Part II

Session Moderator: Elad BARAM, VP Products, Emza Visual Sense

Bringing face-swapping to the very edge with Phinets

Elisabetta FARELLA, Lead researcher of the Energy Efficient Embedded Digital Architectures, Fondazione Bruno Kessler

Abstract (English)

Face swapping is a novel AI technique that preserves user privacy in video analytics applications. It is based on the generation of images with synthetic faces in place of the original ones, thus removing sensitive information from the frame. Moreover, performing face swapping at the edge adds an extra layer of security to the data streaming pipeline. This contribution presents our TinyML approach based on PhiNets applied to Generative Adversarial Networks (GANs) for face swapping. In particular, we targeted a resource-constrained platform based on a low-power microcontroller, the Kendryte K210, with a dual-core RISC-V processor running at 400 MHz and consuming less than 300 mW overall.

Whole-model optimization with Apache TVM

Andrew REUSCH, Software Engineer, OctoML

Abstract (English)

Optimized deep learning kernels are crucial for achieving good performance in deployed ML models. Increasingly, developers are turning to deployment tools to assemble these optimized kernels into full models. However, per-kernel optimizations have limited impact on a full model’s throughput, especially when heterogeneous compute platforms are in use or when the underlying hardware is designed to execute operators concurrently.
In this talk, I’ll describe how Relax, a new model-level language in Apache TVM, enables hardware vendors to easily apply common hardware optimization techniques such as striping and global memory planning across the full program. With Relax, these techniques can be easily tailored towards both individual accelerators and heterogeneous compute environments. Lastly, I’ll discuss our future plans to integrate Relax with Apache TVM’s Ahead-of-Time compilation flow, making it available in a low-overhead runtime targeted at bare-metal environments.
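Global memory planning is easiest to picture as packing tensor buffers into one arena so that tensors whose lifetimes do not overlap share the same bytes. The sketch below is a generic greedy-by-size, first-fit planner written for illustration; it does not use or reproduce any Apache TVM or Relax API, and the tensor tuples are invented.

```python
def plan_memory(tensors):
    """Assign arena offsets to tensors given as (name, size, first_use, last_use).

    Larger tensors are placed first; each takes the lowest offset that does not
    collide with an already placed tensor that is live at the same time.
    """
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Only tensors whose lifetimes overlap this one constrain its placement.
        conflicts = sorted((o, s) for o, s, f, l in placed
                           if not (l < first or last < f))
        offset = 0
        for o, s in conflicts:
            if offset + size <= o:
                break                      # fits in the gap before this buffer
            offset = max(offset, o + s)    # otherwise skip past it
        placed.append((offset, size, first, last))
        offsets[name] = offset
    peak = max(o + s for o, s, _, _ in placed)
    return offsets, peak


# A three-operator chain: "a" feeds "b", "b" feeds "c".
chain = [("a", 100, 0, 1), ("b", 50, 1, 2), ("c", 100, 2, 3)]
offsets, peak = plan_memory(chain)
print(offsets, peak)  # {'a': 0, 'c': 0, 'b': 100} 150
```

Here `a` and `c` are never live together, so they share offset 0 and the arena needs 150 bytes instead of the 250 a per-tensor allocation would use. Planning across the whole program, rather than per kernel, is what exposes this kind of reuse.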

Sound Classification Model Robustness using Augmentation and Similarity Loss

Dima LVOV, Deep Learning Algorithms Team Leader, Synaptics

Abstract (English)

Deep Neural Networks (DNNs) are a popular algorithmic tool for classification and regression tasks. Sound Event Detection (SED) is the classification task of recognizing sound events captured by a microphone, together with their respective temporal start and end points. High performance can be achieved using DNNs in controlled lab environments, but it often deteriorates significantly in real-life scenarios that include ambient noise, polyphonic environments, and room reverberation. This research presents a novel combination of a similarity loss and a data augmentation method, in the form of a pairs-training process, for achieving better robustness to these factors. Our experiments show the superiority of the proposed method over a regular data augmentation scheme.
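One way to picture a pairs-training process is a loss computed on a clean clip and its augmented copy together: both must be classified correctly, and their internal representations are pulled towards each other. The sketch below assumes cosine distance as the similarity term and a simple weighted sum; the abstract does not state which similarity measure or weighting the authors use, so every name and choice here is hypothetical.

```python
import math


def cross_entropy(probs, label):
    # Negative log-probability assigned to the true class.
    return -math.log(probs[label])


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pair_loss(clean_probs, aug_probs, clean_emb, aug_emb, label, weight=1.0):
    """Classification loss on both views plus a penalty on embedding mismatch."""
    classification = cross_entropy(clean_probs, label) + cross_entropy(aug_probs, label)
    similarity = cosine_distance(clean_emb, aug_emb)
    return classification + weight * similarity


# Same predictions, but the noisy copy is embedded differently in the second case.
same = pair_loss([0.1, 0.9], [0.2, 0.8], [1.0, 0.0], [1.0, 0.0], label=1)
drifted = pair_loss([0.1, 0.9], [0.2, 0.8], [1.0, 0.0], [0.0, 1.0], label=1)
print(drifted > same)  # True
```

With plain augmentation only the two classification terms would be present. The extra term penalizes a network whose embedding of the noisy or reverberant copy drifts away from the embedding of the clean clip.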

9:00 am to 9:15 am

Opening

9:15 am to 10:00 am

Keynote by Alberto Sangiovanni Vincentelli

Neuromorphic computing, tools and applications for TinyML

Alberto L. SANGIOVANNI-VINCENTELLI, Professor, The Edgar L. and Harold H. Buttner Chair of EECS, UC Berkeley

Abstract (English)

TinyML requires particular attention to resource constraints such as power consumption and computational and memory limitations. ML is notoriously expensive in exactly the resources that are constrained in tiny devices. I will present a particular approach to neuromorphic computing as an architecture that is suitable for TinyML given the power, memory, and computation envelope. I will first discuss the tools that could be deployed for this particular architecture, and I will outline the need to combine HW and SW design in a tight loop to obtain an optimal design. Finally, I will present the challenges posed by some applications, with particular attention to the automotive industry.

10:00 am to 11:00 am

tinyML Ecosystem Development

Towards a thriving TinyML Ecosystem Strategy and Roadmap for EMEA.

11:00 am to 11:30 am

Coffee & Networking

11:30 am to 1:00 pm

Hardware and Sensors

This session will cover innovation and advances within tinyML HW and sensors. Improvement of tinyML applications is dependent on innovation in both the tinyML HW as well as in the sensor technology. For many applications, the sensor and processing will be highly integrated, and make use of new innovative technologies to push the boundaries for power consumption and the necessary compute, both in terms of memory technology and arithmetic design, including the use of analog. New sensors and new hardware push the need for innovation in tooling, for example for novel ways to quantize and prune networks.

Session Moderator: Tomas EDSÖ, Senior Principal Design Engineer, Arm

Event-based sensing and computing for efficient edge artificial intelligence and TinyML applications

Federico CORRADI, Senior Neuromorphic Researcher, IMEC

Abstract (English)

The advent of neuro-inspired computing represents a paradigm shift for edge Artificial Intelligence (AI) and TinyML applications. Neurocomputing principles enable the development of neuromorphic systems with strict energy and cost reduction constraints for signal processing applications at the edge. In these applications, the system needs to accurately respond to the data sensed in real-time, with low power, directly in the physical world, and without resorting to cloud-based computing resources.

In this talk, I will introduce key concepts underpinning our research: on-demand computing, sparsity, time-series processing, event-based sensory fusion, and learning. I will then showcase some examples of a new sensing and computing hardware generation that employs these neuro-inspired fundamental principles for achieving efficient and accurate TinyML applications.

Specifically, I will present novel computer architectures and event-based sensing systems that employ spiking neural networks with specialized analog and digital circuits. These systems use an entirely different model of computation than our standard computers. Instead of relying upon software stored in memory and fast central processing units, they exploit real-time physical interactions among neurons and synapses and communicate using binary pulses (i.e., spikes). Furthermore, unlike software models, our specialized hardware circuits consume low power and naturally perform on-demand computing only when input stimuli are present. These advancements offer a route toward TinyML systems composed of neuromorphic computing devices for real-world applications.
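The idea of computing only when input stimuli are present can be illustrated with a software model of a single leaky integrate-and-fire neuron that is updated once per input spike rather than once per clock tick. This is a textbook-style sketch with made-up parameter values, not a description of the hardware presented in the talk.

```python
def lif_event_driven(event_times, weight=0.6, leak=0.9, threshold=1.0):
    """Return the times at which the neuron emits an output spike.

    The membrane potential decays by `leak` per time step, but the decay over
    a silent interval is applied in one go when the next input event arrives.
    """
    potential, last_time, spikes = 0.0, 0, []
    for t in sorted(event_times):
        potential *= leak ** (t - last_time)  # catch up on the idle period
        potential += weight                   # integrate the incoming spike
        last_time = t
        if potential >= threshold:
            spikes.append(t)                  # fire and reset
            potential = 0.0
    return spikes


print(lif_event_driven([0, 1, 10, 11, 12]))  # [1, 11]
print(lif_event_driven([0, 20, 40]))         # []
```

Closely spaced input events push the potential over the threshold, while isolated events leak away before the next one arrives. The loop runs once per input event, so a quiet input costs no updates at all, which is the software analogue of the on-demand behaviour described above.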

Optimizing data-flow in Binary Neural Networks

Lorenzo VORABBI, DL Labs Vision and Processing Methods Team Leader Support, Datalogic

Abstract (English)

Binary Neural Networks (BNNs) can significantly accelerate the inference time of a neural network by replacing its expensive floating-point arithmetic with bit-wise operations. Most existing solutions, however, do not fully optimize data flow through the BNN layers, and intermediate conversions from 1 to 16/32 bits often further hinder efficiency (row a of Figs. 1i and 1ii).

Figure 1: (i) VGG-style block. (ii) ResNet-style block. a) Standard BNN blocks used in [3] and [4]. b) BNN block with output convolution clipping used during training. c) Optimized BNN block adopted during inference. The popcount operation is performed using saturation arithmetic in order to keep the data width at 8 bits at inference time. BN is replaced by a comparison in case i, while in case ii BN is 8-bit quantized.

Our contributions (graphically highlighted in Figs. 1i and 1ii) can be summarized as follows:
• A novel training scheme is proposed to improve the data flow in the BNN pipeline; specifically, we introduce a clipping block that shrinks the data width from 32 to 8 bits while simultaneously reducing the size of the internal accumulator (row b of Fig. 1), which is usually kept at 32 bits to prevent data overflow, without losing accuracy.
• We provide an optimization of the Batch Normalization layer that decreases latency and simplifies deployment.
• We present an optimized implementation of the Binary Direct Convolution method for ARM instruction sets.

To prove the effectiveness of the proposed optimizations, we provide experimental evaluations that show the speed-up relative to state-of-the-art BNN engines such as LCE [1] and DaBNN [2] on real hardware devices (Raspberry Pi Model 3B and 4B with 64-bit OS). Our experiments show a consistent improvement in inference speed, up to 1.77× and 1.9× compared to LCE and DaBNN respectively, with no drop in accuracy for at least one full-precision model.
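For readers new to BNNs, the building blocks mentioned above can be sketched in plain Python: a binary dot product computed with XNOR and popcount, an accumulator saturated to the 8-bit range, and a Batch Normalization followed by a sign function folded into a single threshold comparison. This is a generic illustration, not the authors' ARM implementation; the function names are invented, and the threshold formula assumes a positive BN scale (gamma).

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer (bit set means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def binary_dot(a_bits, w_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")
    return 2 * matches - n  # each match adds +1, each mismatch adds -1


def saturate_int8(x):
    """Clamp an accumulator value to the signed 8-bit range."""
    return max(-128, min(127, x))


def bn_sign_threshold(gamma, beta, mean, std):
    """Threshold t such that sign(BN(x)) is +1 exactly when x >= t (gamma > 0)."""
    return mean - beta * std / gamma


a = pack_bits([1, -1, 1, 1])
w = pack_bits([1, 1, -1, 1])
print(binary_dot(a, w, 4))                           # 0
print(saturate_int8(binary_dot(2**200 - 1, 2**200 - 1, 200)))  # 127
```

The second call shows why saturation matters: a 200-element binary dot product can reach 200, which does not fit in 8 bits unless it is clipped.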

References
[1] Tom Bannink, Adam Hillier, Lukas Geiger, Tim de Bruin, Leon Overweel, Jelmer Neeven, and Koen Helwegen. Larq compute engine: Design, benchmark and deploy state-of-the-art binarized neural networks. Proceedings of Machine Learning and Systems, 3:680–695, 2021.
[2] Jianhao Zhang, Yingwei Pan, Ting Yao, He Zhao, and Tao Mei. dabnn: A super fast inference framework for binary neural networks on arm devices. In Proceedings of the 27th ACM international conference on multimedia, pages 2272–2275, 2019.
[3] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
[4] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European conference on computer vision (ECCV), pages 722–737,

Architecture-Compiler Co-Optimization of Computing-in-Memory Edge-AI Accelerators

Jan Moritz JOSEPH, Postdoctoral Researcher, RWTH Aachen University

Abstract (English)

Resistive Random Access Memory (RRAM) is an emerging technology with high potential for many analog and digital applications. RRAM is used as analog elements in neuromorphic circuits [1], as a general non-volatile memory device [2], or as non-volatile logic gates [3]. Non-volatile memories offer further qualitative advantages such as multi-bit capability, CMOS compatibility, low power consumption, smaller size compared with SRAM cells, and faster access speed in the case of ion-based memristors. A memristor-based cell can store not only two (like SRAM cells) but several states.
As a key advantage of RRAM as emerging memory technology, it can offer computations in the memory cell (computing in-memory, CIM). The stored data can be read by an applied voltage and integrated into a resistor network. Hence, it is a data-storing and data-processing device which saves area (from reduced memory bandwidth) and energy (from reduced data movements). Furthermore, it enables low-latency on-device inference.
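The physics behind computing in memory can be simulated in a few lines: each cell's conductance stores a weight, input voltages are applied to the rows, and by Ohm's and Kirchhoff's laws the current collected on each column is a weighted sum. The sketch below models an ideal crossbar with no wire resistance or noise; the function names, conductance range, and number of levels are assumptions chosen for illustration.

```python
def crossbar_mvm(conductances, voltages):
    """Column currents of an ideal crossbar: I_j = sum_i G[i][j] * V[i].

    conductances[i][j] is the cell between input row i and output column j
    (in siemens); voltages[i] is the voltage applied to row i (in volts).
    """
    n_cols = len(conductances[0])
    return [sum(conductances[i][j] * voltages[i] for i in range(len(voltages)))
            for j in range(n_cols)]


def weights_to_conductances(weights, g_min, g_max, levels):
    """Map weights in [0, 1] onto a limited number of conductance levels.

    Models a multi-bit cell that can hold `levels` distinct states.
    """
    step = (g_max - g_min) / (levels - 1)
    return [[g_min + round(w * (levels - 1)) * step for w in row] for row in weights]


G = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
V = [0.1, 0.2]
currents = crossbar_mvm(G, V)  # approximately [7e-07, 1e-06] amperes
```

The whole matrix-vector product happens in one read operation, with no weight ever leaving the array, which is the source of the energy savings from reduced data movement mentioned above. The limited number of conductance levels is why quantization matters for this technology.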
These advantages of RRAM technology make it a promising post-CMOS technology. It has become a vivid research topic with contributions from academia and industry working on devices, circuits, architectures, systems, and algorithms to enable neuromorphic computing. One of the most important promises of neuromorphic computing using emerging CIM technologies is energy savings for ML applications. These savings make it a perfect candidate for building edge-AI systems.
In the first part of this talk, we will summarize the current state-of-the-art for neuromorphic edge-AI accelerators to motivate their key advantage. We will also discuss this technology’s challenges before mass-market adoption is possible. In the second part of the talk, we will focus on one of these challenges, namely the co-optimization of architectures and compilers. Finally, in the third part, we will introduce our proposed solution.
We are convinced that efficient, widely adopted neuromorphic systems will only be possible if they are integrated into existing edge-AI software stacks. Otherwise, seamless and risk-free migration from existing CMOS technology will not be attractive or realistic for most companies, even though emerging memories offer substantial power savings. Therefore, it is imperative to provide an ecosystem of software and hardware development kits that work with existing edge-ML tools. Even though RRAM technology is still at an early stage with a low level of device maturity, this will enable access for early adopters and thus foster the development of neuromorphic computing for the edge-AI community.
This talk will introduce our solution to this challenge. We provide a software and hardware development kit called “Neureka” developed at RWTH Aachen University as part of large-scale neuromorphic computing projects. It consists of the following techniques that will be presented in detail in this talk:

1. Our SDK is built using the LLVM-MLIR compiler framework and thus integrates with the most common AI frameworks such as TensorFlow and PyTorch. It contains custom partitioning, mapping, and scheduling methods to migrate existing edge-ML workloads to neuromorphic CIM architectures. We will introduce the compiler flow, show custom quantization for RRAM technology, and discuss its advantages and limitations.
2. We will present our cycle-accurate architecture simulator and virtual platform to enable the co-optimization of compilers and architectures at an early design stage. We will discuss designs for efficient edge-AI accelerators using RRAM and show our integrated design method.
3. Our hardware development kit connects conventional development boards (e.g., FPGAs) with novel neuromorphic chips containing RRAM crossbars. Our board offers a range of peripherals necessary to control and stimulate RRAM crossbar arrays with a 16-bit Analog/Digital Converter (ADC), allowing us to measure currents from 200 mA down to 0.1 mA. The centerpiece of our board is the large multiplexer matrix that allows for a flexible assignment of the input and output pins towards a socket to test, program, and use RRAM-based ICs.
4. We will show how early adopters can use our development kit to assess neuromorphic computing at the edge. We will provide a demonstrator for letter recognition on a smartwatch, which executes parts of the workload in RRAM.
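To make the idea of quantization for RRAM from item 1 more concrete, the following sketch maps floating-point weights onto a small number of evenly spaced conductance levels, using a differential pair of cells per weight so that the sign is carried by the difference. This is a generic uniform scheme chosen for illustration, not the quantization used in Neureka. The function name, the conductance range, and the number of levels are all assumptions.

```python
import numpy as np

def quantize_to_conductance(weights, g_min, g_max, n_levels):
    """Map weights to (G_pos, G_neg) pairs so that w is proportional to G_pos - G_neg."""
    if n_levels < 2:
        raise ValueError("n_levels must be at least 2")
    w = np.asarray(weights, dtype=float)
    w_max = np.max(np.abs(w))
    if w_max == 0.0:
        w_max = 1.0                                # avoid division by zero for all-zero weights
    step = (g_max - g_min) / (n_levels - 1)        # conductance difference between levels
    level = np.rint(np.abs(w) / w_max * (n_levels - 1)).astype(int)
    g_mag = level * step
    g_pos = g_min + np.where(w > 0, g_mag, 0.0)    # positive weights go on the "plus" cell
    g_neg = g_min + np.where(w < 0, g_mag, 0.0)    # negative weights go on the "minus" cell
    scale = w_max / (g_max - g_min)                # converts G_pos - G_neg back to weight units
    return g_pos, g_neg, scale

# Hypothetical values: four levels between 1 uS and 100 uS.
weights = np.array([0.6, -1.0, 0.0, 0.26])
g_pos, g_neg, scale = quantize_to_conductance(weights, g_min=1e-6, g_max=1e-4, n_levels=4)
reconstructed = (g_pos - g_neg) * scale
print(reconstructed)
```

With only four levels the reconstructed weights differ visibly from the originals, which is why quantization-aware mapping matters when the number of programmable states per cell is small.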
To summarize, this talk will give a comprehensive overview of neuromorphic computing for edge AI applications. We will provide a solution, called “Neureka”, to enable seamless migration of existing workloads to the novel technology. We are convinced that this approach will be one key element in using novel emerging memories in the Tiny-ML community for researchers and early adopters from industry.
References
[1] E. Covi et al. “A Volatile RRAM Synapse for Neuromorphic Computing”. In: 2019 26th IEEE International Conference on Electronics, Circuits and Systems (ICECS). 2019, pp. 903–906.
[2] K.-H. Kim, S. Gaba, D. Wheeler, J. M. Cruz-Albrecht, T. Hussain, N. Srinivasa, and W. Lu. “A Functional Hybrid Memristor Crossbar-Array/CMOS System for Data Storage and Neuromorphic Applications”. In: Nano Letters 12.1 (2012), pp. 389–395.
[3] M. Biglari, T. Lieske, and D. Fey. “High-Endurance Bipolar ReRAM-Based Non-Volatile Flip-Flops with Run-Time Tunable Resistive States”. In: Proceedings of the 14th IEEE/ACM International Symposium on Nanoscale Architectures. 2018.
[4] G. W. Burr et al. “Neuromorphic computing using non-volatile memory”. In: Advances in Physics: X 2.1 (2017), pp. 89–124.
[5] C. D. Schuman et al. “Opportunities for neuromorphic computing algorithms and applications”. In: Nature Computational Science 2.1 (2022).
Schematic Summary of the Content
Part 1: Introduction to neuromorphic computing in edge devices. Demonstration of potential benefits for the Tiny-ML community. Discussion of challenges before mass-market adoption.
Part 2: Requirement for co-optimization of architectures and compilers in neuromorphic edge-AI systems. Specification of frameworks and systems to enable access for early adopters.
Part 3: Our solution, the “Neureka” ecosystem. Introduction to the software development kit (compiler based on LLVM-MLIR), architectures, and simulators. Demonstration using our hardware development kit for letter recognition on a smartwatch.

tinyML EMEA_Jan Moritz Joseph

1:00 pm to 2:00 pm

Lunch & Networking

2:00 pm to 3:15 pm

Demos & Posters

3:15 pm to 3:55 pm

On-Device Learning - Part I

To date, the tinyML community has been successful in creating outstanding initiatives which established on-device Machine Learning inference at an industry level as an essential technology.

Looking forward to the ML technology evolution, experts are increasingly asking how to let tiny devices adapt themselves to the variability of the environments in which on-device inference is deployed, thereby compensating for concept and data drifts.

Session Moderator: Danilo PAU, Technical Director, IEEE and ST Fellow, STMicroelectronics

A framework of algorithms and associated tool for on-device tiny learning

Danilo PAU, Technical Director, IEEE and ST Fellow, STMicroelectronics

Abstract (English)

Imagine needing to detect anomalies in an electrocardiogram robustly, despite variability in working conditions such as where the device is carried and who is being monitored. Consider using heterogeneous sensors to monitor a water distribution system and check its reliability. Suppose there is a need to detect oil leaks using a remote imaging system placed inside a 195-meter-tall wind turbine in the Walney Extension, located in the Irish Sea. Or consider the urgent need to monitor adverse environments affected by sudden climatic changes that can put human lives at risk. Finally, consider applying adaptivity to field motor control.

At first sight these cases seem to have nothing in common. On closer inspection, however, they share the same needs, since they are all examples of cyber-physical systems. They must manage variable data distributions, learn from them because they can hardly be modeled in the laboratory, deploy distributed artificial intelligence closer to the physical world for both sensing and actuation, automate anomaly detection and, more importantly, automatically generate high-level notifications through the cyber world with intermediate processing inside the object world.

It then becomes clear how important it is for the tiny machine learning community to develop intelligent technologies that can learn in the field from constantly changing environments, using tiny embedded devices to realize systems that can be personalized and deployed easily and at massive scale throughout the world.

Today everyone is used to owning sensors. Any IoT or mobile device we buy has sensors (and maybe actuators) inside. However, this approach already limits market opportunities, innovation, and the conception of new applications that could benefit humanity. Indeed, in an interview dated Sept 14, 2015, world-renowned University of California at Berkeley professor Alberto Sangiovanni-Vincentelli affirmed: “In the near future, we are going to have a sensory swarm, a great deal, a great variety of all kinds of heterogeneous sensors that are going to interface the cyber world, the computing world, with the physical world”. And he added: “Sensors will be immersed in the environment”.

All of the above, together with Alberto’s vision, urges the research community to address the goal that the TinyML working group set for On Device Learning (ODL), which is defined as follows: … to make edge devices “smarter” and more efficient by observing changes in the data collected and self-adjusting / reconfiguring the device’s operating model. Optionally the “knowledge” gained by the device is shared with other deployed devices.
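The “observing changes in the data collected” part of this definition can be sketched with a very small drift monitor. The example below is an illustrative assumption, not an algorithm from the talk: it learns a reference mean and standard deviation with Welford's online method, then tracks an exponential moving average of new samples and flags drift when that average moves more than `z_limit` reference standard deviations away. The class name and all parameters are hypothetical.

```python
import math

class DriftMonitor:
    """Constant-memory drift check: reference statistics plus a moving average."""

    def __init__(self, n_ref=50, alpha=0.1, z_limit=3.0):
        if n_ref < 2:
            raise ValueError("n_ref must be at least 2")
        self.n_ref, self.alpha, self.z_limit = n_ref, alpha, z_limit
        self.count = 0        # number of reference samples seen so far
        self.mean = 0.0       # running reference mean
        self.m2 = 0.0         # running sum of squared deviations (Welford)
        self.ema = None       # moving average of post-reference samples

    def update(self, x):
        """Feed one sample; return True while drift is flagged."""
        if self.count < self.n_ref:
            # Reference phase: Welford update of mean and variance.
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)
            return False
        if self.ema is None:
            self.ema = self.mean
        self.ema = (1.0 - self.alpha) * self.ema + self.alpha * x
        std = math.sqrt(self.m2 / (self.count - 1))
        return abs(self.ema - self.mean) > self.z_limit * std

# Synthetic signal: a reference phase, more of the same, then a shifted level.
monitor = DriftMonitor(n_ref=50, alpha=0.1, z_limit=3.0)
for i in range(50):
    monitor.update(float(i % 2))
stationary_flags = [monitor.update(float(i % 2)) for i in range(30)]
shifted_flags = [monitor.update(3.0 + float(i % 2)) for i in range(30)]
print(any(stationary_flags), shifted_flags[-1])
```

The monitor keeps only a handful of scalars, which is the kind of footprint a microcontroller can afford. A flagged drift would then trigger the “self-adjusting” step, which is not shown here.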

In this talk, the technology vision mentioned above will be elaborated from a system and requirements point of view, and some of its machine learning and statistical components will be discussed with specific reference to the algorithms and their implementation on microcontrollers. A common framework of approaches will be introduced, since these are horizontal technologies that can be specialized for each of the above use cases. An example of an industry tool that helps to model these algorithms will be provided, along with further reading to deepen the topic.

tinyML EMEA_Danilo Pau

In Sensor and On-device Tiny Learning for Next Generation of Smart Sensors

Michele MAGNO, Head of the Project-based learning Center, ETH Zurich, D-ITET

Abstract (English)

TinyML encompasses technologies and applications including hardware, algorithms and software capable of performing on-device sensor data analytics at extremely low power, typically in the mW range and below. It’s an exciting area that is pushing IoT developers to build new AI-powered features for intelligent sensors and MCU devices and applications. The process of TinyML involves using neural models that have either been pre-trained by the developers, or that developers train themselves, to equip IoT applications with intelligent capabilities. This is today well consolidated in areas such as Computer Vision, Natural Language Processing, condition monitoring, activity recognition, etc.

Today the majority of TinyML models are trained on powerful servers in the Cloud, and the learning process does not target the limited resources of IoT devices.

Current TinyML models run inference directly on an embedded processor (e.g. a mobile processor or a low-power microcontroller). The TinyML model processes input data (such as images, text, or audio) on-device rather than sending that data to a server and performing the processing there.

Innovations like the TensorFlow Lite framework or X-Cube-AI make it possible to run pre-trained models on microcontrollers, as those models require only a small memory footprint and millions of operations per second. On the other hand, TinyML has so far mainly followed a train-offline-then-deploy assumption, with static models that cannot be adapted to newly collected data without cloud-based data collection and fine-tuning. Therefore those models are not robust against concept drift, which can significantly reduce the accuracy of a TinyML workload under time-varying data conditions. This limitation significantly reduces flexibility and overall performance in real application scenarios, to the point that frequent offline retraining can be required.

A new research trend is based on Continuous Learning (CL), which in principle enables online, serverless adaptation. However, because current techniques are computation- and memory-hungry, they are more suitable for smartphone deployment than for low-power devices in the microwatt or milliwatt range.
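To give a feel for what online, serverless adaptation can look like at a very small memory cost, the sketch below implements an incremental nearest-class-mean classifier: it stores one running mean per class and updates it one sample at a time, so a new class can be added on the device without retraining anything else. This is a deliberately simple stand-in chosen for illustration, not one of the CL algorithms presented in the talk. The class name, labels, and feature values are assumptions.

```python
import numpy as np

class NearestClassMean:
    """Keeps one running mean vector per class; memory grows only with the class count."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.means = {}    # label -> running mean vector
        self.counts = {}   # label -> number of samples seen

    def learn_one(self, x, label):
        x = np.asarray(x, dtype=float)
        if label not in self.means:
            self.means[label] = np.zeros(self.n_features)
            self.counts[label] = 0
        self.counts[label] += 1
        # Incremental mean update: no past samples need to be stored.
        self.means[label] += (x - self.means[label]) / self.counts[label]

    def predict_one(self, x):
        if not self.means:
            raise ValueError("no class has been learned yet")
        x = np.asarray(x, dtype=float)
        return min(self.means, key=lambda c: float(np.sum((x - self.means[c]) ** 2)))

clf = NearestClassMean(n_features=2)
for features, label in [([0.0, 0.0], "idle"), ([0.2, 0.0], "idle"),
                        ([1.0, 1.0], "walk"), ([1.2, 1.0], "walk")]:
    clf.learn_one(features, label)
before = clf.predict_one([0.1, 0.1])

# Later, in the field, a class that was never seen offline is added.
clf.learn_one([3.0, 3.0], "run")
after = clf.predict_one([2.9, 3.1])
print(before, after)
```

Methods that update the weights of a deep network need far more memory and compute than this, which is the gap between smartphone-class and sensor-class deployment that the paragraph above points to.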

Thus, there is a huge opportunity to improve those techniques and optimize them for ultra-low-power tiny devices that can be deployed in sensors, such as the STMicroelectronics ISPU or ARM Cortex-M4 cores.

This talk will present promising preliminary results on CL algorithms running in sensors and on ARM Cortex-M microcontrollers, covering audio data and in-sensor learning from 6-axis inertial sensor data. We will demonstrate the benefits of CL by exploiting the capabilities of the ARM Cortex DSP and of the STMicroelectronics ISPU in sensors.

tinyML EMEA_Michele Magno

3:55 pm to 4:15 pm

Coffee & Networking

4:15 pm to 4:55 pm

On-Device Learning - Part II

Session Moderator: Danilo PAU, Technical Director, IEEE and ST Fellow, STMicroelectronics

Continual On-device Learning on Multi-Core RISC-V MicroControllers

Manuele RUSCI, Embedded Machine Learning Engineer, Greenwaves

Abstract (English)

In recent years, the combination of novel hardware technologies and optimized software tools has contributed to the success stories of many Deep Learning (DL) powered MicroController (MCU) based sensors. “Train-once-deploy-everywhere” has become the most widespread design paradigm: DL models are initially trained on pre-collected data using High Performance Computing facilities and then deployed on MCUs at scale. However, this approach shows severe weaknesses when real-world data differs from training data, causing online mispredictions and failures on deployed sensors. To overcome this issue, future smart sensor devices are expected to continuously adapt their local Deep Learning models, i.e. update the coefficients, based on the data coming from the ever-changing environment. This demands on-device adaptation capabilities, which belong to the Continual Learning domain and are currently out of scope for TinyML devices.
In this context, this talk will first review the system architecture needed to bring efficient Continual Learning methods to low-power sensor platforms. With a focus on multi-core RISC-V MCU systems, we analyze the trade-off between memory, energy consumption and accuracy of back-propagation based learning techniques that make use of Latent Replays (LRs) to prevent the catastrophic forgetting phenomenon. In more detail, we show how to effectively map Latent Replay based learning onto a multi-core device and how to reduce the LR memory requirement by means of low-bitwidth quantization.
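The memory saving from quantized latent replays can be sketched as follows. The buffer below stores intermediate activations as 8-bit codes instead of 32-bit floats (a 4x reduction) and dequantizes them when they are mixed with new samples for a training step. It is a minimal illustration under stated assumptions: non-negative activations with a known maximum `a_max`, uniform 8-bit quantization, and random overwrite once the buffer is full. None of these choices is taken from the talk, and all names are hypothetical.

```python
import numpy as np

class LatentReplayBuffer:
    """Stores latent activations as uint8 codes with one shared scale."""

    def __init__(self, capacity, latent_dim, a_max, seed=0):
        self.codes = np.zeros((capacity, latent_dim), dtype=np.uint8)
        self.labels = np.zeros(capacity, dtype=np.int64)
        self.a_max = float(a_max)     # assumed upper bound of the activations
        self.size = 0
        self.rng = np.random.default_rng(seed)

    def add(self, latent, label):
        x = np.asarray(latent, dtype=float)
        q = np.clip(np.rint(x / self.a_max * 255.0), 0, 255).astype(np.uint8)
        if self.size < len(self.codes):
            idx = self.size
            self.size += 1
        else:
            idx = int(self.rng.integers(len(self.codes)))  # overwrite a random slot
        self.codes[idx] = q
        self.labels[idx] = label

    def sample(self, n):
        if self.size == 0:
            raise ValueError("buffer is empty")
        idx = self.rng.integers(self.size, size=n)
        latents = self.codes[idx].astype(np.float32) * np.float32(self.a_max / 255.0)
        return latents, self.labels[idx]

def mixed_batch(buffer, new_latents, new_labels, n_replay):
    """Concatenate fresh latents with dequantized replays from the buffer."""
    old_latents, old_labels = buffer.sample(n_replay)
    return (np.concatenate([new_latents, old_latents]),
            np.concatenate([new_labels, old_labels]))

originals = np.array([[0.0, 3.0, 6.0], [1.5, 0.0, 4.5]])
buffer = LatentReplayBuffer(capacity=4, latent_dim=3, a_max=6.0)
buffer.add(originals[0], 0)
buffer.add(originals[1], 1)
new_x = np.array([[2.0, 2.0, 2.0], [5.0, 1.0, 0.5]], dtype=np.float32)
new_y = np.array([2, 2], dtype=np.int64)
batch_x, batch_y = mixed_batch(buffer, new_x, new_y, n_replay=2)
print(batch_x.shape, batch_y.shape)
```

Only the layers after the replay point would be trained on such a batch, so old-class knowledge is rehearsed without storing raw inputs.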
In the second part of the talk, we focus on the on-device learning task and present PULP-TrainLib, the fastest compute library to run backpropagation based learning on MCUs. The library includes a set of parallel software DNN primitives for the execution of the forward and backward steps. Results on an 8-core RISC-V MCU show that our auto-tuned primitives improve MAC/clk by up to 2.4× compared to “one-size-fits-all” matrix multiplication, achieving up to 4.39 MAC/clk, which is 36.6× better than a commercial STM32L4 MCU executing the same DNN layer training workload. Furthermore, our strategy proves to be 30.7× faster than AIfES, a state-of-the-art training library for MCUs, while training a complete TinyML model.
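The throughput figures quoted above can be put side by side with a little arithmetic. The sketch below derives the values they imply for the comparison points. These derived numbers are rough estimates only: the abstract states each figure as an “up to” peak, and it is an assumption here that the peaks refer to comparable workloads.

```python
# Figures quoted in the abstract.
peak_mac_per_clk = 4.39      # 8-core RISC-V MCU with auto-tuned primitives
gain_vs_stm32l4 = 36.6       # stated advantage over the STM32L4 MCU
gain_vs_generic_matmul = 2.4 # stated advantage over "one-size-fits-all" matmul
n_cores = 8

# Implied values (estimates under the assumption stated in the lead-in).
implied_stm32l4 = peak_mac_per_clk / gain_vs_stm32l4
implied_generic_matmul = peak_mac_per_clk / gain_vs_generic_matmul
per_core = peak_mac_per_clk / n_cores

print(f"implied STM32L4 throughput:        {implied_stm32l4:.2f} MAC/clk")
print(f"implied generic matmul throughput: {implied_generic_matmul:.2f} MAC/clk")
print(f"peak throughput per core:          {per_core:.2f} MAC/clk")
```

Read this way, the gain comes from two sources: parallel execution across the eight cores and primitives tuned to each layer shape.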

Lastly, the talk will conclude by discussing future directions and open challenges for on-device learning in the MCU domain from an application perspective.

tinyML EMEA_Manuele Rusci

On-device continuous event-driven deep learning to avoid model drift

Bijan MOHAMMADI, CSO, Bondzai

Abstract (English)

Keywords: Artificial Intelligence IoT (AIoT), on-device learning, model drift, event-driven vs. intent-driven learning.

When model performance degrades, intent-driven solutions based on TinyML are attractive because they ease the incorporation of new data into the models, as well as testing and flashing the models onto devices. These solutions address the difficulty embedded engineers face in keeping models up to date. However, intent-driven solutions are bound by the necessary user interventions, by the back and forth between remote learning, testing and deployment, and by the induced latencies. We need to remove the need for the user to collect data and architect new training runs. Data-driven learning should be ‘event’-driven, not ‘intent’-driven.

Davinsy, Bondzai’s autonomous machine learning system, continuously learns from real-time incoming data. This strictly removes the possibility of model drift with respect to the data. This is a ‘must-have’ when targeting industrial applications, where variability is the rule, not the exception, which makes it necessary to take real live data into account through an event-driven architecture, without user intervention.

One important situation that governs industrial applications is that errors in learning data can appear, even if initially absent, because a given system’s functioning and environment can change over time. This can be because the system ages or simply because its noisy environment changes. This means, again, that errors in data should be thought of not as the exception but as the rule. Consider a voice command device in a noisy environment: such a change means that an ML model built on a static dataset will necessarily drift, regardless of how large the dataset is. Continuous learning is the only way to keep ML models from drifting, by continuously adapting them to the current environment. When the environment is changing fast, this needs to be done with zero latency, which is clearly incompatible with any intent-driven strategy based on remote learning with separate testing and on-device evaluation. User interventions or iterative procedures are off the table.
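The voice-command example can be illustrated with a toy detector. The sketch below compares a fixed energy threshold, tuned once for a quiet room, with a detector that keeps adapting its noise-floor estimate from every incoming frame. It is a generic illustration of static versus continuously adapted behavior and has no relation to how Davinsy works internally. The function names, energy values, and parameters are assumptions.

```python
def static_detector(energies, threshold):
    """Flags a frame whenever its energy exceeds a threshold fixed at design time."""
    return [e > threshold for e in energies]

def adaptive_detector(energies, initial_floor, alpha=0.05, margin=3.0):
    """Flags a frame when its energy exceeds margin times a tracked noise floor."""
    floor = initial_floor
    flags = []
    for e in energies:
        flags.append(e > margin * floor)
        # Update the noise-floor estimate from every frame (event-driven adaptation).
        floor = (1.0 - alpha) * floor + alpha * e
    return flags

# Synthetic frame energies: a quiet room, then a noisy one with one spoken command.
quiet = [1.0] * 100
noisy = [4.0] * 100
noisy[80] = 20.0                     # the only real event in the noisy segment
signal = quiet + noisy

static_flags = static_detector(signal, threshold=3.0)
adaptive_flags = adaptive_detector(signal, initial_floor=1.0)

print("static detections in noisy segment:  ", sum(static_flags[100:]))
print("adaptive detections in noisy segment:", sum(adaptive_flags[100:]))
```

In this synthetic run the static detector flags every frame once the background noise rises, while the adaptive one raises a few false alarms right after the change and then reports only the real command. Collecting a larger quiet-room dataset would not have helped the static detector.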
A consequence of event-driven on-device learning is that bias in data will come solely from failures in the local acquisition device and will not propagate outside it. On the other hand, if a dataset is biased, every model built using this set will be contaminated. As a consequence, event-based on-device continuous learning not only avoids model drift but also reduces the impact of data bias.
Davinsy considers data streams (i.e., continuous flows of incoming data, which can be IMU, audio, image and so forth). Davinsy exploits a new concept of Virtual Models as a new programming method proposed in the Application Services Layer (ASL) to override the inner AI engine, called DALE (Deeplomath Augmented Learning Engine), and achieve polymorphism in the created local models. If the structure changes and requires adapting the model to a new situation, those models are rebuilt automatically by directly learning the structure extracted by the data pre-processing functions available in the Davinsy library. The system is cross-platform software interfaced with most embedded environments. Davinsy is built around a few major functional blocks and an innovative AI engine called DALE.

We will illustrate the ingredients above through voice-control applications with Davinsy Voice, an instantiation of the Davinsy system for voice-activated front-end systems. The voice application is based on the “speech to intent” principle and packaged as a cross-platform library. Implementations are available on different host platforms such as the STMicroelectronics IoTNode and SensorTile evaluation boards and the Raspberry Pi 3B+.

4:55 pm to 5:15 pm