tinyML EMEA Innovation Forum - June 24-26, 2024

Amplifying Impact – Unleashing the Potential of TinyML

About

The tinyML EMEA Innovation Forum is accelerating the adoption of tiny machine learning across the region by connecting the efforts of the private sector with those of academia in pushing the boundaries of machine learning and artificial intelligence on ultra-low powered devices.

Venue

Allianz MiCo • Milano Convention Centre

Piazzale Carlo Magno 1/Gate 16, 20149 Milano MI Italy

Contact us

Rosina Haberl

The 2024 program will highlight:

  • Impact Across Industries
  • Impactful Technical Advancements
  • Positive Change Track: TinyML for Good
  • Distributed Solution
  • Standardization and Benchmarks
  • Collaborative Impact Initiatives
  • Innovation Showcase

 

Schedule

8:00 am to 9:00 am

Registration

9:00 am to 9:15 am

Welcome

Session Moderator: Hajar MOUSANNIF, Associate Professor, Cadi Ayyad University, Morocco

9:15 am to 10:00 am

Keynote by Alessandro Cremonesi - STMicroelectronics

10:00 am to 10:30 am

Hardware Acceleration and Model Compression

Session Moderator: Manuel ROVERI, Associate Professor, Politecnico di Milano

Network Projection for AI Model Compression Applied to Embedded Acoustic Sensing

Antoni WOSS, Senior Software Developer, MathWorks

Abstract (English)

Deploying increasingly large and complex deep learning networks onto resource-constrained devices is a growing challenge facing many AI practitioners, especially within the domain of audio and acoustic applications. Modern deep neural networks, which are integral to advancing state-of-the-art signal processing algorithms, typically require high-performance processors and/or GPUs due to their extensive number of learnable parameters. As these large AI models set new benchmarks for quality and functionality, they simultaneously push the boundaries of what can be embedded into real-time systems and edge devices. Consequently, engineers today are faced with the critical task of reconciling the complexity of these networks with the stringent resource limitations of portable devices and low-power sensors, while ensuring real-time performance without compromising accuracy. Deploying these powerful deep learning models onto edge devices often requires compressing them to reduce runtime memory and inference times, while attempting to retain high accuracy and model expressivity.

In this talk, we introduce a new technique, network projection, and use it to compress and embed a sound-based machine health classification network. Network projection leverages in-distribution data to elicit a neural response from network activations. This approach then analyzes the covariances between these neural responses and modifies layer operations to work within a lower-rank projective space, which, in turn, significantly reduces the quantity of learnable parameters. Although the operations occur within a lower-rank projective space, the layers maintain a high level of expressivity because the width (that is, the number of neural activations) is preserved and matches that of the original network architecture. The application of the projection technique differs between recurrent and non-recurrent layers, but it can be applied to both (e.g., fully connected, convolutional and LSTM layers) and can be used in place of or in addition to pruning and quantization.
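
As a rough illustration of the idea described above, the following sketch projects a single fully connected layer onto the principal subspace of its input activations. It is a PCA-style approximation written for this summary, not the MathWorks implementation; the function name `project_dense_layer`, the synthetic data, and the chosen rank are all assumptions.

```python
import numpy as np

def project_dense_layer(W, X, rank):
    """Factor a dense layer y = W @ x through a rank-`rank` subspace.

    W: (n_out, n_in) weights. X: (n_samples, n_in) in-distribution inputs.
    Returns (A, B) with A: (n_out, rank) and B: (rank, n_in),
    so that y is approximated by A @ (B @ x).
    """
    # Covariance of the activations that feed this layer.
    cov = np.cov(X, rowvar=False)
    # eigh returns eigenvalues in ascending order; keep the top `rank` directions.
    _, vecs = np.linalg.eigh(cov)
    Q = vecs[:, -rank:]              # (n_in, rank), orthonormal columns
    return W @ Q, Q.T

rng = np.random.default_rng(0)
n_in, n_out, rank = 64, 32, 8

# Synthetic activations that lie in an 8-dimensional subspace (an assumption
# made so that the projection is exact in this toy example).
basis = rng.standard_normal((rank, n_in))
X = rng.standard_normal((1000, rank)) @ basis
W = rng.standard_normal((n_out, n_in))

A, B = project_dense_layer(W, X, rank)
original_params = W.size
projected_params = A.size + B.size
max_error = np.max(np.abs(X @ W.T - (X @ B.T) @ A.T))
print(original_params, projected_params, max_error)
```

The output width `n_out` is unchanged, which mirrors the point about preserving the number of activations; only the number of stored weights shrinks. On real data the activations are only approximately low-rank, so the rank becomes a trade-off between size and accuracy.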

In this talk, attendees will learn how to:
• Compress a deep learning model using a novel technique – network projection.
• Integrate and validate the compressed AI model into a larger MATLAB system design.
• Automatically generate embeddable C/C++ implementations for AI-powered designs.

10:30 am to 11:00 am

Break & Networking

11:00 am to 11:25 am

Hardware Acceleration and Model Compression - Part II

Session Moderator: Manuel ROVERI, Associate Professor, Politecnico di Milano

Hard Contextual Parameter Sharing: A Multi-task Learning Approach for tinyML

Michael GIBBS, PhD student in the Smart Sensing Lab, Nottingham Trent University

Abstract (English)

There is an increasing need to deploy complex models on resource-constrained devices such as microcontrollers in real-world settings. However, multiple complex networks often exceed the limited memory and compute capabilities of these devices. This work proposes a novel Hard Contextual Parameter Sharing (HCPS) optimisation to enable efficient deployment of multiple neural network models on highly resource-constrained microcontroller devices. The approach deconstructs similar neural networks into shared layers that extract common features, and specialised layers that retain task-specific representations. By reusing network layers across models, the overall parameter count and storage requirements are diminished, enabling the implementation of more complex solutions. Practical demonstrations on an Arduino Portenta H7 microcontroller board yielded a 49.5% reduction in storage burden. As the number of shared layers increases, we find a corresponding increase in loss within the newly constructed model.
The proposed methodology enhances traditional multi-task learning for edge computing by selectively sharing parameters across tasks while maintaining context awareness. HCPS divides a large network into smaller ones, sharing specific parts via hard parameter sharing on early layers to extract universally relevant low-level features. This approach mitigates overfitting risks, enhancing generalisation across tasks, and improves scalability without significantly increasing parameters as tasks are added. The outcome is an ensemble of specialised, deployable models, reducing storage and computational demands compared to a single heavy model.

An additional advantage of HCPS is the ability to deploy new specialised models without retraining the entire network. In contrast to traditional multi-task learning, shared layers can serve multiple specialised models without the models being trained simultaneously. The proposed approach finds application in diverse scenarios, such as energy management that adapts to seasonal patterns or context-based driver assistance systems that adjust to varying driving conditions.

Despite successful demonstrations, limitations persist, especially in determining which layers to share. Two proposed approaches involve sharing only initial layers for high accuracy or sharing the majority with few specialised layers at the end for maximal storage savings. Assumptions include task-related similarities and non-overlapping classes, ensuring shared layers benefit performance. As machine learning extends to embedded devices, this study emphasises sustainable computing practices, illustrating how careful architecture and parameter sharing balance intelligence with resource constraints, contributing to reduced power consumption and enhanced environmental sustainability.

The contributions of this work are as follows:

• A novel deconstructed, shared neural network approach utilising contextual information to reduce the storage occupancy and computational burden of multiple deep learning models.

• Improved efficiency by reducing redundancy through selective neural network layer sharing, enabling deployment on ultra-low powered microcontrollers.

• Enhanced adaptability compared to traditional multi-task learning, as specialised sub-networks can be added without retraining the entire network, providing flexible expandability.

• A practical tinyML implementation of the proposed approach using two widely recognised datasets illustrating the viability of the proposed method.

This modular ensemble model balances performance, adaptability, and sustainability under tight memory and power limits. The method is evaluated on an Arduino Portenta H7 microcontroller using MNIST and Fashion-MNIST datasets for digit and clothing classification tasks. The shared model combines common convolutional filters to reduce redundancy between networks by 49.5%, while final task-specific dense layers preserve 97.9% and 74.4% accuracy respectively. HCPS advances beyond conventional multi-task learning for edge devices by decomposing models into selective shared and specialised components to maximise on-device intelligence under constrained resources.
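
To make the storage argument concrete, the sketch below counts parameters for several small models stored separately versus stored with one shared convolutional backbone and per-task dense heads. The layer sizes are hypothetical and chosen only for illustration; they are not the authors' architecture, so the saving it prints will not match the 49.5% reported above.

```python
def conv_params(in_ch, out_ch, k):
    # k x k kernels plus one bias per output channel
    return in_ch * out_ch * k * k + out_ch

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output
    return n_in * n_out + n_out

# Hypothetical layer sizes (assumptions, not taken from the abstract).
shared_backbone = conv_params(1, 8, 3) + conv_params(8, 16, 3)
task_head = dense_params(16, 10)   # 10-class head on 16 pooled features

def separate_models(n_tasks):
    # Each task stores its own backbone and its own head.
    return n_tasks * (shared_backbone + task_head)

def hcps_models(n_tasks):
    # The backbone is stored once; only the heads are duplicated.
    return shared_backbone + n_tasks * task_head

for n in (2, 3):
    saved = 1 - hcps_models(n) / separate_models(n)
    print(f"{n} tasks: {separate_models(n)} -> {hcps_models(n)} parameters ({saved:.1%} saved)")
```

Because the backbone is counted once, the relative saving grows as tasks are added, which is consistent with the scalability claim in the abstract.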

11:25 am to 12:25 pm

Posters and Demos

12:25 pm to 1:25 pm

Lunch & Networking

1:25 pm to 2:10 pm

Keynote by Vijay Janapa Reddi - Harvard University

Session Moderator: Martin CROOME, Vice President Marketing, GreenWaves

2:10 pm to 3:05 pm

Neuromorphic Computing

Session Moderator: Eiman KANJO, Provost’s Visiting Professor in Pervasive Sensing and tinyML, Imperial College London

Spiking Neural Processor T1: An ultra-low power neuromorphic microcontroller for the sensor edge

Katarzyna KOZDON, Staff Neuromorphic Engineer and Program Manager, Innatera

Abstract (English)

We present the world’s first ultra-low power neuromorphic microcontroller – the Spiking Neural Processor T1. Neuromorphic hardware adopts an approach inspired by the function and structure of the brain. A representation that mimics the brain particularly closely is the Spiking Neural Network (SNN), a class of event-based neural networks in which information is represented using precisely timed events (spikes). SNN models work by manipulating the timing relationships between these events, and leveraging temporal correlations between events to identify patterns in the data. Key to these capabilities is the inherent notion of time built into the neurons and synapses of the SNNs. The time-varying states of neurons and synapses enable powerful temporal processing to be carried out even with small models, with sparse and efficient event-based communication between computing elements. SNNs enable rapid recognition of patterns in sensor data, in addition to complex signal processing, just like the brain does.

The T1 system-on-chip enables powerful neuromorphic processing of sensor data within a single always-on component. Integrating a triad of processing elements, the T1 incorporates a spiking compute engine for SNNs, an accelerator for CNNs, and a light-weight RISC-V CPU to provide application developers with a heterogeneous platform for sensor data processing. The mixed signal spiking compute engine enables SNN inference within a sub-milliwatt power envelope, for always-on low-latency processing of data streams. The architecture allows the capabilities of SNNs to be blended with conventional non-spiking neural networks to realize a broad range of application capabilities within the same device. Both accelerators are highly customizable in terms of the parameters and connectivity they support. As a companion to sensors, the T1 incorporates a RISC-V core that enables handling of multiple sensors, marshalling of data, as well as conditioning and pre-/post-processing of sensor data and inference results.

The T1 integrates an ample amount of memory for complex workloads at the sensor edge, and incorporates a diverse set of interfaces – QSPI, I2C, I2S, UART, JTAG, and GPIOs, in addition to a front-end ADC – ensuring compatibility with a vast range of sensors. Its 35-lead WLCSP package of 2.16 x 3 mm allows it to be integrated into the most compact of edge applications, right next to the sensor. The T1 effectively serves as the first and only chip a sensor needs to interface with, allowing actionable insights to be derived from raw data, right at the sensor edge.

The Talamo Software Development Kit serves as the gateway for application developers to use the T1. Enabling access to the novel capabilities brought by the T1’s mixed signal computing fabric, Talamo simplifies the process of application development based on SNNs through its integration with PyTorch. This enables comprehensive development, optimization, and deployment of SNN and CNN models onto the T1 without having to deal with the complexity of its underlying compute architecture. An easy-to-use pipeline construction API and model zoo included within Talamo simplify the creation of end-to-end applications.

The T1 evaluation kit is now available to early customers for application trials.

As a follow-up to our previous presentations at tinyML events, we will walk through the architecture of our new T1 chip, and show how the chip and the Talamo SDK come together to enable power-efficient AI at edge applications. We will illustrate the above through an application example, and share insights into the power-performance of the Spiking Neural Processor platform in a context relevant to the tinyML community.

In-Glasses Eye-Tracking with DVS camera, Tiny Memory-Efficient Model for Neuromorphic Computing and Digital Processors

Michele MAGNO, Head of the Project-based learning Center, ETH Zurich, D-ITET

Abstract (English)

This presentation introduces an innovative neuromorphic methodology for eye-tracking, utilizing pure event data captured by a Dynamic Vision Sensor (DVS) camera. Our approach integrates a directly trained Spiking Neural Network (SNN) regression model and capitalizes on the cutting-edge, low-power edge neuromorphic processor – Speck. This combination aims to significantly enhance the precision and efficiency of eye-tracking systems.

Initially, we present the ‘Ini-30,’ a novel event-based eye-tracking dataset collected from thirty volunteers using two DVS cameras mounted on glasses. We then describe our SNN model, named ‘Retina’, which is based on Integrate-and-Fire (IF) neurons. Remarkably, ‘Retina’ has only 64k parameters – 6.63 times fewer than the latest models – yet achieves a pupil tracking error of just 3.24 pixels on a 64×64 DVS input. The continuous regression output is derived by applying a non-spiking temporal 1D filter across the output spiking layer.
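
The combination of Integrate-and-Fire neurons and a non-spiking temporal filter on the output can be sketched in a few lines. This is a minimal illustration, not the ‘Retina’ model: the reset-by-subtraction rule, the threshold, and the filter kernel values are assumptions chosen to keep the example small.

```python
def integrate_and_fire(inputs, threshold=1.0):
    """Non-leaky IF neuron: accumulate input, emit a spike at threshold.

    Reset by subtraction is an assumption made for this sketch.
    """
    v, spikes = 0.0, []
    for x in inputs:
        v += x
        if v >= threshold:
            spikes.append(1)
            v -= threshold
        else:
            spikes.append(0)
    return spikes

def temporal_filter(spikes, kernel):
    """Causal 1D FIR filter over a spike train, giving a continuous output."""
    out = []
    for t in range(len(spikes)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * spikes[t - k]
        out.append(acc)
    return out

# A constant input of 0.5 per step makes the neuron fire every second step.
spikes = integrate_and_fire([0.5] * 8)
# Placeholder kernel: a 4-step moving average of the spike train.
rate = temporal_filter(spikes, kernel=[0.25] * 4)
print(spikes, rate)
```

The filter turns a binary spike train into a smooth value that can be regressed against a target such as a pupil coordinate, which is the role the abstract assigns to the temporal 1D filter.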

Furthermore, we evaluate ‘Retina’ on the neuromorphic processor, demonstrating an end-to-end power consumption of only 2.89-4.8 mW and a latency of 5.57-8.01 ms, depending on the time window. We benchmark our model against the latest event-based eye-tracking method, ‘3ET’, built on event frames. The results indicate that ‘Retina’ not only achieves higher precision, with 1.24px less error in pupil centroid estimation, but also boasts a significantly reduced computational complexity, requiring 35 times fewer Multiply-Accumulate (MAC) operations.

Additionally, I will present ‘DigitalRetina’, a TinyML model based on Yolo, optimized for minimal memory usage and evaluated on the GreenWaves GAP9 processor. This talk will offer a comprehensive comparison of the two approaches – neuromorphic and digital – in terms of energy efficiency, latency, and other critical metrics.

A paper currently under review, with more information, is available at https://arxiv.org/abs/2312.00425

3:05 pm to 3:35 pm

Break & Networking

3:35 pm to 4:50 pm

Efficiency and Optimization in tinyML

Session Moderator: Valeria TOMASELLI, Senior Engineer, STMicroelectronics

tinyCLAP: distilling language-audio pretrained models

Francesco PAISSAN, Junior Researcher, Fondazione Bruno Kessler (FBK)

Abstract (English)

Contrastive Language-Audio Pretraining (CLAP) [1] and its image counterpart, CLIP [2], have proved to be effective techniques for pretraining audio and image encoders. In particular, CLAP and some of its variants [3, 4] achieved state-of-the-art performance for sound event detection, and also showed impressive performance in Zero-Shot (ZS) classification. However, their main limitations are the considerable amount of data required for training and the overall computational complexity during inference. In this presentation, we will review how the complexity of contrastive language-audio pre-trained models can be reduced, yielding an efficient model called tinyCLAP. We derive a unimodal distillation loss from first principles and explore how the dimensionality of the shared, multimodal latent space can be reduced via pruning. tinyCLAP uses only 6% of the original Microsoft CLAP parameters with a minimal reduction (less than 5%) in zero-shot classification performance across the three sound event detection datasets on which it was tested. Specifically, since the model capacity needed for learning the correlations between audio and text is high, the CLAP audio and text encoders are not suited for fast and low-footprint inference. To simplify the pipeline, we employ knowledge distillation [7] and pruning [8], as they have proved to be effective techniques for learning smaller models that inherit the representation capabilities of the teacher model. Standard knowledge distillation is not suited for the CLAP audio encoder because CLAP has no soft labels: classification is performed using the learned similarity score, so the number of classes is not selected a priori. We will present how the knowledge distillation loss for CLAP can be formulated to preserve the similarity score between text and audio. We show that, with this formulation, the distillation and pruning strategies can work with audio samples without text.
Finally, we will showcase the zero-shot classifier on an ARM Cortex-M7-based board, analysing its benefits with respect to a standard classifier and its complexity-performance tradeoff.
Attached paper: https://arxiv.org/pdf/2311.14517.pdf
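
The sketch below illustrates the two ideas in this abstract with toy vectors: zero-shot classification by similarity between an audio embedding and text embeddings, and a distillation loss that only needs audio. The loss shown (one minus the cosine similarity between student and teacher audio embeddings) is one plausible form and an assumption here; the exact formulation is given in the attached paper. All embedding values and labels are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(audio_emb, text_embs):
    """Pick the label whose text embedding is most similar to the audio embedding."""
    return max(text_embs, key=lambda label: cosine(audio_emb, text_embs[label]))

def unimodal_distillation_loss(student_audio_emb, teacher_audio_emb):
    """Zero when the student embedding points in the teacher's direction.

    No text is involved, so unlabeled audio is enough to compute it.
    """
    return 1.0 - cosine(student_audio_emb, teacher_audio_emb)

# Toy 3-dimensional embeddings; real CLAP embeddings are much larger.
text_embs = {"dog bark": [1.0, 0.1, 0.0], "siren": [0.0, 1.0, 0.2]}
teacher = [0.9, 0.2, 0.1]
student = [1.8, 0.4, 0.2]   # same direction as the teacher, different scale

loss = unimodal_distillation_loss(student, teacher)
label = zero_shot_classify(student, text_embs)
print(loss, label)
```

If the student's audio embedding is aligned with the teacher's, its cosine similarity to every text embedding is the same as the teacher's, so the zero-shot decision is preserved without the text encoder taking part in distillation.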

Optimizing Vision Transformers: A Novel Neuron Leveraging Max and Min Operations for Aggressively Prunable DNNs

Philippe BICH, PhD Student, Politecnico di Torino

Abstract (English)

Deep Neural Networks (DNNs) are structures capable of solving complex tasks with the use of a massive number of trainable parameters. These parameters are so numerous that they are largely redundant. Pruning is a necessary operation to achieve low-power and lightweight inference, which is fundamental in the field of Tiny Machine Learning (TinyML) for the implementation of neural networks on mobile devices with limited battery capacity and constrained computational capabilities. Classical pruning approaches in the literature [1] simply leverage methods to select the interconnections or entire neurons to be pruned in a DNN without modifying its inherent structure, i.e. neurons based on the typical Multiply-and-ACcumulate (MAC) paradigm. In standard MAC-based neurons, the output is computed by first modulating inputs independently of each other (map operation), then by aggregating the outcomes into a single quantity (reduce operation), and finally by reshaping the value through an activation function. For these classical neurons, map is multiply, while reduce is accumulate.
In a recent work [2], we introduced an alternative neuron structure whose adoption in a DNN yields much better results when the entire network is pruned, using virtually any pruning method available in the literature. More specifically, whereas the map-reduce paradigm adopted in a typical neural network is based on a MAC operation, we propose to substitute accumulate with a maximum plus minimum operation, obtaining a different structure for the neurons. We call this novel map-reduce paradigm Multiply-And-Max/min (MAM).
The advantage of using this neuron (which we discuss in detail in [3]) is a much higher prunability of the neural network, which can be aggressively simplified by removing redundant interconnections while retaining its original performance.
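
The difference between the two map-reduce paradigms can be shown on a single neuron. This is a minimal sketch of the reduce step as described above, with the activation function omitted; the bias term and the input values are assumptions.

```python
def mac_neuron(inputs, weights, bias=0.0):
    # map: multiply each input by its weight; reduce: accumulate all products
    products = [w * x for w, x in zip(weights, inputs)]
    return sum(products) + bias

def mam_neuron(inputs, weights, bias=0.0):
    # same map step; reduce keeps only the largest and the smallest product
    products = [w * x for w, x in zip(weights, inputs)]
    return max(products) + min(products) + bias

x = [1.0, 2.0, -1.0, 0.5]
w = [0.5, -1.0, 2.0, 4.0]   # products: 0.5, -2.0, -2.0, 2.0

mac_out = mac_neuron(x, w)
mam_out = mam_neuron(x, w)
print(mac_out, mam_out)
```

In the MAM neuron only two products contribute to the output for a given input, which gives some intuition for why many interconnections could be removed with little effect.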

A Paradigm Shift From Imaging to Vision: Oculi Enables 600x Reduction in Latency-Energy Factor for Visual Edge Applications

Charbel RIZK, Founder and CEO, Oculi

Abstract (English)

Remarkable progress has been achieved in the field of artificial intelligence, particularly in the extensive use of deep neural networks, which have significantly enhanced the reliability of face detection, eye tracking, hand tracking, and people detection. However, performing these tasks still demands substantial computational power and memory resources, making it a resource-intensive endeavor that remains to be solved. Consequently, power consumption and latency pose significant challenges for many systems operating in always-on, edge applications.

The OCULI SPU (Sensing and Processing Unit), ideal for TinyML vision applications, is an intelligent, programmable vision sensor that can be configured dynamically to output select data in various modes depending on the needs of the use case. These modes include images or video, polarity events, smart events, and actionable information, which make the vision sensor efficient. Moreover, the SPU allows real-time programmability of spatial and temporal resolution, as well as dynamic range and bit depth. By enabling continuous optimization, computer/machine vision solutions deploying the OCULI SPU, in lieu of imaging sensors, can reduce the latency-energy factor by more than 600x at a fraction of the cost. The smart events and actionable information output modes are unique to the Oculi vision sensor.

To showcase tinyML capabilities, Oculi participated in the tinyML Hackathon 2023: Pedestrian Detection. Our initial results demonstrated an always-on solution (24/7) with a latency of less than 4 ms that consumes only 3 Wh in total over a whole year, equivalent to a single AA battery.
Because the OCULI SPU is fully programmable, the solution can be dynamically optimized between latency and power consumption. It will enable the first truly wireless battery-operated always-on vision products in the market. The presentation will provide an overview of Oculi’s novel vision architecture for edge applications, and will include key latency and energy results for multiple use cases of interest to the TinyML community, including presence/people/pedestrian/object, face, hand, and eye detection. Our results will also include a comparison with alternative or conventional solutions, demonstrating the significant advantages of a paradigm shift from imaging to vision for visual edge applications.
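
For scale, the yearly energy figure quoted above can be converted to an average power draw. The short calculation below uses only the 3 Wh per year stated in the abstract and assumes a 365-day year of continuous operation.

```python
ENERGY_PER_YEAR_WH = 3.0          # from the abstract
HOURS_PER_YEAR = 365 * 24         # assumes a non-leap year, always on

average_power_w = ENERGY_PER_YEAR_WH / HOURS_PER_YEAR
average_power_mw = average_power_w * 1000
print(f"average power: {average_power_mw:.3f} mW")
```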

Structural Health Monitoring at the Edge for Wind Turbine Blades Using Pressure Sensors

Denis MIKHAYLOV, Researcher, D-ITET, ETH Zürich

Abstract (English)

Wind energy is an increasingly important component of the transition to renewable energy and away from fossil fuels. Technological and physical limitations have led to wind turbines increasing in size and height in recent years. Larger turbines with larger blades allow more energy to be produced more consistently, but come with manufacturing and maintenance challenges. In particular, regular visual inspections are necessary to detect defects or damage to the blades such as delamination, debonding or cracks, and damage due to extreme weather. However, the trend towards tall, offshore wind turbines that are difficult to access makes remote monitoring solutions desirable, which would reduce the need for visual inspections, improve safety, and increase the uptime of the turbines. Numerous approaches have been taken to monitoring wind turbines; until now, however, most have focused on modal analysis and vibration data [1]. Other approaches, such as acoustic monitoring using microphones or visual inspections using drones, have also been investigated, but currently there is no clearly optimal solution to monitoring and detecting faults. Therefore, significant challenges remain in functionality, reliability, cost, and ease of deployment [2]. To address these challenges, a data acquisition system was developed [3] that can be adhesively fitted to the surface of wind turbine blades. It features 40 MEMS barometers, 10 microphones, 5 differential pressure sensors and a MEMS IMU. A TI CC2652P microcontroller running at 48 MHz with an ARM Cortex M4F core is used for acquisition, power management and processing, while a BLE connection with a bandwidth of 1.2 Mbps enables connectivity for data transmission and device configuration. A flexible solar panel provides power that is stored in a battery. The entire system is less than 4 mm thick, allowing it to be mounted to the surface of the turbine blades without significantly affecting aerodynamic performance.
This system has already been successfully deployed on a wind turbine [3] as shown in Figure 1, where data is acquired in periodic 10-minute acquisition windows every two hours, before being transmitted to a remote server for further processing and analysis. The deployed system does not perform any processing onboard. This imposes significant latency on the system and consumes large amounts of energy due to the amount of raw sensor data that must be transmitted, generated at a rate of 4.2 Mbps.
More recently, investigations have been conducted in a wind tunnel to allow data from the sensors to be collected in controlled conditions. The test setup for this work is shown in Figure 2 (a). An airfoil is placed on a metal cantilever beam and a crack with increasing size (in 5 mm increments from 5 to 20 mm) is sawn into the cantilever to simulate a blade fault. The system is excited using a motor at the opposite end of the blade to better simulate the effects of turbulence and the system is exposed to different wind speeds from 10 to 20 m/s. In the healthy case with no crack, some experiments are also performed with an additional heaving mass to simulate different load and weather conditions.
Using this data, a tinyML classifier that classifies crack size was trained and deployed to the Aerosense system. The classifier uses only the pressure data from the 40 barometers and is trained using the open-source XGBoost library, before being converted to C using m2cgen and deployed on the MCU. It uses 4 simple statistical features (skew, mean, kurtosis, variance) and one energy feature based on 512-sample windows of data (approximately 5 s at 100 Hz). The energy feature approximates the energy of the modal frequency and is calculated using a narrow FIR bandpass filter tuned to the modal frequency with 200 taps followed by a Hann window. An accuracy of 83% is achieved when using a 66-33 training-test split and stratified 10-fold cross-validation.
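
A feature extractor along these lines could look like the sketch below. The window length, sampling rate and tap count come from the abstract; the modal frequency (10 Hz), the filter bandwidth, the windowed-sinc filter design, and the interpretation of "followed by a Hann window" as weighting the filtered samples before summing their squares are all assumptions made for this illustration, not details of the Aerosense implementation.

```python
import numpy as np

FS = 100.0        # sampling rate in Hz (from the abstract)
WINDOW = 512      # samples per window (from the abstract)
TAPS = 200        # FIR length (from the abstract)
MODAL_HZ = 10.0   # hypothetical modal frequency, not given in the abstract
HALF_BW = 1.0     # hypothetical half bandwidth in Hz

def bandpass_taps(f_lo, f_hi, n_taps, fs):
    """Windowed-sinc band-pass FIR built as the difference of two low-pass filters."""
    n = np.arange(n_taps) - (n_taps - 1) / 2

    def lowpass(fc):
        return 2 * fc / fs * np.sinc(2 * fc / fs * n)

    return (lowpass(f_hi) - lowpass(f_lo)) * np.hamming(n_taps)

def extract_features(x):
    """Return [skew, mean, kurtosis, variance, modal energy] for one window."""
    mean = x.mean()
    var = x.var()
    z = (x - mean) / np.sqrt(var)
    skew = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4)          # plain (non-excess) kurtosis
    taps = bandpass_taps(MODAL_HZ - HALF_BW, MODAL_HZ + HALF_BW, TAPS, FS)
    filtered = np.convolve(x, taps, mode="same")
    energy = np.sum((filtered * np.hanning(len(filtered))) ** 2)
    return np.array([skew, mean, kurtosis, var, energy])

t = np.arange(WINDOW) / FS
features_modal = extract_features(np.sin(2 * np.pi * MODAL_HZ * t))
features_other = extract_features(np.sin(2 * np.pi * 30.0 * t))
print(features_modal[4], features_other[4])
```

A tone at the assumed modal frequency yields a much larger energy feature than a tone well outside the pass band, which is the property the classifier relies on. In the deployed system one such feature vector would be computed per barometer channel and passed to the tree ensemble.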
The tinyML model consumes just 9.068 mJ per inference and has a latency of just 579 ms, meaning a fault can be detected in less than 6 s after it has occurred. It requires 3 kB of RAM and 35 kB of flash memory. The confusion matrix in Figure 2 (b) shows that the model reliably classifies the crack class. The table in Figure 2 (c) shows the energy consumption and latency of the processing pipeline on the MCU. The energy feature requires the most energy and consumes the most time. However, without it the classification performance of the model drops significantly. While there are limitations to the supervised classification approach and to how well the wind tunnel data set represents real-world conditions, these results demonstrate that such damage classification is possible using only pressure sensors and onboard processing capabilities. The classifier has the potential to be applied as part of an event-based data acquisition strategy, where sensor node data is continuously monitored onboard and data is only transmitted to the remote processing system if a fault is suspected. Since the data transmission makes up a significant proportion of the energy consumption and latency of the Aerosense system (transmitting a single 5 s window of only the barometer data would require 119 mJ of energy and 1 s), such an approach would offer significant energy savings, without compromising the usefulness of the data collected.
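
The timing and energy figures above can be checked with a short calculation. It uses only numbers quoted in the abstract and assumes the worst case in which a fault appears at the very start of an acquisition window and that the 579 ms latency covers the whole onboard pipeline.

```python
WINDOW_SAMPLES = 512
SAMPLE_RATE_HZ = 100
INFERENCE_LATENCY_S = 0.579
INFERENCE_ENERGY_MJ = 9.068
TRANSMIT_ENERGY_MJ = 119.0     # one 5 s window of barometer data

window_s = WINDOW_SAMPLES / SAMPLE_RATE_HZ            # time to fill one window
detection_s = window_s + INFERENCE_LATENCY_S          # worst-case time to a decision
energy_ratio = TRANSMIT_ENERGY_MJ / INFERENCE_ENERGY_MJ

print(f"window: {window_s:.2f} s, detection: {detection_s:.3f} s")
print(f"transmitting costs about {energy_ratio:.1f}x the energy of one inference")
```

Under these assumptions, classifying a window onboard costs roughly a thirteenth of the energy needed to transmit it, which is the basis for the event-based acquisition strategy proposed above.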
REFERENCES
[1] Zonzini, F., Malatesta, M. M., Bogomolov, D., Testoni, N., Marzani, A., & De Marchi, L. (2020). Vibration-based SHM with upscalable and low-cost sensor networks. IEEE Transactions on Instrumentation and Measurement, 69(10), 7990-7998.
[2] Rezamand, M., Kordestani, M., Carriveau, R., Ting, D. S.-K., Orchard, M. E., & Saif, M. (2020). Critical wind turbine components prognostics: A comprehensive review. IEEE Transactions on Instrumentation and Measurement, 69(12), 9306-9328.
[3] Polonelli, T., Deparday, J., Abdallah, I., Barber, S., Chatzi, E., & Magno, M. (2023). Instrumentation and Measurement Systems: Aerosense: A Wireless, Non-Intrusive, Flexible, and MEMS-Based Aerodynamic and Acoustic Measurement System for Operating Wind Turbines. IEEE Instrumentation & Measurement Magazine, 26(4), 12-18.

7:00 pm to 10:00 pm

Social event

You need to be registered for the social event

9:00 am to 9:05 am

Welcome

Session Moderator: Martin CROOME, Vice President Marketing, GreenWaves

9:05 am to 9:50 am

Keynote by Diana Trojaniello - Luxottica SPA

Session Moderator: Martin CROOME, Vice President Marketing, GreenWaves

9:50 am to 10:20 am

Efficiency and Optimization in tinyML - Part II

Session Moderator: Andrea DUNBAR, Head of Sector Edge AI and Vision, CSEM

Streamlining ML Operations for Enhanced Collaboration and Efficiency

Alessandro GRANDE, Head of Product, Edge Impulse

Abstract (English)

In the era of expanding tinyML applications across diverse environments, there is a heightened need for robust and customizable models. This presentation introduces a novel MLOps solution, reshaping the ML workflow to enable teams to efficiently develop, deploy, and monitor models tailored using real-world field data.

Our approach simplifies the development of base ML models, which serve as a robust starting point for further customization. This MLOps solution empowers product engineering teams with a streamlined process for model development, validation, and refinement, specifically tailored to production environments. Emphasizing scalability and adaptability, our solution ensures efficient customization to meet the unique requirements of various use cases.

The presentation explores how our MLOps solution caters to field engineers deploying and monitoring models in real-world environments. Leveraging on-device testing capabilities, engineers can assess and customize the model for optimal functionality in dynamic settings. Once the model is in production this flow ensures continuous monitoring and refinement, allowing teams to track and fine-tune model performance over time, guaranteeing sustained accuracy and relevance throughout the operational life of ML models.

Join us to delve into the practical applications and tangible benefits of this transformative MLOps workflow for production and field engineering teams. Witness firsthand how this streamlined workflow enhances collaboration, accelerates deployment cycles, and maximizes the value derived from machine learning in real-world scenarios.

10:20 am to 10:50 am

Break & Networking

10:50 am to 11:40 am

Efficiency and Optimization in tinyML - Part III

Session Moderator: Andrea DUNBAR, Head of Sector Edge AI and Vision, CSEM

Deploying neural networks in sensors with near zero memory budget

Danilo PAU, Technical Director, IEEE & ST Fellow, System Research and Applications, STMicroelectronics

Abstract (English)

Sensors are evolving from pure measurement devices into devices that natively provide intelligent information for several industrial and consumer use cases. AI is the key to achieving this capability, but it is challenged by deployment constraints, driven mainly by the scarcity of embedded memory, which is essentially RAM-based and tightly integrated into the sensor itself. Careful use of this memory is crucial to enabling AI acceleration within the same power budget as the sensor itself. Moreover, machine learning engineers are required to deliver solutions at the highest level of productivity, which calls for tools that automate the design of machine learning algorithms. Neuton.AI and ST have devised a groundbreaking end-to-end methodology to address these challenges, providing practical solutions for machine learning developers focused on sensor computing by utilizing the ISPU (intelligent sensor processing unit) technology.
During the session, we will present three use cases illustrating the application of these technologies in practice:
• Hands-free interaction for smartwatch: the solution can recognize five gestures and consumes only 9.3 Kbyte of program RAM
• On-device package tracking for logistics: the solution can track seven package states and consumes only 9.3 Kbyte of program RAM
• Smart ring remote control: the solution can recognize eight gestures and consumes only 10 Kbyte of program RAM

6 years of open-source TinyML with emlearn

Jon NORDBY, CTO, Soundsensing

Abstract (English)

The availability of accessible, high-quality and open software is critical for the adoption of any new technology and paradigm. The area of TinyML is no exception. In 2018, when we started researching and developing in this area, there was not much software for Machine Learning on microcontrollers. This is why we started the emlearn project – an open-source Python library that allows converting scikit-learn and Keras models to efficient C code.
The goal of the project is to make it easy to deploy efficient models to any microcontroller with a C99 compiler, while keeping a Python-based workflow that is familiar to Machine Learning Engineers.
Over the years the library has been used in a wide range of applications: from detection of vehicles using acoustic sensor nodes, to tracking the wellbeing of grazing cows using accelerometers, to hand gesture recognition based on skin electrical activity, to real-time intrusion detection in IoT networks.
In this presentation, we will showcase a selection of use cases and discuss the impact that the library has had so far. We will also cover some of the latest improvements aimed at increasing its future impact, such as improved documentation, integrated tooling for model size evaluation, and support for MicroPython.
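
To make the idea of converting a trained model into C code concrete, here is a minimal sketch that emits a C99 function from a small decision tree. It is not emlearn's implementation or API: the nested-dict tree format, the helper names `tree_to_c` and `model_to_c`, and the thresholds are all hypothetical and chosen only for illustration.

```python
def tree_to_c(node, indent=1):
    """Recursively emit nested C if/else statements for a decision tree.

    Hypothetical node format (an assumption for this sketch):
      leaf:     {"label": int}
      internal: {"feature": int, "threshold": float, "left": ..., "right": ...}
    """
    pad = "    " * indent
    if "label" in node:
        return f"{pad}return {node['label']};\n"
    threshold = float(node["threshold"])
    code = f"{pad}if (features[{node['feature']}] < {threshold!r}f) {{\n"
    code += tree_to_c(node["left"], indent + 1)
    code += f"{pad}}} else {{\n"
    code += tree_to_c(node["right"], indent + 1)
    code += f"{pad}}}\n"
    return code


def model_to_c(tree, name="predict"):
    """Wrap the generated branches in a C99 function definition."""
    return f"int {name}(const float *features) {{\n{tree_to_c(tree)}}}\n"


# Hypothetical two-level tree over two input features.
example_tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 1.25,
        "left": {"label": 1},
        "right": {"label": 2},
    },
}

c_source = model_to_c(example_tree)
print(c_source)
```

The generated function uses only `float` comparisons and `return` statements, so any C99 compiler can build it without a runtime library, which is the property that makes this style of code generation attractive for microcontrollers.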

11:40 am to 12:40 pm

Posters and Demos

12:40 pm to 1:40 pm

Lunch & Networking

1:40 pm to 2:25 pm

Keynote by Francesco Conti - University of Bologna

Session Moderator: Tomas EDSÖ, Senior Principal Design Engineer, Arm

The Quest for Open-Source tinyML Heterogeneous Hardware Acceleration: A 10+ Year PULP Journey

Abstract (English)

In the last few years, our perception of what constitutes a “tinyML device” has shifted from simple microcontrollers to complex heterogeneous SoCs suited to execute DNNs directly at the extreme edge in real time and at minimal power cost. These devices provide the ultra-low latency and high energy efficiency necessary to meet the constraints of advanced use cases that cannot be satisfied by cloud solutions. However, how can tinyML hardware keep up with the evolution of the AI landscape, continuously pushing towards much larger and more complex models? The costs to develop new accelerators and Neural Processing Units for each evolutionary step in AI are hard to sustain. A possible way forward is given by the open-source model for digital hardware, popularized by RISC-V: multiple actors – both academic and industrial – collaborate on the development of digital technology that can benefit all parties. In this keynote, I present a 10+-year “quest” to push the performance and energy efficiency of tinyML further and further by exploiting a fully open-source model based on the PULP Platform initiative. I show how the open-source cooperative model makes it possible to combine different ideas and contributions in a technologically portable way, acting as an innovation catalyst and enabling the fast pace of evolution required to keep up with new ideas in AI within a tiny power budget.

2:25 pm to 3:20 pm

tinyML 4Good

Session Moderator: Thomas BASIKOLO, Programme Officer, ITU

Smart and Tiny UWB-radar Sensors: a Distributed Solution for Enhanced Home Care

Massimo PAVAN, PhD Student, Politecnico di Milano

Abstract (English)

As our global population ages, the demand for effective and personalized elderly care is on the rise. While societies struggle to find qualified people to perform this type of job, the technology for elderly care is advancing rapidly, integrating Machine Learning (ML) and robotic solutions to help ease the jobs of the carers [1]. The application of ML in elderly care, in particular, goes beyond mere innovation; it promises to enhance the quality of life for the elderly by providing personalized and proactive solutions to address their unique needs. Despite the promising results achieved by this type of technology, it is still being adopted at a slow pace. Among the factors that contribute to this are the privacy concerns raised by these solutions and the aversion that end users feel towards technologies that are perceived as limiting their personal freedom [2] [3].

The deployment of ML technologies, in fact, often involves the collection and analysis of vast amounts of personal data, including health records, daily activity patterns, and even biometric information. While these data-driven insights can contribute to personalized and effective care, they also raise serious privacy concerns. Finding the right balance between leveraging data for improved care and safeguarding individual privacy is a critical challenge that needs thoughtful and robust solutions.

At the same time, a crucial aspect in the design of technological solutions is how they are perceived by seniors. To be used and accepted by the end user, technologies in this field need to be non-invasive, discreet and respectful of each person's habits and lifestyle. Wearables and smart cameras are examples of technologies that have met this kind of rejection from the elderly population: wearables must be carried constantly by the user in order to work, while cameras are often perceived as too invasive of the end user's privacy.
In this context, we propose an innovative UWB-radar-based distributed solution employing in-sensor TinyML, with the objective of enhancing privacy while taking into account the perception of older users.

Ultra-wide-band (UWB) is an innovative radar technology mainly used for short-range communication and imaging [4]. UWB radar systems typically operate at low power levels and work on a wide frequency spectrum, enabling high resolution and precision in applications like imaging and localization. UWB technology is of particular interest in the context of home care since images produced by this type of sensor do not allow for recognition of the identity of a target, but are precise enough to track their activities. UWB-radar devices employing this type of sensor can be very discreet, considering they can even be embedded inside walls. Nevertheless, when applied to embedded devices, UWB radar produces high-dimensional and noisy data that require deep analysis and processing in order to be fully exploited. For this reason, TinyML [5] is used for the in-sensor analysis [6] of UWB-radar data, making the solution presented in this paper the first smart UWB-radar sensor.

Meet the Acoustic Smart Weather Station Aurora

Jona BEYSENS, R&D Engineer, CSEM

Abstract (English)

In this talk, we will present Aurora, a low-power, intelligent, low-cost, and easy-to-install weather station. By listening to the surrounding environment, Aurora estimates the rain and wind intensity without the need for any mechanical moving part. To accomplish this, Aurora collects audio signals with an integrated microphone and processes this data through ultra-low-power Machine Learning (tinyML) at the edge. Aurora aims to unleash the potential of tinyML for positive change by creating a reliable, maintenance-free and low-cost weather station.

3:20 pm to 3:50 pm

Break & Networking

3:50 pm to 4:50 pm

Panel - Challenges of tinyML Applications

Session Moderator: Martin CROOME, Vice President Marketing, GreenWaves

6:30 pm to 9:00 pm

VIP Dinner Networking

You need to be registered for the dinner

9:00 am to 9:05 am

Welcome

Session Moderator: Hajar MOUSANNIF, Associate Professor, Cadi Ayyad University, Morocco

9:05 am to 9:50 am

Keynote from Bosch

Session Moderator: Hajar MOUSANNIF, Associate Professor, Cadi Ayyad University, Morocco

Between two worlds, high performance and ultra low power AI: challenges, trends, and solutions

Abstract (English)

Neural networks have emerged in the last few years as an effective candidate for various tasks across a wide range of products. From tiny devices all the way up to autonomous driving, one of the main concerns is energy efficiency, which goes hand in hand with the environmental footprint. Increasing performance requirements and energy efficiency, together with the rapid development of new neural network models and architectures, are some of the main challenges. Leveraging achievements from different disciplines, like emerging memory technologies, NPU architectures and hardware-aware optimization frameworks, can be an enabler to utilize new and powerful neural networks in the embedded domain. In this talk, the challenges and trends will be explored and discussed from the perspectives of industrial viability and research promise.

9:50 am to 10:20 am

Efficiency and Optimization in tinyML - Part IV

Session Moderator: Dirk STANEKER, Group Leader, Bosch Sensortec GmbH

AutoStreamLib: Efficient execution of Temporal Convolutional Networks through Seamless Transformation from Non-Streaming to Streamable Inference

Seyed Ahmad MIRSALAR, PhD Student, University of Bologna

10:20 am to 10:50 am

Break & Networking

10:50 am to 11:15 am

TBA

11:15 am to 12:15 pm

Demos & Posters

12:15 pm to 1:15 pm

Lunch & Networking

1:15 pm to 2:35 pm

Hardware Acceleration and Model Compression - Part II

Session Moderator: Sumeet KUMAR, CEO, Innatera

Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO

Michele MAGNO, Head of the Project-based learning Center, ETH Zurich, D-ITET

Abstract (English)

Smart glasses are rapidly gaining advanced functionality thanks to cutting-edge computing technologies, accelerated hardware architectures, and tiny Artificial Intelligence (AI) algorithms. Integrating AI into smart glasses featuring a small form factor and limited battery capacity is still challenging when targeting full-day usage for a satisfactory user experience. This paper illustrates the design and implementation of tiny machine-learning algorithms exploiting novel low-power processors to enable prolonged continuous operation in smart glasses. We explore the energy and latency efficiency of smart glasses in the case of real-time object detection. To this end, we designed a smart glasses prototype as a research platform featuring two microcontrollers, including a novel milliwatt-power RISC-V parallel processor with a hardware accelerator for visual AI, and a Bluetooth low-power module for communication. The smart glasses integrate power cycling mechanisms, including image and audio sensing interfaces. Furthermore, we developed a family of novel tiny deep-learning models based on YOLO with sub-million parameters customized for microcontroller-based inference, dubbed TinyissimoYOLO v1.3, v5, and v8, aiming at benchmarking object detection with smart glasses for energy and latency. Evaluations on the prototype of the smart glasses demonstrate TinyissimoYOLO’s 17ms inference latency and 1.59mJ energy consumption per inference while ensuring acceptable detection accuracy. Further evaluation reveals an end-to-end latency from image capturing to the algorithm’s prediction of 56ms, or equivalently 18 frames per second (FPS), with a total power consumption of 62.9mW, equivalent to 9.3 hours of continuous run time on a 154mAh battery. These results outperform MCUNet (TinyNAS+TinyEngine), which runs a simpler task (image classification) at just 7.3 FPS.
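
The headline figures in this abstract can be cross-checked with simple arithmetic, sketched below. The latency, power, energy and battery capacity are taken from the abstract; the nominal battery voltage of 3.8 V is an assumption made here (the abstract does not state it), and the computed run time scales directly with that value.

```python
# Figures quoted in the abstract.
end_to_end_latency_ms = 56.0   # image capture to prediction
total_power_mw = 62.9          # total power consumption
battery_capacity_mah = 154.0   # battery capacity
inference_latency_ms = 17.0    # single inference
inference_energy_mj = 1.59     # energy per inference

# ASSUMPTION: nominal cell voltage; not stated in the abstract.
ASSUMED_BATTERY_VOLTAGE_V = 3.8

# Frame rate is the reciprocal of the end-to-end latency.
frames_per_second = 1000.0 / end_to_end_latency_ms

# Battery energy in mWh divided by average power in mW gives hours.
battery_energy_mwh = battery_capacity_mah * ASSUMED_BATTERY_VOLTAGE_V
run_time_hours = battery_energy_mwh / total_power_mw

# mJ divided by ms is W; multiply by 1000 to express it in mW.
inference_power_mw = inference_energy_mj / inference_latency_ms * 1000.0

print(f"frame rate: {frames_per_second:.1f} FPS")
print(f"run time: {run_time_hours:.1f} h at {ASSUMED_BATTERY_VOLTAGE_V} V")
print(f"average power during inference: {inference_power_mw:.1f} mW")
```

Under the 3.8 V assumption the run time comes out at roughly 9.3 hours, in line with the abstract; a 3.7 V cell would give closer to 9.1 hours.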

Sensorless Pattern Recognition of PMSM/BLDC Motors against Aerodynamic, Uncontrollability and Asymmetrical Load Influence at the Extreme Edge using Intelligent Neural Processing

Subharshi ROY, Associate Software Engineer, STMicroelectronics Pvt. Ltd.

Abstract (English)

This paper presents an innovative framework designed to diagnose and analyze electric motor-actuated control systems against external and internal influences, such as aerodynamic influence, uncontrollability due to control parameter mismatch, and asymmetric load, using only the current and voltage measurements employed in traditional motor control algorithms at the extreme edge. No additional physical inertial, motion or microphone-based sensors are employed. The key component of this paper is an edge-enabled, deep learning-based sensorless state estimation/pattern recognition methodology that harvests the electrical control signatures of the Field Oriented Control (FOC) motor control algorithm as pre-feature-engineered data to diagnose the motor’s health and operating conditions against three influences. The first is aerodynamic influence: in this case, a BLDC/PMSM motor coupled with a quadcopter propeller is subjected to a non-linear dynamic disturbance caused aerodynamically by another, external BLDC/PMSM motor. The second is the uncontrollability effect, which originates from misadjusted control parameters and causes minor or major uncontrollability-induced performance issues that affect long-term motor health; operating in this condition gives rise to dynamic, non-deterministic, irregular patterns. The third is asymmetrical load, which gives rise to rotary misalignment and contributes to motor health degradation, leading to abnormal vibrations. These external influences on the electric motor in operation affect the electrical signatures employed by the Field Oriented Control algorithm.
The core setup employs a conventional motor drive system (STDES-PTOOL3A) equipped with an STM32G4 Arm Cortex-M4 microcontroller, designed to control the BLDC/PMSM motor. The electrical signatures originate from the conventional motor driver’s control algorithm, such as Field Oriented Control (FOC), provided by existing frameworks such as the STMicroelectronics MCSDK. Field-oriented control depends on sensing the 3-phase currents and estimating the rotor position. The algorithm performs a 3-phase to 2-phase AC (Clarke) transformation and a Stationary D-Q (Park) transformation. Closed-loop control is performed to orient and control the torque-producing q-axis current and minimize the d-axis current. On top of this, the controller can sense the back-EMFs produced in the motor windings. All these datasets are used for pattern identification and classification of the external and internal influences on the electric motor in operation, thereby diminishing the need for any additional inertial, motion or audio-based sensors.
Platforms with more computational capability are used for the analysis and training of these datasets. However, the traditional controller can utilize the learning to implement control as well as classification algorithms on the traditional edge device. These control parameters take the major load of performing feature engineering. The system makes use of the external influences on the electric motor in operation to understand the resulting dynamic behaviour in real time, which enriches unique signal patterns and hence diminishes the need for any inertial, motion or audio-based sensor. Also, the system uses existing hardware in operation and existing parameters for pattern recognition and anomaly detection. The ability to infer from the external aerodynamic influence acting upon the motor in application enables a concept in which the motor in the application itself acts as a sensor. Similarly, any anomaly or effect that causes the motor to deviate from healthy operation can be detected early, before it causes any major damage, thanks to the ability of the motor driver to perform real-time intelligent neural inference at low latency and high efficiency. A quantization-aware-trained Deep Convolutional Neural Network (DCNN) algorithm running on the extreme-edge motor drive controller constitutes the cornerstone of this sensorless estimation process. The FOC algorithm running on the PTOOL3A motor drive system is indispensable for providing stable operating conditions. The developed intelligent algorithm runs in parallel on the same controller to diagnose and classify the motor’s dynamic responses to external influences, enhancing the application of Edge AI for condition monitoring and pattern recognition. The deep CNN architecture was implemented using custom Keras layers with a TensorFlow backend, with extensive experimentation to tune the hyperparameters so that it performs flexibly across different data patterns with efficient, high-precision pattern recognition. A set of experiments with multiple data patterns was conducted to balance low latency and high precision on the edge device, recognizing anomalies and patterns with a remarkable level of Precision, Recall and F1-Score while leveraging parameters used by the traditional motor control algorithm. The utilization of electric motors and motor controllers spans a vast array of applications, demonstrating their versatility and significance. The prospect of advancing these technologies so that the motor can function as a sensor equipped with intelligent observability capabilities holds the potential to substantially contribute to the ongoing cyber-physical industrial revolutions.
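
The Clarke and Park transformations mentioned in this abstract can be sketched in a few lines. This is a textbook illustration, not the MCSDK implementation: the amplitude-invariant scaling, the sign convention of the rotation, and the function names are assumptions chosen for this example.

```python
import math


def clarke(i_a, i_b, i_c):
    """3-phase to 2-phase transform (amplitude-invariant form, an assumption)."""
    alpha = (2.0 / 3.0) * (i_a - 0.5 * i_b - 0.5 * i_c)
    beta = (i_b - i_c) / math.sqrt(3.0)
    return alpha, beta


def park(alpha, beta, theta):
    """Rotate the stationary alpha/beta frame by the rotor angle theta."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q


def balanced_currents(amplitude, theta):
    """Hypothetical test signal: balanced 3-phase currents at angle theta."""
    shift = 2.0 * math.pi / 3.0
    return (
        amplitude * math.cos(theta),
        amplitude * math.cos(theta - shift),
        amplitude * math.cos(theta + shift),
    )


# For a balanced set aligned with theta, the sinusoidal phase currents
# collapse into constant d/q values, which is what makes them convenient
# inputs for closed-loop control and for downstream classifiers.
for theta in (0.0, 0.7, 2.0, 4.5):
    d, q = park(*clarke(*balanced_currents(2.0, theta)), theta)
    print(f"theta={theta:.1f} rad -> d={d:.3f}, q={q:.3f}")
```

The test signal here is aligned with the d-axis purely so that the result is easy to check; in an actual drive the controller regulates the q-axis current for torque, as the abstract describes.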

Accelerating Quantized DNN Inference through Precision-Scalable Multipliers Integrated in Hardware Accelerators and RISC-V Processors

Luca URBINATI, PhD Student, Politecnico di Torino

Abstract (English)

Research Context and Motivation: Deep Learning (DL) algorithms, crucial for a myriad of applications, heavily rely on Multiply-and-Accumulate (MAC) units to perform essential operations such as convolutions and matrix multiplications. Furthermore, since many DL applications require low inference time, faster MAC units and hardware accelerators are needed. In the context of edge devices, there is also a pressing need for quantization of activations and weights to reduce energy consumption and inference latency.

Addressed Research Questions/Problems: Our work delves into the realm of Mixed-Precision Quantization (MPQ), which explores the optimal bit precision for the activations and weights of Deep Neural Network (DNN) layers. Practical implementation of MPQ demands precision-scalable (PS) and reconfigurable hardware. Despite the many PS alternatives [1], only a few exploit Sum-Together (ST) and Sum-Apart (SA) multipliers (Fig. 1A) in DNN hardware accelerators [1] and RISC-V cores [2].

Novel Contributions:
1. We conducted a comprehensive comparison of the state-of-the-art ST multipliers, evaluating their power, performance and area (PPA) characteristics through a design space exploration (DSE).
2. We expanded the portfolio of accelerators based on ST multipliers [1], proposing three novel implementations: 2D-Convolution (2D-Conv), Depth-wise Convolution (DW-Conv), and Fully-Connected (FC). High-Level Synthesis (HLS) is employed for the first time to generate PS hardware accelerators.
3. Until now SA and ST multipliers have been proposed as alternative implementations. Instead, we proposed a new Sum-Together/Apart Reconfigurable (STAR) multiplier, capable of operating either in SA or ST mode. Based on a Baugh-Wooley (BW) architecture (Fig. 1B), its partial product matrix (Fig. 2B) can be configured to execute: N=1 scalar multiplication, or N=2, 4 parallel low-precision multiplications for both SA and ST operating modes, with 16/N-bit operands (Tab. 1B). STAR enables a more efficient utilization of hardware resources, by dynamically sharing them to perform different tasks, such as both 2D-Conv and DW-Conv in a single hardware accelerator.
4. We are the first to embed STAR in the MAC unit of a low-end RISC-V CPU. We replaced the default 16-bit multiplier inside the MULT/DIV unit of the Ibex processor [4] and added new MAC instructions: standard 32-bit and 16/8/4-bit MAC operations in ST/SA mode.
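
As a behavioural illustration of the two operating modes, the sketch below splits a pair of 16-bit words into N signed sub-words of 16/N bits and either keeps the N products separate (SA) or adds them into a single result (ST). This reflects the usual meaning of the terms as understood here and is only a software model under that assumption; it does not describe the STAR partial-product matrix or any hardware detail, and all function names are hypothetical.

```python
def split_signed(word, n, total_bits=16):
    """Split a packed word into n two's-complement sub-words, LSB first."""
    if n not in (1, 2, 4):
        raise ValueError("n must be 1, 2 or 4")
    width = total_bits // n
    mask = (1 << width) - 1
    fields = []
    for i in range(n):
        raw = (word >> (i * width)) & mask
        # Reinterpret the raw field as a signed two's-complement value.
        if raw >= 1 << (width - 1):
            raw -= 1 << width
        fields.append(raw)
    return fields


def pack_signed(values, total_bits=16):
    """Pack signed values into one word, LSB first (inverse of split_signed)."""
    width = total_bits // len(values)
    mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * width)
    return word


def multiply_sa(a, b, n):
    """Sum-Apart model: n independent low-precision products."""
    return [x * y for x, y in zip(split_signed(a, n), split_signed(b, n))]


def multiply_st(a, b, n):
    """Sum-Together model: the n products are accumulated into one value."""
    return sum(multiply_sa(a, b, n))


# Four 4-bit operands per word (N=4).
a = pack_signed([3, -2, 5, -7])
b = pack_signed([1, 4, -3, 2])
print(multiply_sa(a, b, 4))  # separate products
print(multiply_st(a, b, 4))  # dot-product-style sum
```

In this model, ST mode matches the dot-product pattern of 2D-Conv and FC layers, while SA mode matches layers such as DW-Conv where each product belongs to a different output channel, which is consistent with the layer-to-mode pairing reported in the results.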

Adopted Methodologies and Results:
1. The DSE of ST multipliers, performed varying the clock frequency from 100 to 1000 MHz and synthesizing the multipliers on a 28-nm 0.9V CMOS technology, established the Pareto optimality of the Booth ST multiplier in terms of area across all frequencies (Fig. 2A).
2. We made a wide DSE for each ST accelerator, using our HLS flow (Fig. 3A) and exploring many hardware parameters (e.g. the MAC units parallelism). The DSE enables designers to choose the optimal accelerator for their PPA targets.
3. We showcased the advantages of ST-based accelerators when integrated into System-on-Chips (SoCs) in three different scenarios (low-area, low-power, and low-latency). The case study involved running inference on MLPerf Tiny models quantized in mixed-precision (MP). The results showed a significant latency and energy reduction with negligible area overhead (0.9%–8.0%) across the three scenarios, when compared to SoCs with accelerators based on fixed-precision 16-bit multipliers (Tab. 1A).
4. At the cost of limited overhead in area (<10%) and power (<3%) compared to the original Ibex, our modified Ibex showed an acceleration up to 4.5x for FC and 3x for 2D-Conv layers, configuring STAR MAC in ST mode, and up to 2.3x for DW-Conv layers, configuring STAR MAC in SA mode.

Future Work: In the future, we plan to develop new hardware accelerators harnessing the potential of STAR multipliers to accelerate MP-quantized DL workloads.

2:35 pm to 3:05 pm

Break & Networking

3:05 pm to 4:00 pm

On Device Learning

Session Moderator: Theocharis THEOCHARIDES, Associate Professor, University of Cyprus

Training on the Fly: On-device Self-supervised Learning aboard Nano-UAVs within 20mW

Elia CEREDA, PhD Student, IDSIA, USI-SUPSI

Abstract (English)

Pocket-sized autonomous unmanned aerial vehicles (UAVs) leveraging tiny machine learning (TinyML) perception pipelines represent an upcoming technology with many valuable applications in the Internet-of-Things domain. Thanks to their tiny form factor (i.e., ∼10 cm diameter), they can fly and land in small/cluttered spaces and safely operate near humans, acting as universal/ubiquitous dynamic smart sensors. However, TinyML algorithms for small UAVs equipped with onboard ultra-low power class processors suffer from the inherent domain shift problem, i.e., perception performance drops when moving from the training domain to a different deployment/testing one. To cope with and mitigate this general problem, we present a novel on-device fine-tuning approach that relies only on ultra-limited memory and computational capabilities. We build our study, analysis, deployment, and field testing on top of a real-world robotic application of vision-based human pose estimation. Our 512-image on-device training requires only 19 mW, 1 MB of memory, and runs in only 510 ms (five epochs), employing a GWT GAP9 System-on-Chip. Finally, we propose a self-supervised technique based on an ego-motion consistency loss to handle the absence of training labels aboard our tiny UAV. In-field results of our closed-loop systems show an improvement of up to 26% in the horizontal mean pose error vs. a non-fine-tuned baseline, and compared to previous State-of-the-Art work, our on-device learning makes the difference between failing and succeeding in a never-before-seen challenging environment.

Unleashing the Potential of TinyML Directly on Intelligent Sensors

Ahmed BENMESSAOUD, Researcher, Innovation Academy Mila

Abstract (English)

The year 2023 was a key year for tinyML, unleashing a new age of intelligent sensors that push intelligence from the MCU to the source of the data at the sensor level, enabling them to run sophisticated algorithms and machine learning models in real time. This presentation delves into the implementation of the TinyML solution provided by the team of Innovation Academy, a non-profit in Algeria, North Africa, that won first place in the IEEE COINS 2023 Contest for In-Sensor Machine Learning Computing. The solution is based on the STMicroelectronics Intelligent Sensor Processing Unit (ISPU), coupled with a highly optimized machine learning model developed using the Neuton.ai tinyML platform, to create an ultra-tiny general-use-case human activity recognition (HAR) model. This application showcases the industrial potential and feasibility of tinyML solutions implemented directly on the sensor, which can lead to a revolution in the industrial vibration monitoring and predictive maintenance domain. In this study, we created a general-use-case HAR model spanning up to 24 classes, achieving an accuracy of 83.16% on the test set while using only 2 KB of stack memory, thanks to the capacity of the platform. The implemented approach permits running DSP feature algorithms and complex machine learning models written in C directly on the ISPU core, leaving room for the MCU to perform other tasks. Moreover, the ISPU operates at ultra-low power, using only 0.5 mA, pushing the benefits of TinyML to their limits. This application demonstrates the feasibility of running tinyML model inference directly on the sensor, highlighting its ability to operate within the limited computational resources on the sensor while still delivering high performance and power efficiency.
This opens up new possibilities for the industrial application of tinyML, such as industrial vibration monitoring and predictive maintenance.

4:00 pm to 4:05 pm

Closing

Schedule subject to change without notice.

Committee

Hajar MOUSANNIF

Chair

Cadi Ayyad University, Morocco

Martin CROOME

Co-chair

GreenWaves

Thomas BASIKOLO

ITU

Andrea DUNBAR

CSEM

Tomas EDSÖ

Arm

Jack FERRARI

MathWorks

Evgeni GOUSEV

Qualcomm Research, USA

Alessandro GRANDE

Edge Impulse

Eiman KANJO

Imperial College London

Sumeet KUMAR

Innatera

Manuel ROVERI

Politecnico di Milano

Dirk STANEKER

Bosch Sensortec GmbH

Theocharis THEOCHARIDES

University of Cyprus

Valeria TOMASELLI

STMicroelectronics

Speakers

Alessandro CREMONESI

Keynote - Monday Morning

STMicroelectronics

Vijay JANAPA REDDI

Keynote - Monday Afternoon

Harvard University

Francesco CONTI

Keynote - Tuesday Afternoon

University of Bologna, Italy

Taha SOLIMAN

Keynote - Wednesday Morning

Robert Bosch GmbH

Ahmed BENMESSAOUD

Innovation Academy Mila

Jona BEYSENS

CSEM

Philippe BICH

Politecnico di Torino

Elia CEREDA

IDSIA, USI-SUPSI

Michael GIBBS

Nottingham Trent University

Alessandro GRANDE

Edge Impulse

Katarzyna KOZDON

Innatera

Michele MAGNO

ETH Zurich, D-ITET

Denis MIKHAYLOV

D-ITET, ETH Zürich

Seyed Ahmad MIRSALAR

University of Bologna

Julian MOOSMANN

ETH Zürich

Jon NORDBY

Soundsensing

Francesco PAISSAN

Fondazione Bruno Kessler (FBK)

Danilo PAU

STMicroelectronics

Massimo PAVAN

Politecnico di Milano

Subharshi ROY

STMicroelectronics Pvt. Ltd.

Charbel RIZK

Oculi

Luca URBINATI

Politecnico di Torino

Antoni WOSS

MathWorks
