tinyML Talks: The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset

This talk will present the Multilingual Spoken Words Corpus (MSWC), a speech dataset of over 340,000 spoken words in 50 languages, with over 23 million audio examples. MSWC has many use cases, ranging from voice-enabled consumer devices to call center automation. The dataset is CC-BY licensed and free for academic research and commercial use. We will introduce applications of MSWC for few-shot keyword spotting and spoken term search tasks in low-resource languages, and share a brief tutorial on getting started with the dataset. We will also discuss how we automated the construction of our dataset and our self-supervised approach for detecting outlier samples.

Date

December 14, 2021

Location

Virtual

Contact us

Write an Email

Discussion

Forum

Schedule

Timezone: PST

The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset

Mark MAZUMDER, PhD Student

Harvard University

Mark Mazumder is a PhD student in Vijay Janapa Reddi’s group at Harvard University. His research interests are in efficient machine learning techniques for small datasets. Prior to joining Harvard, Mark was an Associate Staff member at MIT Lincoln Laboratory, where he performed research in computer vision and robotics.

Schedule subject to change without notice.
