This talk will present the Multilingual Spoken Words Corpus (MSWC), a speech dataset of over 340,000 spoken words in 50 languages, with over 23 million audio examples. MSWC has many use cases, ranging from voice-enabled consumer devices to call center automation. The dataset is CC-BY licensed and free for academic research and commercial use. We will introduce applications of MSWC for few-shot keyword spotting and spoken term search tasks in low-resource languages, and share a brief tutorial on getting started with the dataset. We will also discuss how we automated the construction of our dataset and our self-supervised approach for detecting outlier samples.
Schedule
Timezone: PST
The Multilingual Spoken Words Corpus, a Massive Keyword Spotting Dataset
Mark MAZUMDER, PhD Student
Harvard University
This talk will present the Multilingual Spoken Words Corpus (MSWC), a speech dataset of over 340,000 spoken words in 50 languages, with over 23 million audio examples. MSWC has many use cases, ranging from voice-enabled consumer devices to call center automation. The dataset is CC-BY licensed and free for academic research and commercial use. We will introduce applications of MSWC for few-shot keyword spotting and spoken term search tasks in low-resource languages, and share a brief tutorial on getting started with the dataset. We will also discuss how we automated the construction of our dataset and our self-supervised approach for detecting outlier samples.
Mark MAZUMDER, PhD Student
Harvard University
Mark Mazumder is a PhD student in Vijay Janapa Reddi’s group at Harvard University. His research interests are in efficient machine learning techniques for small datasets. Prior to joining Harvard, Mark was an Associate Staff member at MIT Lincoln Laboratory, where he performed research in computer vision and robotics.
Schedule subject to change without notice.