Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech

Nancy J. Delong

Learning to understand grounded language, that is, language that occurs in the context of, and refers to, the broader world, is a well-established research area in robotics. The vast majority of current work in this area, however, operates on textual data, which limits the ability to deploy agents in realistic environments.

Digital analysis of end-user speech (or raw speech) is a vital component in robotics. Image credit: Kaufdex via Pixabay, free license

A new article published on arXiv.org proposes to learn grounded language directly from end-user speech using a comparatively small number of data points, instead of relying on intermediate textual representations.

The authors provide a comprehensive study of natural language grounding from raw speech to robotic sensor data of everyday objects using state-of-the-art speech representation models. Their analysis of the audio and speech attributes of individual participants demonstrates that learning directly from raw speech improves performance for users with accented speech compared to relying on automatic transcriptions.

Learning to understand grounded language, which connects natural language to percepts, is a critical research area. Prior work in grounded language acquisition has focused primarily on textual inputs. In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs. This will allow interactions in which language about novel tasks and environments is learned from end users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems. We leverage recent work in self-supervised speech representation models and show that learned representations of speech can make language grounding systems more inclusive towards specific groups while maintaining or even improving general performance.
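The paper itself should be consulted for the actual architecture (see the link below). Purely as an illustration of the general idea the abstract describes — projecting learned speech representations and visual percept features into a shared space and training them to agree on matched pairs — here is a minimal, hypothetical NumPy sketch. The feature dimensions, the random stand-in features, and the contrastive objective are all assumptions for illustration, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pre-extracted features: rows are paired examples, i.e.
# speech_feats[i] describes the same object as vision_feats[i].
speech_feats = rng.normal(size=(4, 768))   # e.g. self-supervised speech embeddings
vision_feats = rng.normal(size=(4, 512))   # e.g. visual percept features

# Learnable linear projections into a shared grounding space (dim 128).
W_s = rng.normal(scale=0.01, size=(768, 128))
W_v = rng.normal(scale=0.01, size=(512, 128))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def similarity_matrix(speech, vision):
    # Cosine similarity between every (speech, vision) pair.
    return l2_normalize(speech @ W_s) @ l2_normalize(vision @ W_v).T

def contrastive_loss(sim, temperature=0.1):
    # A cross-modal contrastive objective: push each matched pair
    # (the diagonal) above the mismatched pairs in its row.
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

sim = similarity_matrix(speech_feats, vision_feats)
print(sim.shape)                       # (4, 4) pairwise similarities
print(contrastive_loss(sim) > 0.0)     # True: loss is a positive scalar
```

In a real system the projections would be trained by gradient descent on many such pairs, and the speech features would come from a pretrained self-supervised model rather than random vectors.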

Research paper: Youssouf Kebe, G., Richards, L. E., Raff, E., Ferraro, F., and Matuszek, C., "Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech", 2021. Link: https://arxiv.org/abs/2112.13758
