Learning to understand grounded language—language that occurs in the context of, and refers to, the broader world—is a common area of research in robotics. The vast majority of current work in this area, however, operates on textual data, which limits the ability to deploy agents in realistic environments.
A new article published on arXiv.org proposes learning grounded language directly from end-user speech using a comparatively small number of data points, rather than relying on intermediate textual representations.
The authors provide a comprehensive examination of natural language grounding from raw speech to robot sensor data of everyday objects using state-of-the-art speech representation models. Their analysis of the audio and speech characteristics of individual participants demonstrates that learning directly from raw speech improves performance for users with accented speech, compared to relying on automatic transcriptions.
Learning to understand grounded language, which connects natural language to percepts, is a critical research area. Prior work in grounded language acquisition has focused primarily on textual inputs. In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs. This will allow interactions in which language about novel tasks and environments is learned from end users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems. We leverage recent work in self-supervised speech representation models and show that learned representations of speech can make language grounding systems more inclusive towards specific groups while maintaining or even increasing general performance.
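As a rough sketch of the general idea (not the authors' implementation), grounding paired percepts and speech can be framed as matching a speech embedding against object percept embeddings in a shared space, for example by cosine similarity. The function names, dimensions, and toy vectors below are invented for illustration; in practice the embeddings would come from a learned speech representation model and a visual encoder.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_utterance(speech_emb: np.ndarray,
                     object_embs: dict) -> str:
    """Return the object whose percept embedding best matches the utterance."""
    return max(object_embs,
               key=lambda name: cosine_similarity(speech_emb, object_embs[name]))

# Toy, hand-made embeddings standing in for learned representations.
rng = np.random.default_rng(0)
mug_emb = rng.normal(size=8)
banana_emb = rng.normal(size=8)

# Pretend the speech encoder mapped an utterance about the mug
# to a point near the mug's percept embedding.
speech_emb = mug_emb + 0.1 * rng.normal(size=8)

objects = {"mug": mug_emb, "banana": banana_emb}
print(ground_utterance(speech_emb, objects))
```

Because the (synthetic) speech embedding was constructed as a small perturbation of the mug's percept embedding, the nearest-neighbor lookup selects "mug". Real systems learn this alignment from paired data rather than assuming it.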
Research paper: Youssouf Kebe, G., Richards, L. E., Raff, E., Ferraro, F., and Matuszek, C., "Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech", 2021. Link: https://arxiv.org/abs/2112.13758